All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v5 00/18] vfio-user server in QEMU
@ 2022-01-19 21:41 Jagannathan Raman
  2022-01-19 21:41 ` [PATCH v5 01/18] configure, meson: override C compiler for cmake Jagannathan Raman
                   ` (18 more replies)
  0 siblings, 19 replies; 99+ messages in thread
From: Jagannathan Raman @ 2022-01-19 21:41 UTC (permalink / raw)
  To: qemu-devel
  Cc: eduardo, elena.ufimtseva, john.g.johnson, berrange, bleal,
	john.levon, mst, armbru, quintela, f4bug, marcandre.lureau,
	stefanha, thanos.makatos, pbonzini, jag.raman, eblake, dgilbert

Hi,

Thank you for taking the time to provide a comprehensive feedback
of the last series of patches. We have addressed all the comments.

We are posting this v5 of the series, which incorporates all the
feedback. Kindly share your feedback for this latest series

We added the following patches to the series:
  - [PATCH v5 03/18] pci: isolated address space for PCI bus
  - [PATCH v5 04/18] pci: create and free isolated PCI buses
  - [PATCH v5 05/18] qdev: unplug blocker for devices
  - [PATCH v5 06/18] vfio-user: add HotplugHandler for remote machine
  - [PATCH v5 07/18] vfio-user: set qdev bus callbacks for remote machine

We made the following changes to the existing patches:

[PATCH v5 09/18] vfio-user: define vfio-user-server object
  - renamed object class member 'daemon' as 'auto_shutdown'
  - set VfioUserServerProperties version to 6.3
  - use SocketAddressType_str to compose error message
  - refuse setting 'socket' and 'device' properties after server starts
  - added VFU_OBJECT_ERROR macro to report error

[PATCH v5 10/18] vfio-user: instantiate vfio-user context
  - set error variable to NULL after transferring ownership with
    error_propagate()

[PATCH v5 11/18] vfio-user: find and init PCI device
  - block hot-unplug of PCI device when it is attached to the server object

[PATCH v5 12/18] vfio-user: run vfio-user context
  - emit a hangup event to the monitor when the client disconnects
  - reset vfu_poll_fd member and disable FD handler during finalize
  - add a comment to explain that attach could block
  - use VFU_OBJECT_ERROR instead of setting error_abort

[PATCH v5 14/18] vfio-user: handle DMA mappings
  - use pci_address_space() to access device's root memory region
  - given we're using one bus per device, mapped memory regions get
    destroyed automatically when device is unplugged

[PATCH v5 15/18] vfio-user: handle PCI BAR accesses
  - use pci_isol_as_io() & pci_isol_as_mem() to access the device's
    PCI/CPU address space. This simultaneously fixes the AddressSpace
    issue noted in the last review cycle

[PATCH v5 16/18] vfio-user: handle device interrupts
  - setting own IRQ handlers for each bus
  - renamed vfu_object_dev_table to vfu_object_dev_to_ctx_table
  - indexing into vfu_object_dev_to_ctx_table with device's
    address pointer instead of devfn
  - not looking up before removing from table

[PATCH v5 17/18] vfio-user: register handlers to facilitate migration
  - use VFU_OBJECT_ERROR instead of setting error_abort

We dropped the following patch from previous series:
  - vfio-user: IOMMU support for remote device

Thank you very much!

Jagannathan Raman (18):
  configure, meson: override C compiler for cmake
  tests/avocado: Specify target VM argument to helper routines
  pci: isolated address space for PCI bus
  pci: create and free isolated PCI buses
  qdev: unplug blocker for devices
  vfio-user: add HotplugHandler for remote machine
  vfio-user: set qdev bus callbacks for remote machine
  vfio-user: build library
  vfio-user: define vfio-user-server object
  vfio-user: instantiate vfio-user context
  vfio-user: find and init PCI device
  vfio-user: run vfio-user context
  vfio-user: handle PCI config space accesses
  vfio-user: handle DMA mappings
  vfio-user: handle PCI BAR accesses
  vfio-user: handle device interrupts
  vfio-user: register handlers to facilitate migration
  vfio-user: avocado tests for vfio-user

 configure                                  |   21 +-
 meson.build                                |   44 +-
 qapi/misc.json                             |   23 +
 qapi/qom.json                              |   20 +-
 include/hw/pci/pci.h                       |   12 +
 include/hw/pci/pci_bus.h                   |   17 +
 include/hw/qdev-core.h                     |   21 +
 include/migration/vmstate.h                |    2 +
 migration/savevm.h                         |    2 +
 hw/pci/msi.c                               |   13 +-
 hw/pci/msix.c                              |   12 +-
 hw/pci/pci.c                               |  186 ++++
 hw/pci/pci_bridge.c                        |    5 +
 hw/remote/machine.c                        |   86 ++
 hw/remote/vfio-user-obj.c                  | 1019 ++++++++++++++++++++
 migration/savevm.c                         |   73 ++
 migration/vmstate.c                        |   19 +
 softmmu/qdev-monitor.c                     |   74 +-
 .gitlab-ci.d/buildtest.yml                 |    2 +
 .gitmodules                                |    3 +
 Kconfig.host                               |    4 +
 MAINTAINERS                                |    3 +
 hw/remote/Kconfig                          |    4 +
 hw/remote/meson.build                      |    3 +
 hw/remote/trace-events                     |   11 +
 meson_options.txt                          |    2 +
 subprojects/libvfio-user                   |    1 +
 tests/avocado/avocado_qemu/__init__.py     |   14 +-
 tests/avocado/vfio-user.py                 |  225 +++++
 tests/docker/dockerfiles/centos8.docker    |    2 +
 tests/docker/dockerfiles/ubuntu2004.docker |    2 +
 31 files changed, 1912 insertions(+), 13 deletions(-)
 create mode 100644 hw/remote/vfio-user-obj.c
 create mode 160000 subprojects/libvfio-user
 create mode 100644 tests/avocado/vfio-user.py

-- 
2.20.1



^ permalink raw reply	[flat|nested] 99+ messages in thread

* [PATCH v5 01/18] configure, meson: override C compiler for cmake
  2022-01-19 21:41 [PATCH v5 00/18] vfio-user server in QEMU Jagannathan Raman
@ 2022-01-19 21:41 ` Jagannathan Raman
  2022-01-20 13:27   ` Paolo Bonzini
  2022-01-19 21:41 ` [PATCH v5 02/18] tests/avocado: Specify target VM argument to helper routines Jagannathan Raman
                   ` (17 subsequent siblings)
  18 siblings, 1 reply; 99+ messages in thread
From: Jagannathan Raman @ 2022-01-19 21:41 UTC (permalink / raw)
  To: qemu-devel
  Cc: eduardo, elena.ufimtseva, john.g.johnson, berrange, bleal,
	john.levon, mst, armbru, quintela, f4bug, marcandre.lureau,
	stefanha, thanos.makatos, pbonzini, jag.raman, eblake, dgilbert

The compiler path that cmake gets from meson is corrupted. It results in
the following error:
| -- The C compiler identification is unknown
| CMake Error at CMakeLists.txt:35 (project):
| The CMAKE_C_COMPILER:
| /opt/rh/devtoolset-9/root/bin/cc;-m64;-mcx16
| is not a full path to an existing compiler tool.

Explicitly specify the C compiler for cmake to avoid this error

Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
---
 configure | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/configure b/configure
index e1a31fb332..6a865f8713 100755
--- a/configure
+++ b/configure
@@ -3747,6 +3747,8 @@ if test "$skip_meson" = no; then
   echo "cpp_args = [$(meson_quote $CXXFLAGS $EXTRA_CXXFLAGS)]" >> $cross
   echo "c_link_args = [$(meson_quote $CFLAGS $LDFLAGS $EXTRA_CFLAGS $EXTRA_LDFLAGS)]" >> $cross
   echo "cpp_link_args = [$(meson_quote $CXXFLAGS $LDFLAGS $EXTRA_CXXFLAGS $EXTRA_LDFLAGS)]" >> $cross
+  echo "[cmake]" >> $cross
+  echo "CMAKE_C_COMPILER = [$(meson_quote $cc $CPU_CFLAGS)]" >> $cross
   echo "[binaries]" >> $cross
   echo "c = [$(meson_quote $cc $CPU_CFLAGS)]" >> $cross
   test -n "$cxx" && echo "cpp = [$(meson_quote $cxx $CPU_CFLAGS)]" >> $cross
-- 
2.20.1



^ permalink raw reply related	[flat|nested] 99+ messages in thread

* [PATCH v5 02/18] tests/avocado: Specify target VM argument to helper routines
  2022-01-19 21:41 [PATCH v5 00/18] vfio-user server in QEMU Jagannathan Raman
  2022-01-19 21:41 ` [PATCH v5 01/18] configure, meson: override C compiler for cmake Jagannathan Raman
@ 2022-01-19 21:41 ` Jagannathan Raman
  2022-01-25  9:40   ` Stefan Hajnoczi
  2022-01-19 21:41 ` [PATCH v5 03/18] pci: isolated address space for PCI bus Jagannathan Raman
                   ` (16 subsequent siblings)
  18 siblings, 1 reply; 99+ messages in thread
From: Jagannathan Raman @ 2022-01-19 21:41 UTC (permalink / raw)
  To: qemu-devel
  Cc: eduardo, elena.ufimtseva, john.g.johnson, berrange, bleal,
	john.levon, mst, armbru, quintela, f4bug, marcandre.lureau,
	stefanha, thanos.makatos, pbonzini, jag.raman, eblake, dgilbert

Specify target VM for exec_command and
exec_command_and_wait_for_pattern routines

Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Reviewed-by: Beraldo Leal <bleal@redhat.com>
---
 tests/avocado/avocado_qemu/__init__.py | 14 ++++++++++----
 1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/tests/avocado/avocado_qemu/__init__.py b/tests/avocado/avocado_qemu/__init__.py
index 75063c0c30..b3fbf77577 100644
--- a/tests/avocado/avocado_qemu/__init__.py
+++ b/tests/avocado/avocado_qemu/__init__.py
@@ -198,7 +198,7 @@ def wait_for_console_pattern(test, success_message, failure_message=None,
     """
     _console_interaction(test, success_message, failure_message, None, vm=vm)
 
-def exec_command(test, command):
+def exec_command(test, command, vm=None):
     """
     Send a command to a console (appending CRLF characters), while logging
     the content.
@@ -207,11 +207,14 @@ def exec_command(test, command):
     :type test: :class:`avocado_qemu.QemuSystemTest`
     :param command: the command to send
     :type command: str
+    :param vm: target vm
+    :type vm: :class:`qemu.machine.QEMUMachine`
     """
-    _console_interaction(test, None, None, command + '\r')
+    _console_interaction(test, None, None, command + '\r', vm=vm)
 
 def exec_command_and_wait_for_pattern(test, command,
-                                      success_message, failure_message=None):
+                                      success_message, failure_message=None,
+                                      vm=None):
     """
     Send a command to a console (appending CRLF characters), then wait
     for success_message to appear on the console, while logging the.
@@ -223,8 +226,11 @@ def exec_command_and_wait_for_pattern(test, command,
     :param command: the command to send
     :param success_message: if this message appears, test succeeds
     :param failure_message: if this message appears, test fails
+    :param vm: target vm
+    :type vm: :class:`qemu.machine.QEMUMachine`
     """
-    _console_interaction(test, success_message, failure_message, command + '\r')
+    _console_interaction(test, success_message, failure_message, command + '\r',
+                         vm=vm)
 
 class QemuBaseTest(avocado.Test):
     def _get_unique_tag_val(self, tag_name):
-- 
2.20.1



^ permalink raw reply related	[flat|nested] 99+ messages in thread

* [PATCH v5 03/18] pci: isolated address space for PCI bus
  2022-01-19 21:41 [PATCH v5 00/18] vfio-user server in QEMU Jagannathan Raman
  2022-01-19 21:41 ` [PATCH v5 01/18] configure, meson: override C compiler for cmake Jagannathan Raman
  2022-01-19 21:41 ` [PATCH v5 02/18] tests/avocado: Specify target VM argument to helper routines Jagannathan Raman
@ 2022-01-19 21:41 ` Jagannathan Raman
  2022-01-20  0:12   ` Michael S. Tsirkin
  2022-01-25  9:56   ` Stefan Hajnoczi
  2022-01-19 21:41 ` [PATCH v5 04/18] pci: create and free isolated PCI buses Jagannathan Raman
                   ` (15 subsequent siblings)
  18 siblings, 2 replies; 99+ messages in thread
From: Jagannathan Raman @ 2022-01-19 21:41 UTC (permalink / raw)
  To: qemu-devel
  Cc: eduardo, elena.ufimtseva, john.g.johnson, berrange, bleal,
	john.levon, mst, armbru, quintela, f4bug, marcandre.lureau,
	stefanha, thanos.makatos, pbonzini, jag.raman, eblake, dgilbert

Allow PCI buses to be part of isolated CPU address spaces. This has a
niche usage.

TYPE_REMOTE_MACHINE allows multiple VMs to house their PCI devices in
the same machine/server. This would cause address space collision as
well as be a security vulnerability. Having separate address spaces for
each PCI bus would solve this problem.

Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
---
 include/hw/pci/pci.h     |  2 ++
 include/hw/pci/pci_bus.h | 17 +++++++++++++++++
 hw/pci/pci.c             | 17 +++++++++++++++++
 hw/pci/pci_bridge.c      |  5 +++++
 4 files changed, 41 insertions(+)

diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
index 023abc0f79..9bb4472abc 100644
--- a/include/hw/pci/pci.h
+++ b/include/hw/pci/pci.h
@@ -387,6 +387,8 @@ void pci_device_save(PCIDevice *s, QEMUFile *f);
 int pci_device_load(PCIDevice *s, QEMUFile *f);
 MemoryRegion *pci_address_space(PCIDevice *dev);
 MemoryRegion *pci_address_space_io(PCIDevice *dev);
+AddressSpace *pci_isol_as_mem(PCIDevice *dev);
+AddressSpace *pci_isol_as_io(PCIDevice *dev);
 
 /*
  * Should not normally be used by devices. For use by sPAPR target
diff --git a/include/hw/pci/pci_bus.h b/include/hw/pci/pci_bus.h
index 347440d42c..d78258e79e 100644
--- a/include/hw/pci/pci_bus.h
+++ b/include/hw/pci/pci_bus.h
@@ -39,9 +39,26 @@ struct PCIBus {
     void *irq_opaque;
     PCIDevice *devices[PCI_SLOT_MAX * PCI_FUNC_MAX];
     PCIDevice *parent_dev;
+
     MemoryRegion *address_space_mem;
     MemoryRegion *address_space_io;
 
+    /**
+     * Isolated address spaces - these allow the PCI bus to be part
+     * of an isolated address space as opposed to the global
+     * address_space_memory & address_space_io. This allows the
+     * bus to be attached to CPUs from different machines. The
+     * following is not used used commonly.
+     *
+     * TYPE_REMOTE_MACHINE allows emulating devices from multiple
+     * VM clients, as such it needs the PCI buses in the same machine
+     * to be part of different CPU address spaces. The following is
+     * useful in that scenario.
+     *
+     */
+    AddressSpace *isol_as_mem;
+    AddressSpace *isol_as_io;
+
     QLIST_HEAD(, PCIBus) child; /* this will be replaced by qdev later */
     QLIST_ENTRY(PCIBus) sibling;/* this will be replaced by qdev later */
 
diff --git a/hw/pci/pci.c b/hw/pci/pci.c
index 5d30f9ca60..d5f1c6c421 100644
--- a/hw/pci/pci.c
+++ b/hw/pci/pci.c
@@ -442,6 +442,8 @@ static void pci_root_bus_internal_init(PCIBus *bus, DeviceState *parent,
     bus->slot_reserved_mask = 0x0;
     bus->address_space_mem = address_space_mem;
     bus->address_space_io = address_space_io;
+    bus->isol_as_mem = NULL;
+    bus->isol_as_io = NULL;
     bus->flags |= PCI_BUS_IS_ROOT;
 
     /* host bridge */
@@ -2676,6 +2678,16 @@ MemoryRegion *pci_address_space_io(PCIDevice *dev)
     return pci_get_bus(dev)->address_space_io;
 }
 
+AddressSpace *pci_isol_as_mem(PCIDevice *dev)
+{
+    return pci_get_bus(dev)->isol_as_mem;
+}
+
+AddressSpace *pci_isol_as_io(PCIDevice *dev)
+{
+    return pci_get_bus(dev)->isol_as_io;
+}
+
 static void pci_device_class_init(ObjectClass *klass, void *data)
 {
     DeviceClass *k = DEVICE_CLASS(klass);
@@ -2699,6 +2711,7 @@ static void pci_device_class_base_init(ObjectClass *klass, void *data)
 
 AddressSpace *pci_device_iommu_address_space(PCIDevice *dev)
 {
+    AddressSpace *iommu_as = NULL;
     PCIBus *bus = pci_get_bus(dev);
     PCIBus *iommu_bus = bus;
     uint8_t devfn = dev->devfn;
@@ -2745,6 +2758,10 @@ AddressSpace *pci_device_iommu_address_space(PCIDevice *dev)
     if (!pci_bus_bypass_iommu(bus) && iommu_bus && iommu_bus->iommu_fn) {
         return iommu_bus->iommu_fn(bus, iommu_bus->iommu_opaque, devfn);
     }
+    iommu_as = pci_isol_as_mem(dev);
+    if (iommu_as) {
+        return iommu_as;
+    }
     return &address_space_memory;
 }
 
diff --git a/hw/pci/pci_bridge.c b/hw/pci/pci_bridge.c
index da34c8ebcd..98366768d2 100644
--- a/hw/pci/pci_bridge.c
+++ b/hw/pci/pci_bridge.c
@@ -383,6 +383,11 @@ void pci_bridge_initfn(PCIDevice *dev, const char *typename)
     sec_bus->address_space_io = &br->address_space_io;
     memory_region_init(&br->address_space_io, OBJECT(br), "pci_bridge_io",
                        4 * GiB);
+
+    /* This PCI bridge puts the sec_bus in its parent's address space */
+    sec_bus->isol_as_mem = pci_isol_as_mem(dev);
+    sec_bus->isol_as_io = pci_isol_as_io(dev);
+
     br->windows = pci_bridge_region_init(br);
     QLIST_INIT(&sec_bus->child);
     QLIST_INSERT_HEAD(&parent->child, sec_bus, sibling);
-- 
2.20.1



^ permalink raw reply related	[flat|nested] 99+ messages in thread

* [PATCH v5 04/18] pci: create and free isolated PCI buses
  2022-01-19 21:41 [PATCH v5 00/18] vfio-user server in QEMU Jagannathan Raman
                   ` (2 preceding siblings ...)
  2022-01-19 21:41 ` [PATCH v5 03/18] pci: isolated address space for PCI bus Jagannathan Raman
@ 2022-01-19 21:41 ` Jagannathan Raman
  2022-01-25 10:25   ` Stefan Hajnoczi
  2022-01-19 21:41 ` [PATCH v5 05/18] qdev: unplug blocker for devices Jagannathan Raman
                   ` (14 subsequent siblings)
  18 siblings, 1 reply; 99+ messages in thread
From: Jagannathan Raman @ 2022-01-19 21:41 UTC (permalink / raw)
  To: qemu-devel
  Cc: eduardo, elena.ufimtseva, john.g.johnson, berrange, bleal,
	john.levon, mst, armbru, quintela, f4bug, marcandre.lureau,
	stefanha, thanos.makatos, pbonzini, jag.raman, eblake, dgilbert

Adds pci_isol_bus_new() and pci_isol_bus_free() functions to manage
creation and destruction of isolated PCI buses. Also adds qdev_get_bus
and qdev_put_bus callbacks to allow the choice of parent bus.

Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
---
 include/hw/pci/pci.h   |   4 +
 include/hw/qdev-core.h |  16 ++++
 hw/pci/pci.c           | 169 +++++++++++++++++++++++++++++++++++++++++
 softmmu/qdev-monitor.c |  39 +++++++++-
 4 files changed, 225 insertions(+), 3 deletions(-)

diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
index 9bb4472abc..8c18f10d9d 100644
--- a/include/hw/pci/pci.h
+++ b/include/hw/pci/pci.h
@@ -452,6 +452,10 @@ PCIDevice *pci_nic_init_nofail(NICInfo *nd, PCIBus *rootbus,
 
 PCIDevice *pci_vga_init(PCIBus *bus);
 
+PCIBus *pci_isol_bus_new(BusState *parent_bus, const char *new_bus_type,
+                         Error **errp);
+bool pci_isol_bus_free(PCIBus *pci_bus, Error **errp);
+
 static inline PCIBus *pci_get_bus(const PCIDevice *dev)
 {
     return PCI_BUS(qdev_get_parent_bus(DEVICE(dev)));
diff --git a/include/hw/qdev-core.h b/include/hw/qdev-core.h
index 92c3d65208..eed2983072 100644
--- a/include/hw/qdev-core.h
+++ b/include/hw/qdev-core.h
@@ -419,6 +419,20 @@ void qdev_simple_device_unplug_cb(HotplugHandler *hotplug_dev,
 void qdev_machine_creation_done(void);
 bool qdev_machine_modified(void);
 
+/**
+ * Find parent bus - these callbacks are used during device addition
+ * and deletion.
+ *
+ * During addition, if no parent bus is specified in the options,
+ * these callbacks provide a way to figure it out based on the
+ * bus type. If these callbacks are not defined, defaults to
+ * finding the parent bus starting from default system bus
+ */
+typedef bool (QDevGetBusFunc)(const char *type, BusState **bus, Error **errp);
+typedef bool (QDevPutBusFunc)(BusState *bus, Error **errp);
+bool qdev_set_bus_cbs(QDevGetBusFunc *get_bus, QDevPutBusFunc *put_bus,
+                      Error **errp);
+
 /**
  * GpioPolarity: Polarity of a GPIO line
  *
@@ -691,6 +705,8 @@ BusState *qdev_get_parent_bus(DeviceState *dev);
 /*** BUS API. ***/
 
 DeviceState *qdev_find_recursive(BusState *bus, const char *id);
+BusState *qbus_find_recursive(BusState *bus, const char *name,
+                              const char *bus_typename);
 
 /* Returns 0 to walk children, > 0 to skip walk, < 0 to terminate walk. */
 typedef int (qbus_walkerfn)(BusState *bus, void *opaque);
diff --git a/hw/pci/pci.c b/hw/pci/pci.c
index d5f1c6c421..63ec1e47b5 100644
--- a/hw/pci/pci.c
+++ b/hw/pci/pci.c
@@ -493,6 +493,175 @@ void pci_root_bus_cleanup(PCIBus *bus)
     qbus_unrealize(BUS(bus));
 }
 
+static void pci_bus_free_isol_mem(PCIBus *pci_bus)
+{
+    if (pci_bus->address_space_mem) {
+        memory_region_unref(pci_bus->address_space_mem);
+        pci_bus->address_space_mem = NULL;
+    }
+
+    if (pci_bus->isol_as_mem) {
+        address_space_destroy(pci_bus->isol_as_mem);
+        pci_bus->isol_as_mem = NULL;
+    }
+
+    if (pci_bus->address_space_io) {
+        memory_region_unref(pci_bus->address_space_io);
+        pci_bus->address_space_io = NULL;
+    }
+
+    if (pci_bus->isol_as_io) {
+        address_space_destroy(pci_bus->isol_as_io);
+        pci_bus->isol_as_io = NULL;
+    }
+}
+
+static void pci_bus_init_isol_mem(PCIBus *pci_bus, uint32_t unique_id)
+{
+    g_autofree char *mem_mr_name = NULL;
+    g_autofree char *mem_as_name = NULL;
+    g_autofree char *io_mr_name = NULL;
+    g_autofree char *io_as_name = NULL;
+
+    if (!pci_bus) {
+        return;
+    }
+
+    mem_mr_name = g_strdup_printf("mem-mr-%u", unique_id);
+    mem_as_name = g_strdup_printf("mem-as-%u", unique_id);
+    io_mr_name = g_strdup_printf("io-mr-%u", unique_id);
+    io_as_name = g_strdup_printf("io-as-%u", unique_id);
+
+    pci_bus->address_space_mem = g_malloc0(sizeof(MemoryRegion));
+    pci_bus->isol_as_mem = g_malloc0(sizeof(AddressSpace));
+    memory_region_init(pci_bus->address_space_mem, NULL,
+                       mem_mr_name, UINT64_MAX);
+    address_space_init(pci_bus->isol_as_mem,
+                       pci_bus->address_space_mem, mem_as_name);
+
+    pci_bus->address_space_io = g_malloc0(sizeof(MemoryRegion));
+    pci_bus->isol_as_io = g_malloc0(sizeof(AddressSpace));
+    memory_region_init(pci_bus->address_space_io, NULL,
+                       io_mr_name, UINT64_MAX);
+    address_space_init(pci_bus->isol_as_io,
+                       pci_bus->address_space_io, io_as_name);
+}
+
+PCIBus *pci_isol_bus_new(BusState *parent_bus, const char *new_bus_type,
+                         Error **errp)
+{
+    ERRP_GUARD();
+    PCIBus *parent_pci_bus = NULL;
+    DeviceState *pcie_root_port = NULL;
+    g_autofree char *new_bus_name = NULL;
+    PCIBus *new_pci_bus = NULL;
+    HotplugHandler *hotplug_handler = NULL;
+    uint32_t devfn, slot;
+
+    if (!parent_bus) {
+        error_setg(errp, "parent PCI bus not found");
+        return NULL;
+    }
+
+    if (!new_bus_type ||
+        (strcmp(new_bus_type, TYPE_PCIE_BUS) &&
+         strcmp(new_bus_type, TYPE_PCI_BUS))) {
+        error_setg(errp, "bus type must be %s or %s", TYPE_PCIE_BUS,
+                   TYPE_PCI_BUS);
+        return NULL;
+    }
+
+    if (!object_dynamic_cast(OBJECT(parent_bus), TYPE_PCI_BUS)) {
+        error_setg(errp, "Unsupported root bus type");
+        return NULL;
+    }
+
+    parent_pci_bus = PCI_BUS(parent_bus);
+
+    /**
+     * Create TYPE_GEN_PCIE_ROOT_PORT device to interface parent and
+     * new buses.
+     */
+    for (devfn = 0; devfn < (PCI_SLOT_MAX * PCI_FUNC_MAX);
+         devfn += PCI_FUNC_MAX) {
+        if (!parent_pci_bus->devices[devfn]) {
+            break;
+        }
+    }
+    if (devfn == (PCI_SLOT_MAX * PCI_FUNC_MAX)) {
+        error_setg(errp, "parent PCI slots full");
+        return NULL;
+    }
+
+    slot = devfn / PCI_FUNC_MAX;
+    pcie_root_port = qdev_new("pcie-root-port");
+    if (!object_property_set_int(OBJECT(pcie_root_port), "slot",
+                                 slot, errp)){
+        error_prepend(errp, "Failed to set slot property for root port: ");
+        goto fail_rp_init;
+    }
+    if (!qdev_realize(pcie_root_port, parent_bus, errp)) {
+        goto fail_rp_init;
+    }
+
+    /**
+     * Create new PCI bus and plug it to the root port
+     */
+    new_bus_name = g_strdup_printf("pci-bus-%d", (slot + 1));
+    new_pci_bus = PCI_BUS(qbus_new(new_bus_type, pcie_root_port, new_bus_name));
+    new_pci_bus->parent_dev = PCI_DEVICE(pcie_root_port);
+    hotplug_handler = qdev_get_bus_hotplug_handler(pcie_root_port);
+    qbus_set_hotplug_handler(BUS(new_pci_bus), OBJECT(hotplug_handler));
+    pci_default_write_config(new_pci_bus->parent_dev, PCI_SECONDARY_BUS,
+                             (slot + 1), 1);
+    pci_default_write_config(new_pci_bus->parent_dev, PCI_SUBORDINATE_BUS,
+                             (slot + 1), 1);
+    pci_bus_init_isol_mem(new_pci_bus, (slot + 1));
+
+    QLIST_INIT(&new_pci_bus->child);
+    QLIST_INSERT_HEAD(&parent_pci_bus->child, new_pci_bus, sibling);
+
+    if (!qbus_realize(BUS(new_pci_bus), errp)) {
+        QLIST_REMOVE(new_pci_bus, sibling);
+        pci_bus_free_isol_mem(new_pci_bus);
+        object_unparent(OBJECT(new_pci_bus));
+        new_pci_bus = NULL;
+        goto fail_rp_init;
+    }
+
+    return new_pci_bus;
+
+fail_rp_init:
+    qdev_unrealize(pcie_root_port);
+    object_unparent(OBJECT(pcie_root_port));
+    pcie_root_port = NULL;
+    return NULL;
+}
+
+bool pci_isol_bus_free(PCIBus *pci_bus, Error **errp)
+{
+    ERRP_GUARD();
+    PCIDevice *pcie_root_port = pci_bus->parent_dev;
+
+    if (!pcie_root_port) {
+        error_setg(errp, "Can't unplug root bus");
+        return false;
+    }
+
+    if (!QLIST_EMPTY(&pci_bus->child)) {
+        error_setg(errp, "Bus has attached device");
+        return false;
+    }
+
+    QLIST_REMOVE(pci_bus, sibling);
+    pci_bus_free_isol_mem(pci_bus);
+    qbus_unrealize(BUS(pci_bus));
+    object_unparent(OBJECT(pci_bus));
+    qdev_unrealize(DEVICE(pcie_root_port));
+    object_unparent(OBJECT(pcie_root_port));
+    return true;
+}
+
 void pci_bus_irqs(PCIBus *bus, pci_set_irq_fn set_irq, pci_map_irq_fn map_irq,
                   void *irq_opaque, int nirq)
 {
diff --git a/softmmu/qdev-monitor.c b/softmmu/qdev-monitor.c
index 01f3834db5..7306074019 100644
--- a/softmmu/qdev-monitor.c
+++ b/softmmu/qdev-monitor.c
@@ -64,6 +64,9 @@ typedef struct QDevAlias
 #define QEMU_ARCH_VIRTIO_CCW (QEMU_ARCH_S390X)
 #define QEMU_ARCH_VIRTIO_MMIO (QEMU_ARCH_M68K)
 
+static QDevGetBusFunc *qdev_get_bus;
+static QDevPutBusFunc *qdev_put_bus;
+
 /* Please keep this table sorted by typename. */
 static const QDevAlias qdev_alias_table[] = {
     { "AC97", "ac97" }, /* -soundhw name */
@@ -450,7 +453,7 @@ static inline bool qbus_is_full(BusState *bus)
  * If more than one exists, prefer one that can take another device.
  * Return the bus if found, else %NULL.
  */
-static BusState *qbus_find_recursive(BusState *bus, const char *name,
+BusState *qbus_find_recursive(BusState *bus, const char *name,
                                      const char *bus_typename)
 {
     BusChild *kid;
@@ -608,6 +611,20 @@ const char *qdev_set_id(DeviceState *dev, char *id, Error **errp)
     return prop->name;
 }
 
+
+bool qdev_set_bus_cbs(QDevGetBusFunc *get_bus, QDevPutBusFunc *put_bus,
+                      Error **errp)
+{
+    if (qdev_get_bus || qdev_put_bus) {
+        error_setg(errp, "callbacks already set");
+        return false;
+    }
+
+    qdev_get_bus = get_bus;
+    qdev_put_bus = put_bus;
+    return true;
+}
+
 DeviceState *qdev_device_add_from_qdict(const QDict *opts,
                                         bool from_json, Error **errp)
 {
@@ -642,7 +659,13 @@ DeviceState *qdev_device_add_from_qdict(const QDict *opts,
                        driver, object_get_typename(OBJECT(bus)));
             return NULL;
         }
-    } else if (dc->bus_type != NULL) {
+    } else if (dc->bus_type != NULL && qdev_get_bus != NULL) {
+        if (!qdev_get_bus(dc->bus_type, &bus, errp)) {
+            return NULL;
+        }
+    }
+
+    if (!bus && dc->bus_type != NULL) {
         bus = qbus_find_recursive(sysbus_get_default(), NULL, dc->bus_type);
         if (!bus || qbus_is_full(bus)) {
             error_setg(errp, "No '%s' bus found for device '%s'",
@@ -891,10 +914,12 @@ static DeviceState *find_device_state(const char *id, Error **errp)
 
 void qdev_unplug(DeviceState *dev, Error **errp)
 {
+    ERRP_GUARD();
     DeviceClass *dc = DEVICE_GET_CLASS(dev);
     HotplugHandler *hotplug_ctrl;
     HotplugHandlerClass *hdc;
     Error *local_err = NULL;
+    BusState *parent_bus = qdev_get_parent_bus(dev);
 
     if (dev->parent_bus && !qbus_is_hotpluggable(dev->parent_bus)) {
         error_setg(errp, QERR_BUS_NO_HOTPLUG, dev->parent_bus->name);
@@ -930,7 +955,15 @@ void qdev_unplug(DeviceState *dev, Error **errp)
             object_unparent(OBJECT(dev));
         }
     }
-    error_propagate(errp, local_err);
+
+    if (local_err) {
+        error_propagate(errp, local_err);
+        return;
+    }
+
+    if (qdev_put_bus) {
+        qdev_put_bus(parent_bus, errp);
+    }
 }
 
 void qmp_device_del(const char *id, Error **errp)
-- 
2.20.1



^ permalink raw reply related	[flat|nested] 99+ messages in thread

* [PATCH v5 05/18] qdev: unplug blocker for devices
  2022-01-19 21:41 [PATCH v5 00/18] vfio-user server in QEMU Jagannathan Raman
                   ` (3 preceding siblings ...)
  2022-01-19 21:41 ` [PATCH v5 04/18] pci: create and free isolated PCI buses Jagannathan Raman
@ 2022-01-19 21:41 ` Jagannathan Raman
  2022-01-25 10:27   ` Stefan Hajnoczi
  2022-01-19 21:41 ` [PATCH v5 06/18] vfio-user: add HotplugHandler for remote machine Jagannathan Raman
                   ` (13 subsequent siblings)
  18 siblings, 1 reply; 99+ messages in thread
From: Jagannathan Raman @ 2022-01-19 21:41 UTC (permalink / raw)
  To: qemu-devel
  Cc: eduardo, elena.ufimtseva, john.g.johnson, berrange, bleal,
	john.levon, mst, armbru, quintela, f4bug, marcandre.lureau,
	stefanha, thanos.makatos, pbonzini, jag.raman, eblake, dgilbert

Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
---
 include/hw/qdev-core.h |  5 +++++
 softmmu/qdev-monitor.c | 35 +++++++++++++++++++++++++++++++++++
 2 files changed, 40 insertions(+)

diff --git a/include/hw/qdev-core.h b/include/hw/qdev-core.h
index eed2983072..67df5e0081 100644
--- a/include/hw/qdev-core.h
+++ b/include/hw/qdev-core.h
@@ -193,6 +193,7 @@ struct DeviceState {
     int instance_id_alias;
     int alias_required_for_version;
     ResettableState reset;
+    GSList *unplug_blockers;
 };
 
 struct DeviceListener {
@@ -433,6 +434,10 @@ typedef bool (QDevPutBusFunc)(BusState *bus, Error **errp);
 bool qdev_set_bus_cbs(QDevGetBusFunc *get_bus, QDevPutBusFunc *put_bus,
                       Error **errp);
 
+int qdev_add_unplug_blocker(DeviceState *dev, Error *reason, Error **errp);
+void qdev_del_unplug_blocker(DeviceState *dev, Error *reason);
+bool qdev_unplug_blocked(DeviceState *dev, Error **errp);
+
 /**
  * GpioPolarity: Polarity of a GPIO line
  *
diff --git a/softmmu/qdev-monitor.c b/softmmu/qdev-monitor.c
index 7306074019..1a169f89a2 100644
--- a/softmmu/qdev-monitor.c
+++ b/softmmu/qdev-monitor.c
@@ -978,10 +978,45 @@ void qmp_device_del(const char *id, Error **errp)
             return;
         }
 
+        if (qdev_unplug_blocked(dev, errp)) {
+            return;
+        }
+
         qdev_unplug(dev, errp);
     }
 }
 
+int qdev_add_unplug_blocker(DeviceState *dev, Error *reason, Error **errp)
+{
+    ERRP_GUARD();
+
+    if (!migration_is_idle()) {
+        error_setg(errp, "migration is in progress");
+        return -EBUSY;
+    }
+
+    dev->unplug_blockers = g_slist_prepend(dev->unplug_blockers, reason);
+
+    return 0;
+}
+
+void qdev_del_unplug_blocker(DeviceState *dev, Error *reason)
+{
+    dev->unplug_blockers = g_slist_remove(dev->unplug_blockers, reason);
+}
+
+bool qdev_unplug_blocked(DeviceState *dev, Error **errp)
+{
+    ERRP_GUARD();
+
+    if (dev->unplug_blockers) {
+        error_propagate(errp, error_copy(dev->unplug_blockers->data));
+        return true;
+    }
+
+    return false;
+}
+
 void hmp_device_add(Monitor *mon, const QDict *qdict)
 {
     Error *err = NULL;
-- 
2.20.1



^ permalink raw reply related	[flat|nested] 99+ messages in thread

* [PATCH v5 06/18] vfio-user: add HotplugHandler for remote machine
  2022-01-19 21:41 [PATCH v5 00/18] vfio-user server in QEMU Jagannathan Raman
                   ` (4 preceding siblings ...)
  2022-01-19 21:41 ` [PATCH v5 05/18] qdev: unplug blocker for devices Jagannathan Raman
@ 2022-01-19 21:41 ` Jagannathan Raman
  2022-01-25 10:32   ` Stefan Hajnoczi
  2022-01-19 21:41 ` [PATCH v5 07/18] vfio-user: set qdev bus callbacks " Jagannathan Raman
                   ` (12 subsequent siblings)
  18 siblings, 1 reply; 99+ messages in thread
From: Jagannathan Raman @ 2022-01-19 21:41 UTC (permalink / raw)
  To: qemu-devel
  Cc: eduardo, elena.ufimtseva, john.g.johnson, berrange, bleal,
	john.levon, mst, armbru, quintela, f4bug, marcandre.lureau,
	stefanha, thanos.makatos, pbonzini, jag.raman, eblake, dgilbert

Allow hotplugging of PCI(e) devices to remote machine

Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
---
 hw/remote/machine.c | 29 +++++++++++++++++++++++++++++
 1 file changed, 29 insertions(+)

diff --git a/hw/remote/machine.c b/hw/remote/machine.c
index 952105eab5..220ff01aa9 100644
--- a/hw/remote/machine.c
+++ b/hw/remote/machine.c
@@ -54,14 +54,39 @@ static void remote_machine_init(MachineState *machine)
 
     pci_bus_irqs(pci_host->bus, remote_iohub_set_irq, remote_iohub_map_irq,
                  &s->iohub, REMOTE_IOHUB_NB_PIRQS);
+
+    qbus_set_hotplug_handler(BUS(pci_host->bus), OBJECT(s));
+}
+
+static void remote_machine_pre_plug_cb(HotplugHandler *hotplug_dev,
+                                       DeviceState *dev, Error **errp)
+{
+    if (!object_dynamic_cast(OBJECT(dev), TYPE_PCI_DEVICE)) {
+        error_setg(errp, "Only allowing PCI hotplug");
+    }
+}
+
+static void remote_machine_unplug_cb(HotplugHandler *hotplug_dev,
+                                     DeviceState *dev, Error **errp)
+{
+    if (!object_dynamic_cast(OBJECT(dev), TYPE_PCI_DEVICE)) {
+        error_setg(errp, "Only allowing PCI hot-unplug");
+        return;
+    }
+
+    qdev_unrealize(dev);
 }
 
 static void remote_machine_class_init(ObjectClass *oc, void *data)
 {
     MachineClass *mc = MACHINE_CLASS(oc);
+    HotplugHandlerClass *hc = HOTPLUG_HANDLER_CLASS(oc);
 
     mc->init = remote_machine_init;
     mc->desc = "Experimental remote machine";
+
+    hc->pre_plug = remote_machine_pre_plug_cb;
+    hc->unplug = remote_machine_unplug_cb;
 }
 
 static const TypeInfo remote_machine = {
@@ -69,6 +94,10 @@ static const TypeInfo remote_machine = {
     .parent = TYPE_MACHINE,
     .instance_size = sizeof(RemoteMachineState),
     .class_init = remote_machine_class_init,
+    .interfaces = (InterfaceInfo[]) {
+        { TYPE_HOTPLUG_HANDLER },
+        { }
+    }
 };
 
 static void remote_machine_register_types(void)
-- 
2.20.1



^ permalink raw reply related	[flat|nested] 99+ messages in thread

* [PATCH v5 07/18] vfio-user: set qdev bus callbacks for remote machine
  2022-01-19 21:41 [PATCH v5 00/18] vfio-user server in QEMU Jagannathan Raman
                   ` (5 preceding siblings ...)
  2022-01-19 21:41 ` [PATCH v5 06/18] vfio-user: add HotplugHandler for remote machine Jagannathan Raman
@ 2022-01-19 21:41 ` Jagannathan Raman
  2022-01-25 10:44   ` Stefan Hajnoczi
  2022-01-19 21:41 ` [PATCH v5 08/18] vfio-user: build library Jagannathan Raman
                   ` (11 subsequent siblings)
  18 siblings, 1 reply; 99+ messages in thread
From: Jagannathan Raman @ 2022-01-19 21:41 UTC (permalink / raw)
  To: qemu-devel
  Cc: eduardo, elena.ufimtseva, john.g.johnson, berrange, bleal,
	john.levon, mst, armbru, quintela, f4bug, marcandre.lureau,
	stefanha, thanos.makatos, pbonzini, jag.raman, eblake, dgilbert

Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
---
 hw/remote/machine.c | 57 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 57 insertions(+)

diff --git a/hw/remote/machine.c b/hw/remote/machine.c
index 220ff01aa9..221a8430c1 100644
--- a/hw/remote/machine.c
+++ b/hw/remote/machine.c
@@ -22,6 +22,60 @@
 #include "hw/pci/pci_host.h"
 #include "hw/remote/iohub.h"
 
+static bool remote_machine_get_bus(const char *type, BusState **bus,
+                                   Error **errp)
+{
+    ERRP_GUARD();
+    RemoteMachineState *s = REMOTE_MACHINE(current_machine);
+    BusState *root_bus = NULL;
+    PCIBus *new_pci_bus = NULL;
+
+    if (!bus) {
+        error_setg(errp, "Invalid argument");
+        return false;
+    }
+
+    if (strcmp(type, TYPE_PCI_BUS) && strcmp(type, TYPE_PCI_BUS)) {
+        return true;
+    }
+
+    root_bus = qbus_find_recursive(sysbus_get_default(), NULL, TYPE_PCIE_BUS);
+    if (!root_bus) {
+        error_setg(errp, "Unable to find root PCI device");
+        return false;
+    }
+
+    new_pci_bus = pci_isol_bus_new(root_bus, type, errp);
+    if (!new_pci_bus) {
+        return false;
+    }
+
+    *bus = BUS(new_pci_bus);
+
+    pci_bus_irqs(new_pci_bus, remote_iohub_set_irq, remote_iohub_map_irq,
+                 &s->iohub, REMOTE_IOHUB_NB_PIRQS);
+
+    return true;
+}
+
+static bool remote_machine_put_bus(BusState *bus, Error **errp)
+{
+    PCIBus *pci_bus = NULL;
+
+    if (!bus) {
+        error_setg(errp, "Invalid argument");
+        return false;
+    }
+
+    if (!object_dynamic_cast(OBJECT(bus), TYPE_PCI_BUS)) {
+        return true;
+    }
+
+    pci_bus = PCI_BUS(bus);
+
+    return pci_isol_bus_free(pci_bus, errp);
+}
+
 static void remote_machine_init(MachineState *machine)
 {
     MemoryRegion *system_memory, *system_io, *pci_memory;
@@ -56,6 +110,9 @@ static void remote_machine_init(MachineState *machine)
                  &s->iohub, REMOTE_IOHUB_NB_PIRQS);
 
     qbus_set_hotplug_handler(BUS(pci_host->bus), OBJECT(s));
+
+    qdev_set_bus_cbs(remote_machine_get_bus, remote_machine_put_bus,
+                     &error_fatal);
 }
 
 static void remote_machine_pre_plug_cb(HotplugHandler *hotplug_dev,
-- 
2.20.1



^ permalink raw reply related	[flat|nested] 99+ messages in thread

* [PATCH v5 08/18] vfio-user: build library
  2022-01-19 21:41 [PATCH v5 00/18] vfio-user server in QEMU Jagannathan Raman
                   ` (6 preceding siblings ...)
  2022-01-19 21:41 ` [PATCH v5 07/18] vfio-user: set qdev bus callbacks " Jagannathan Raman
@ 2022-01-19 21:41 ` Jagannathan Raman
  2022-01-19 21:41 ` [PATCH v5 09/18] vfio-user: define vfio-user-server object Jagannathan Raman
                   ` (10 subsequent siblings)
  18 siblings, 0 replies; 99+ messages in thread
From: Jagannathan Raman @ 2022-01-19 21:41 UTC (permalink / raw)
  To: qemu-devel
  Cc: eduardo, elena.ufimtseva, john.g.johnson, berrange, bleal,
	john.levon, mst, armbru, quintela, f4bug, marcandre.lureau,
	stefanha, thanos.makatos, pbonzini, jag.raman, eblake, dgilbert

add the libvfio-user library as a submodule. build it as a cmake
subproject.

Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
---
 configure                                  | 19 +++++++++-
 meson.build                                | 44 +++++++++++++++++++++-
 .gitlab-ci.d/buildtest.yml                 |  2 +
 .gitmodules                                |  3 ++
 Kconfig.host                               |  4 ++
 MAINTAINERS                                |  1 +
 hw/remote/Kconfig                          |  4 ++
 hw/remote/meson.build                      |  2 +
 meson_options.txt                          |  2 +
 subprojects/libvfio-user                   |  1 +
 tests/docker/dockerfiles/centos8.docker    |  2 +
 tests/docker/dockerfiles/ubuntu2004.docker |  2 +
 12 files changed, 84 insertions(+), 2 deletions(-)
 create mode 160000 subprojects/libvfio-user

diff --git a/configure b/configure
index 6a865f8713..c8035de952 100755
--- a/configure
+++ b/configure
@@ -356,6 +356,7 @@ ninja=""
 gio="$default_feature"
 skip_meson=no
 slirp_smbd="$default_feature"
+vfio_user_server="disabled"
 
 # The following Meson options are handled manually (still they
 # are included in the automatically generated help message)
@@ -1172,6 +1173,10 @@ for opt do
   ;;
   --disable-blobs) meson_option_parse --disable-install-blobs ""
   ;;
+  --enable-vfio-user-server) vfio_user_server="enabled"
+  ;;
+  --disable-vfio-user-server) vfio_user_server="disabled"
+  ;;
   --enable-tcmalloc) meson_option_parse --enable-malloc=tcmalloc tcmalloc
   ;;
   --enable-jemalloc) meson_option_parse --enable-malloc=jemalloc jemalloc
@@ -1439,6 +1444,7 @@ cat << EOF
   rng-none        dummy RNG, avoid using /dev/(u)random and getrandom()
   gio             libgio support
   slirp-smbd      use smbd (at path --smbd=*) in slirp networking
+  vfio-user-server    vfio-user server support
 
 NOTE: The object files are built at the place where configure is launched
 EOF
@@ -3121,6 +3127,17 @@ but not implemented on your system"
     fi
 fi
 
+##########################################
+# check for vfio_user_server
+
+case "$vfio_user_server" in
+  auto | enabled )
+    if test "$git_submodules_action" != "ignore"; then
+      git_submodules="${git_submodules} subprojects/libvfio-user"
+    fi
+    ;;
+esac
+
 ##########################################
 # End of CC checks
 # After here, no more $cc or $ld runs
@@ -3811,7 +3828,7 @@ if test "$skip_meson" = no; then
         -Db_pie=$(if test "$pie" = yes; then echo true; else echo false; fi) \
         -Db_coverage=$(if test "$gcov" = yes; then echo true; else echo false; fi) \
         -Db_lto=$lto -Dcfi=$cfi -Dtcg=$tcg -Dxen=$xen \
-        -Dcapstone=$capstone -Dfdt=$fdt -Dslirp=$slirp \
+        -Dcapstone=$capstone -Dfdt=$fdt -Dslirp=$slirp -Dvfio_user_server=$vfio_user_server \
         $(test -n "${LIB_FUZZING_ENGINE+xxx}" && echo "-Dfuzzing_engine=$LIB_FUZZING_ENGINE") \
         $(if test "$default_feature" = no; then echo "-Dauto_features=disabled"; fi) \
         "$@" $cross_arg "$PWD" "$source_path"
diff --git a/meson.build b/meson.build
index 333c61deba..15c2567543 100644
--- a/meson.build
+++ b/meson.build
@@ -274,6 +274,11 @@ if targetos != 'linux' and get_option('multiprocess').enabled()
 endif
 multiprocess_allowed = targetos == 'linux' and not get_option('multiprocess').disabled()
 
+if targetos != 'linux' and get_option('vfio_user_server').enabled()
+  error('vfio-user server is supported only on Linux')
+endif
+vfio_user_server_allowed = targetos == 'linux' and not get_option('vfio_user_server').disabled()
+
 # Target-specific libraries and flags
 libm = cc.find_library('m', required: false)
 threads = dependency('threads')
@@ -1877,7 +1882,8 @@ host_kconfig = \
   (have_virtfs ? ['CONFIG_VIRTFS=y'] : []) + \
   ('CONFIG_LINUX' in config_host ? ['CONFIG_LINUX=y'] : []) + \
   ('CONFIG_PVRDMA' in config_host ? ['CONFIG_PVRDMA=y'] : []) + \
-  (multiprocess_allowed ? ['CONFIG_MULTIPROCESS_ALLOWED=y'] : [])
+  (multiprocess_allowed ? ['CONFIG_MULTIPROCESS_ALLOWED=y'] : []) + \
+  (vfio_user_server_allowed ? ['CONFIG_VFIO_USER_SERVER_ALLOWED=y'] : [])
 
 ignored = [ 'TARGET_XML_FILES', 'TARGET_ABI_DIR', 'TARGET_ARCH' ]
 
@@ -2266,6 +2272,41 @@ if get_option('cfi') and slirp_opt == 'system'
          + ' Please configure with --enable-slirp=git')
 endif
 
+vfiouser = not_found
+if have_system and vfio_user_server_allowed
+  have_internal = fs.exists(meson.current_source_dir() / 'subprojects/libvfio-user/Makefile')
+
+  if not have_internal
+    error('libvfio-user source not found - please pull git submodule')
+  endif
+
+  json_c = dependency('json-c', required: false)
+  if not json_c.found()
+    json_c = dependency('libjson-c', required: false)
+  endif
+  if not json_c.found()
+    json_c = dependency('libjson-c-dev', required: false)
+  endif
+
+  if not json_c.found()
+    error('Unable to find json-c package')
+  endif
+
+  cmake = import('cmake')
+
+  vfiouser_subproj = cmake.subproject('libvfio-user')
+
+  vfiouser_sl = vfiouser_subproj.dependency('vfio-user-static')
+
+  # Although cmake links the json-c library with vfio-user-static
+  # target, that info is not available to meson via cmake.subproject.
+  # As such, we have to separately declare the json-c dependency here.
+  # This appears to be a current limitation of using cmake inside meson.
+  # libvfio-user is planning a switch to meson in the future, which
+  # would address this item automatically.
+  vfiouser = declare_dependency(dependencies: [vfiouser_sl, json_c])
+endif
+
 fdt = not_found
 fdt_opt = get_option('fdt')
 if have_system
@@ -3368,6 +3409,7 @@ summary_info += {'target list':       ' '.join(target_dirs)}
 if have_system
   summary_info += {'default devices':   get_option('default_devices')}
   summary_info += {'out of process emulation': multiprocess_allowed}
+  summary_info += {'vfio-user server': vfio_user_server_allowed}
 endif
 summary(summary_info, bool_yn: true, section: 'Targets and accelerators')
 
diff --git a/.gitlab-ci.d/buildtest.yml b/.gitlab-ci.d/buildtest.yml
index 8f2a3c8f5b..07c36fb15d 100644
--- a/.gitlab-ci.d/buildtest.yml
+++ b/.gitlab-ci.d/buildtest.yml
@@ -42,6 +42,7 @@ build-system-ubuntu:
   variables:
     IMAGE: ubuntu2004
     CONFIGURE_ARGS: --enable-docs --enable-fdt=system --enable-slirp=system
+                    --enable-vfio-user-server
     TARGETS: aarch64-softmmu alpha-softmmu cris-softmmu hppa-softmmu
       microblazeel-softmmu mips64el-softmmu
     MAKE_CHECK_ARGS: check-build
@@ -165,6 +166,7 @@ build-system-centos:
     IMAGE: centos8
     CONFIGURE_ARGS: --disable-nettle --enable-gcrypt --enable-fdt=system
       --enable-modules --enable-trace-backends=dtrace --enable-docs
+      --enable-vfio-user-server
     TARGETS: ppc64-softmmu or1k-softmmu s390x-softmmu
       x86_64-softmmu rx-softmmu sh4-softmmu nios2-softmmu
     MAKE_CHECK_ARGS: check-build
diff --git a/.gitmodules b/.gitmodules
index 84425d87e2..4ae2a165d9 100644
--- a/.gitmodules
+++ b/.gitmodules
@@ -67,3 +67,6 @@
 [submodule "tests/lcitool/libvirt-ci"]
 	path = tests/lcitool/libvirt-ci
 	url = http://gitlab.com/libvirt/libvirt-ci
+[submodule "subprojects/libvfio-user"]
+	path = subprojects/libvfio-user
+	url = https://github.com/nutanix/libvfio-user.git
diff --git a/Kconfig.host b/Kconfig.host
index 60b9c07b5e..f2da8bcf8a 100644
--- a/Kconfig.host
+++ b/Kconfig.host
@@ -45,3 +45,7 @@ config MULTIPROCESS_ALLOWED
 config FUZZ
     bool
     select SPARSE_MEM
+
+config VFIO_USER_SERVER_ALLOWED
+    bool
+    imply VFIO_USER_SERVER
diff --git a/MAINTAINERS b/MAINTAINERS
index 2fd74c4642..8d7bebc74a 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -3486,6 +3486,7 @@ F: hw/remote/proxy-memory-listener.c
 F: include/hw/remote/proxy-memory-listener.h
 F: hw/remote/iohub.c
 F: include/hw/remote/iohub.h
+F: subprojects/libvfio-user
 
 EBPF:
 M: Jason Wang <jasowang@redhat.com>
diff --git a/hw/remote/Kconfig b/hw/remote/Kconfig
index 08c16e235f..2d6b4f4cf4 100644
--- a/hw/remote/Kconfig
+++ b/hw/remote/Kconfig
@@ -2,3 +2,7 @@ config MULTIPROCESS
     bool
     depends on PCI && PCI_EXPRESS && KVM
     select REMOTE_PCIHOST
+
+config VFIO_USER_SERVER
+    bool
+    depends on MULTIPROCESS
diff --git a/hw/remote/meson.build b/hw/remote/meson.build
index e6a5574242..dfea6b533b 100644
--- a/hw/remote/meson.build
+++ b/hw/remote/meson.build
@@ -7,6 +7,8 @@ remote_ss.add(when: 'CONFIG_MULTIPROCESS', if_true: files('remote-obj.c'))
 remote_ss.add(when: 'CONFIG_MULTIPROCESS', if_true: files('proxy.c'))
 remote_ss.add(when: 'CONFIG_MULTIPROCESS', if_true: files('iohub.c'))
 
+remote_ss.add(when: 'CONFIG_VFIO_USER_SERVER', if_true: vfiouser)
+
 specific_ss.add(when: 'CONFIG_MULTIPROCESS', if_true: files('memory.c'))
 specific_ss.add(when: 'CONFIG_MULTIPROCESS', if_true: files('proxy-memory-listener.c'))
 
diff --git a/meson_options.txt b/meson_options.txt
index 921967eddb..7f02794d4b 100644
--- a/meson_options.txt
+++ b/meson_options.txt
@@ -68,6 +68,8 @@ option('multiprocess', type: 'feature', value: 'auto',
        description: 'Out of process device emulation support')
 option('dbus_display', type: 'feature', value: 'auto',
        description: '-display dbus support')
+option('vfio_user_server', type: 'feature', value: 'auto',
+       description: 'vfio-user server support')
 
 option('attr', type : 'feature', value : 'auto',
        description: 'attr/xattr support')
diff --git a/subprojects/libvfio-user b/subprojects/libvfio-user
new file mode 160000
index 0000000000..7056525da5
--- /dev/null
+++ b/subprojects/libvfio-user
@@ -0,0 +1 @@
+Subproject commit 7056525da5399d00831e90bed4aedb4b8442c9b2
diff --git a/tests/docker/dockerfiles/centos8.docker b/tests/docker/dockerfiles/centos8.docker
index cbb909d02b..f8dff989de 100644
--- a/tests/docker/dockerfiles/centos8.docker
+++ b/tests/docker/dockerfiles/centos8.docker
@@ -23,6 +23,7 @@ RUN dnf update -y && \
         capstone-devel \
         ccache \
         clang \
+        cmake \
         ctags \
         cyrus-sasl-devel \
         daxctl-devel \
@@ -45,6 +46,7 @@ RUN dnf update -y && \
         gtk3-devel \
         hostname \
         jemalloc-devel \
+        json-c-devel \
         libaio-devel \
         libasan \
         libattr-devel \
diff --git a/tests/docker/dockerfiles/ubuntu2004.docker b/tests/docker/dockerfiles/ubuntu2004.docker
index 4e562dfdcd..d16a73dec8 100644
--- a/tests/docker/dockerfiles/ubuntu2004.docker
+++ b/tests/docker/dockerfiles/ubuntu2004.docker
@@ -18,6 +18,7 @@ RUN export DEBIAN_FRONTEND=noninteractive && \
             ca-certificates \
             ccache \
             clang \
+            cmake \
             dbus \
             debianutils \
             diffutils \
@@ -57,6 +58,7 @@ RUN export DEBIAN_FRONTEND=noninteractive && \
             libiscsi-dev \
             libjemalloc-dev \
             libjpeg-turbo8-dev \
+            libjson-c-dev \
             liblttng-ust-dev \
             liblzo2-dev \
             libncursesw5-dev \
-- 
2.20.1



^ permalink raw reply related	[flat|nested] 99+ messages in thread

* [PATCH v5 09/18] vfio-user: define vfio-user-server object
  2022-01-19 21:41 [PATCH v5 00/18] vfio-user server in QEMU Jagannathan Raman
                   ` (7 preceding siblings ...)
  2022-01-19 21:41 ` [PATCH v5 08/18] vfio-user: build library Jagannathan Raman
@ 2022-01-19 21:41 ` Jagannathan Raman
  2022-01-25 14:40   ` Stefan Hajnoczi
  2022-01-19 21:41 ` [PATCH v5 10/18] vfio-user: instantiate vfio-user context Jagannathan Raman
                   ` (9 subsequent siblings)
  18 siblings, 1 reply; 99+ messages in thread
From: Jagannathan Raman @ 2022-01-19 21:41 UTC (permalink / raw)
  To: qemu-devel
  Cc: eduardo, elena.ufimtseva, john.g.johnson, berrange, bleal,
	john.levon, mst, armbru, quintela, f4bug, marcandre.lureau,
	stefanha, thanos.makatos, pbonzini, jag.raman, eblake, dgilbert

Define vfio-user object which is remote process server for QEMU. Setup
object initialization functions and properties necessary to instantiate
the object

Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
---
 qapi/qom.json             |  20 +++-
 hw/remote/vfio-user-obj.c | 194 ++++++++++++++++++++++++++++++++++++++
 MAINTAINERS               |   1 +
 hw/remote/meson.build     |   1 +
 hw/remote/trace-events    |   3 +
 5 files changed, 217 insertions(+), 2 deletions(-)
 create mode 100644 hw/remote/vfio-user-obj.c

diff --git a/qapi/qom.json b/qapi/qom.json
index eeb5395ff3..ff266e4732 100644
--- a/qapi/qom.json
+++ b/qapi/qom.json
@@ -703,6 +703,20 @@
 { 'struct': 'RemoteObjectProperties',
   'data': { 'fd': 'str', 'devid': 'str' } }
 
+##
+# @VfioUserServerProperties:
+#
+# Properties for x-vfio-user-server objects.
+#
+# @socket: socket to be used by the libvfiouser library
+#
+# @device: the id of the device to be emulated at the server
+#
+# Since: 6.3
+##
+{ 'struct': 'VfioUserServerProperties',
+  'data': { 'socket': 'SocketAddress', 'device': 'str' } }
+
 ##
 # @RngProperties:
 #
@@ -842,7 +856,8 @@
     'tls-creds-psk',
     'tls-creds-x509',
     'tls-cipher-suites',
-    { 'name': 'x-remote-object', 'features': [ 'unstable' ] }
+    { 'name': 'x-remote-object', 'features': [ 'unstable' ] },
+    { 'name': 'x-vfio-user-server', 'features': [ 'unstable' ] }
   ] }
 
 ##
@@ -905,7 +920,8 @@
       'tls-creds-psk':              'TlsCredsPskProperties',
       'tls-creds-x509':             'TlsCredsX509Properties',
       'tls-cipher-suites':          'TlsCredsProperties',
-      'x-remote-object':            'RemoteObjectProperties'
+      'x-remote-object':            'RemoteObjectProperties',
+      'x-vfio-user-server':         'VfioUserServerProperties'
   } }
 
 ##
diff --git a/hw/remote/vfio-user-obj.c b/hw/remote/vfio-user-obj.c
new file mode 100644
index 0000000000..80757b0029
--- /dev/null
+++ b/hw/remote/vfio-user-obj.c
@@ -0,0 +1,194 @@
+/**
+ * QEMU vfio-user-server server object
+ *
+ * Copyright © 2022 Oracle and/or its affiliates.
+ *
+ * This work is licensed under the terms of the GNU GPL-v2, version 2 or later.
+ *
+ * See the COPYING file in the top-level directory.
+ *
+ */
+
+/**
+ * Usage: add options:
+ *     -machine x-remote
+ *     -device <PCI-device>,id=<pci-dev-id>
+ *     -object x-vfio-user-server,id=<id>,type=unix,path=<socket-path>,
+ *             device=<pci-dev-id>
+ *
+ * Note that x-vfio-user-server object must be used with x-remote machine only.
+ * This server could only support PCI devices for now.
+ *
+ * type - SocketAddress type - presently "unix" alone is supported. Required
+ *        option
+ *
+ * path - named unix socket, it will be created by the server. It is
+ *        a required option
+ *
+ * device - id of a device on the server, a required option. PCI devices
+ *          alone are supported presently.
+ */
+
+#include "qemu/osdep.h"
+#include "qemu-common.h"
+
+#include "qom/object.h"
+#include "qom/object_interfaces.h"
+#include "qemu/error-report.h"
+#include "trace.h"
+#include "sysemu/runstate.h"
+#include "hw/boards.h"
+#include "hw/remote/machine.h"
+#include "qapi/error.h"
+#include "qapi/qapi-visit-sockets.h"
+
+#define TYPE_VFU_OBJECT "x-vfio-user-server"
+OBJECT_DECLARE_TYPE(VfuObject, VfuObjectClass, VFU_OBJECT)
+
+/**
+ * VFU_OBJECT_ERROR - reports an error message. If auto_shutdown
+ * is set, it abort the machine on error. Otherwise, it logs an
+ * error message without aborting.
+ */
+#define VFU_OBJECT_ERROR(o, fmt, ...)                         \
+    {                                                         \
+        VfuObjectClass *oc = VFU_OBJECT_GET_CLASS(OBJECT(o)); \
+                                                              \
+        if (oc->auto_shutdown) {                              \
+            error_setg(&error_abort, (fmt), ## __VA_ARGS__);  \
+        } else {                                              \
+            error_report((fmt), ## __VA_ARGS__);              \
+        }                                                     \
+    }                                                         \
+
+struct VfuObjectClass {
+    ObjectClass parent_class;
+
+    unsigned int nr_devs;
+
+    /*
+     * Can be set to shutdown automatically when all server object
+     * instances are destroyed
+     */
+    bool auto_shutdown;
+};
+
+struct VfuObject {
+    /* private */
+    Object parent;
+
+    SocketAddress *socket;
+
+    char *device;
+
+    Error *err;
+};
+
+static void vfu_object_set_socket(Object *obj, Visitor *v, const char *name,
+                                  void *opaque, Error **errp)
+{
+    VfuObject *o = VFU_OBJECT(obj);
+
+    qapi_free_SocketAddress(o->socket);
+
+    o->socket = NULL;
+
+    visit_type_SocketAddress(v, name, &o->socket, errp);
+
+    if (o->socket->type != SOCKET_ADDRESS_TYPE_UNIX) {
+        qapi_free_SocketAddress(o->socket);
+        o->socket = NULL;
+        error_setg(errp, "vfu: Unsupported socket type - %s",
+                   SocketAddressType_str(o->socket->type));
+        return;
+    }
+
+    trace_vfu_prop("socket", o->socket->u.q_unix.path);
+}
+
+static void vfu_object_set_device(Object *obj, const char *str, Error **errp)
+{
+    VfuObject *o = VFU_OBJECT(obj);
+
+    g_free(o->device);
+
+    o->device = g_strdup(str);
+
+    trace_vfu_prop("device", str);
+}
+
+static void vfu_object_init(Object *obj)
+{
+    VfuObjectClass *k = VFU_OBJECT_GET_CLASS(obj);
+    VfuObject *o = VFU_OBJECT(obj);
+
+    k->nr_devs++;
+
+    if (!object_dynamic_cast(OBJECT(current_machine), TYPE_REMOTE_MACHINE)) {
+        error_setg(&o->err, "vfu: %s only compatible with %s machine",
+                   TYPE_VFU_OBJECT, TYPE_REMOTE_MACHINE);
+        return;
+    }
+}
+
+static void vfu_object_finalize(Object *obj)
+{
+    VfuObjectClass *k = VFU_OBJECT_GET_CLASS(obj);
+    VfuObject *o = VFU_OBJECT(obj);
+
+    k->nr_devs--;
+
+    qapi_free_SocketAddress(o->socket);
+
+    o->socket = NULL;
+
+    g_free(o->device);
+
+    o->device = NULL;
+
+    if (!k->nr_devs && k->auto_shutdown) {
+        qemu_system_shutdown_request(SHUTDOWN_CAUSE_GUEST_SHUTDOWN);
+    }
+}
+
+static void vfu_object_class_init(ObjectClass *klass, void *data)
+{
+    VfuObjectClass *k = VFU_OBJECT_CLASS(klass);
+
+    k->nr_devs = 0;
+
+    k->auto_shutdown = true;
+
+    object_class_property_add(klass, "socket", "SocketAddress", NULL,
+                              vfu_object_set_socket, NULL, NULL);
+    object_class_property_set_description(klass, "socket",
+                                          "SocketAddress "
+                                          "(ex: type=unix,path=/tmp/sock). "
+                                          "Only UNIX is presently supported");
+    object_class_property_add_str(klass, "device", NULL,
+                                  vfu_object_set_device);
+    object_class_property_set_description(klass, "device",
+                                          "device ID - only PCI devices "
+                                          "are presently supported");
+}
+
+static const TypeInfo vfu_object_info = {
+    .name = TYPE_VFU_OBJECT,
+    .parent = TYPE_OBJECT,
+    .instance_size = sizeof(VfuObject),
+    .instance_init = vfu_object_init,
+    .instance_finalize = vfu_object_finalize,
+    .class_size = sizeof(VfuObjectClass),
+    .class_init = vfu_object_class_init,
+    .interfaces = (InterfaceInfo[]) {
+        { TYPE_USER_CREATABLE },
+        { }
+    }
+};
+
+static void vfu_register_types(void)
+{
+    type_register_static(&vfu_object_info);
+}
+
+type_init(vfu_register_types);
diff --git a/MAINTAINERS b/MAINTAINERS
index 8d7bebc74a..93bce3fa62 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -3487,6 +3487,7 @@ F: include/hw/remote/proxy-memory-listener.h
 F: hw/remote/iohub.c
 F: include/hw/remote/iohub.h
 F: subprojects/libvfio-user
+F: hw/remote/vfio-user-obj.c
 
 EBPF:
 M: Jason Wang <jasowang@redhat.com>
diff --git a/hw/remote/meson.build b/hw/remote/meson.build
index dfea6b533b..534ac5df79 100644
--- a/hw/remote/meson.build
+++ b/hw/remote/meson.build
@@ -6,6 +6,7 @@ remote_ss.add(when: 'CONFIG_MULTIPROCESS', if_true: files('message.c'))
 remote_ss.add(when: 'CONFIG_MULTIPROCESS', if_true: files('remote-obj.c'))
 remote_ss.add(when: 'CONFIG_MULTIPROCESS', if_true: files('proxy.c'))
 remote_ss.add(when: 'CONFIG_MULTIPROCESS', if_true: files('iohub.c'))
+remote_ss.add(when: 'CONFIG_VFIO_USER_SERVER', if_true: files('vfio-user-obj.c'))
 
 remote_ss.add(when: 'CONFIG_VFIO_USER_SERVER', if_true: vfiouser)
 
diff --git a/hw/remote/trace-events b/hw/remote/trace-events
index 0b23974f90..7da12f0d96 100644
--- a/hw/remote/trace-events
+++ b/hw/remote/trace-events
@@ -2,3 +2,6 @@
 
 mpqemu_send_io_error(int cmd, int size, int nfds) "send command %d size %d, %d file descriptors to remote process"
 mpqemu_recv_io_error(int cmd, int size, int nfds) "failed to receive %d size %d, %d file descriptors to remote process"
+
+# vfio-user-obj.c
+vfu_prop(const char *prop, const char *val) "vfu: setting %s as %s"
-- 
2.20.1



^ permalink raw reply related	[flat|nested] 99+ messages in thread

* [PATCH v5 10/18] vfio-user: instantiate vfio-user context
  2022-01-19 21:41 [PATCH v5 00/18] vfio-user server in QEMU Jagannathan Raman
                   ` (8 preceding siblings ...)
  2022-01-19 21:41 ` [PATCH v5 09/18] vfio-user: define vfio-user-server object Jagannathan Raman
@ 2022-01-19 21:41 ` Jagannathan Raman
  2022-01-25 14:44   ` Stefan Hajnoczi
  2022-01-19 21:42 ` [PATCH v5 11/18] vfio-user: find and init PCI device Jagannathan Raman
                   ` (8 subsequent siblings)
  18 siblings, 1 reply; 99+ messages in thread
From: Jagannathan Raman @ 2022-01-19 21:41 UTC (permalink / raw)
  To: qemu-devel
  Cc: eduardo, elena.ufimtseva, john.g.johnson, berrange, bleal,
	john.levon, mst, armbru, quintela, f4bug, marcandre.lureau,
	stefanha, thanos.makatos, pbonzini, jag.raman, eblake, dgilbert

create a context with the vfio-user library to run a PCI device

Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
---
 hw/remote/vfio-user-obj.c | 78 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 78 insertions(+)

diff --git a/hw/remote/vfio-user-obj.c b/hw/remote/vfio-user-obj.c
index 80757b0029..810a7c3943 100644
--- a/hw/remote/vfio-user-obj.c
+++ b/hw/remote/vfio-user-obj.c
@@ -41,6 +41,9 @@
 #include "hw/remote/machine.h"
 #include "qapi/error.h"
 #include "qapi/qapi-visit-sockets.h"
+#include "qemu/notify.h"
+#include "sysemu/sysemu.h"
+#include "libvfio-user.h"
 
 #define TYPE_VFU_OBJECT "x-vfio-user-server"
 OBJECT_DECLARE_TYPE(VfuObject, VfuObjectClass, VFU_OBJECT)
@@ -82,13 +85,23 @@ struct VfuObject {
     char *device;
 
     Error *err;
+
+    Notifier machine_done;
+
+    vfu_ctx_t *vfu_ctx;
 };
 
+static void vfu_object_init_ctx(VfuObject *o, Error **errp);
+
 static void vfu_object_set_socket(Object *obj, Visitor *v, const char *name,
                                   void *opaque, Error **errp)
 {
     VfuObject *o = VFU_OBJECT(obj);
 
+    if (o->vfu_ctx) {
+        return;
+    }
+
     qapi_free_SocketAddress(o->socket);
 
     o->socket = NULL;
@@ -104,17 +117,68 @@ static void vfu_object_set_socket(Object *obj, Visitor *v, const char *name,
     }
 
     trace_vfu_prop("socket", o->socket->u.q_unix.path);
+
+    vfu_object_init_ctx(o, errp);
 }
 
 static void vfu_object_set_device(Object *obj, const char *str, Error **errp)
 {
     VfuObject *o = VFU_OBJECT(obj);
 
+    if (o->vfu_ctx) {
+        return;
+    }
+
     g_free(o->device);
 
     o->device = g_strdup(str);
 
     trace_vfu_prop("device", str);
+
+    vfu_object_init_ctx(o, errp);
+}
+
+/*
+ * TYPE_VFU_OBJECT depends on the availability of the 'socket' and 'device'
+ * properties. It also depends on devices instantiated in QEMU. These
+ * dependencies are not available during the instance_init phase of this
+ * object's life-cycle. As such, the server is initialized after the
+ * machine is setup. machine_init_done_notifier notifies TYPE_VFU_OBJECT
+ * when the machine is setup, and the dependencies are available.
+ */
+static void vfu_object_machine_done(Notifier *notifier, void *data)
+{
+    VfuObject *o = container_of(notifier, VfuObject, machine_done);
+    Error *err = NULL;
+
+    vfu_object_init_ctx(o, &err);
+
+    if (err) {
+        error_propagate(&error_abort, err);
+    }
+}
+
+static void vfu_object_init_ctx(VfuObject *o, Error **errp)
+{
+    ERRP_GUARD();
+
+    if (o->vfu_ctx || !o->socket || !o->device ||
+            !phase_check(PHASE_MACHINE_READY)) {
+        return;
+    }
+
+    if (o->err) {
+        error_propagate(errp, o->err);
+        o->err = NULL;
+        return;
+    }
+
+    o->vfu_ctx = vfu_create_ctx(VFU_TRANS_SOCK, o->socket->u.q_unix.path, 0,
+                                o, VFU_DEV_TYPE_PCI);
+    if (o->vfu_ctx == NULL) {
+        error_setg(errp, "vfu: Failed to create context - %s", strerror(errno));
+        return;
+    }
 }
 
 static void vfu_object_init(Object *obj)
@@ -124,6 +188,11 @@ static void vfu_object_init(Object *obj)
 
     k->nr_devs++;
 
+    if (!phase_check(PHASE_MACHINE_READY)) {
+        o->machine_done.notify = vfu_object_machine_done;
+        qemu_add_machine_init_done_notifier(&o->machine_done);
+    }
+
     if (!object_dynamic_cast(OBJECT(current_machine), TYPE_REMOTE_MACHINE)) {
         error_setg(&o->err, "vfu: %s only compatible with %s machine",
                    TYPE_VFU_OBJECT, TYPE_REMOTE_MACHINE);
@@ -142,6 +211,10 @@ static void vfu_object_finalize(Object *obj)
 
     o->socket = NULL;
 
+    if (o->vfu_ctx) {
+        vfu_destroy_ctx(o->vfu_ctx);
+    }
+
     g_free(o->device);
 
     o->device = NULL;
@@ -149,6 +222,11 @@ static void vfu_object_finalize(Object *obj)
     if (!k->nr_devs && k->auto_shutdown) {
         qemu_system_shutdown_request(SHUTDOWN_CAUSE_GUEST_SHUTDOWN);
     }
+
+    if (o->machine_done.notify) {
+        qemu_remove_machine_init_done_notifier(&o->machine_done);
+        o->machine_done.notify = NULL;
+    }
 }
 
 static void vfu_object_class_init(ObjectClass *klass, void *data)
-- 
2.20.1



^ permalink raw reply related	[flat|nested] 99+ messages in thread

* [PATCH v5 11/18] vfio-user: find and init PCI device
  2022-01-19 21:41 [PATCH v5 00/18] vfio-user server in QEMU Jagannathan Raman
                   ` (9 preceding siblings ...)
  2022-01-19 21:41 ` [PATCH v5 10/18] vfio-user: instantiate vfio-user context Jagannathan Raman
@ 2022-01-19 21:42 ` Jagannathan Raman
  2022-01-25 14:48   ` Stefan Hajnoczi
  2022-01-19 21:42 ` [PATCH v5 12/18] vfio-user: run vfio-user context Jagannathan Raman
                   ` (7 subsequent siblings)
  18 siblings, 1 reply; 99+ messages in thread
From: Jagannathan Raman @ 2022-01-19 21:42 UTC (permalink / raw)
  To: qemu-devel
  Cc: eduardo, elena.ufimtseva, john.g.johnson, berrange, bleal,
	john.levon, mst, armbru, quintela, f4bug, marcandre.lureau,
	stefanha, thanos.makatos, pbonzini, jag.raman, eblake, dgilbert

Find the PCI device with specified id. Initialize the device context
with the QEMU PCI device

Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
---
 hw/remote/vfio-user-obj.c | 60 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 60 insertions(+)

diff --git a/hw/remote/vfio-user-obj.c b/hw/remote/vfio-user-obj.c
index 810a7c3943..10db78eb8d 100644
--- a/hw/remote/vfio-user-obj.c
+++ b/hw/remote/vfio-user-obj.c
@@ -44,6 +44,8 @@
 #include "qemu/notify.h"
 #include "sysemu/sysemu.h"
 #include "libvfio-user.h"
+#include "hw/qdev-core.h"
+#include "hw/pci/pci.h"
 
 #define TYPE_VFU_OBJECT "x-vfio-user-server"
 OBJECT_DECLARE_TYPE(VfuObject, VfuObjectClass, VFU_OBJECT)
@@ -89,6 +91,10 @@ struct VfuObject {
     Notifier machine_done;
 
     vfu_ctx_t *vfu_ctx;
+
+    PCIDevice *pci_dev;
+
+    Error *unplug_blocker;
 };
 
 static void vfu_object_init_ctx(VfuObject *o, Error **errp);
@@ -161,6 +167,9 @@ static void vfu_object_machine_done(Notifier *notifier, void *data)
 static void vfu_object_init_ctx(VfuObject *o, Error **errp)
 {
     ERRP_GUARD();
+    DeviceState *dev = NULL;
+    vfu_pci_type_t pci_type = VFU_PCI_TYPE_CONVENTIONAL;
+    int ret;
 
     if (o->vfu_ctx || !o->socket || !o->device ||
             !phase_check(PHASE_MACHINE_READY)) {
@@ -179,6 +188,49 @@ static void vfu_object_init_ctx(VfuObject *o, Error **errp)
         error_setg(errp, "vfu: Failed to create context - %s", strerror(errno));
         return;
     }
+
+    dev = qdev_find_recursive(sysbus_get_default(), o->device);
+    if (dev == NULL) {
+        error_setg(errp, "vfu: Device %s not found", o->device);
+        goto fail;
+    }
+
+    if (!object_dynamic_cast(OBJECT(dev), TYPE_PCI_DEVICE)) {
+        error_setg(errp, "vfu: %s not a PCI device", o->device);
+        goto fail;
+    }
+
+    o->pci_dev = PCI_DEVICE(dev);
+
+    if (pci_is_express(o->pci_dev)) {
+        pci_type = VFU_PCI_TYPE_EXPRESS;
+    }
+
+    ret = vfu_pci_init(o->vfu_ctx, pci_type, PCI_HEADER_TYPE_NORMAL, 0);
+    if (ret < 0) {
+        error_setg(errp,
+                   "vfu: Failed to attach PCI device %s to context - %s",
+                   o->device, strerror(errno));
+        goto fail;
+    }
+
+    error_setg(&o->unplug_blocker, "%s is in use", o->device);
+    qdev_add_unplug_blocker(DEVICE(o->pci_dev), o->unplug_blocker, errp);
+    if (*errp) {
+        goto fail;
+    }
+
+    return;
+
+fail:
+    vfu_destroy_ctx(o->vfu_ctx);
+    if (o->unplug_blocker && o->pci_dev) {
+        qdev_del_unplug_blocker(DEVICE(o->pci_dev), o->unplug_blocker);
+        error_free(o->unplug_blocker);
+        o->unplug_blocker = NULL;
+    }
+    o->vfu_ctx = NULL;
+    o->pci_dev = NULL;
 }
 
 static void vfu_object_init(Object *obj)
@@ -219,6 +271,14 @@ static void vfu_object_finalize(Object *obj)
 
     o->device = NULL;
 
+    if (o->unplug_blocker && o->pci_dev) {
+        qdev_del_unplug_blocker(DEVICE(o->pci_dev), o->unplug_blocker);
+        error_free(o->unplug_blocker);
+        o->unplug_blocker = NULL;
+    }
+
+    o->pci_dev = NULL;
+
     if (!k->nr_devs && k->auto_shutdown) {
         qemu_system_shutdown_request(SHUTDOWN_CAUSE_GUEST_SHUTDOWN);
     }
-- 
2.20.1



^ permalink raw reply related	[flat|nested] 99+ messages in thread

* [PATCH v5 12/18] vfio-user: run vfio-user context
  2022-01-19 21:41 [PATCH v5 00/18] vfio-user server in QEMU Jagannathan Raman
                   ` (10 preceding siblings ...)
  2022-01-19 21:42 ` [PATCH v5 11/18] vfio-user: find and init PCI device Jagannathan Raman
@ 2022-01-19 21:42 ` Jagannathan Raman
  2022-01-25 15:10   ` Stefan Hajnoczi
  2022-01-19 21:42 ` [PATCH v5 13/18] vfio-user: handle PCI config space accesses Jagannathan Raman
                   ` (6 subsequent siblings)
  18 siblings, 1 reply; 99+ messages in thread
From: Jagannathan Raman @ 2022-01-19 21:42 UTC (permalink / raw)
  To: qemu-devel
  Cc: eduardo, elena.ufimtseva, john.g.johnson, berrange, bleal,
	john.levon, mst, armbru, quintela, f4bug, marcandre.lureau,
	stefanha, thanos.makatos, pbonzini, jag.raman, eblake, dgilbert

Setup a handler to run vfio-user context. The context is driven by
messages to the file descriptor associated with it - get the fd for
the context and hook up the handler with it

Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
---
 qapi/misc.json            | 23 ++++++++++
 hw/remote/vfio-user-obj.c | 90 ++++++++++++++++++++++++++++++++++++++-
 2 files changed, 112 insertions(+), 1 deletion(-)

diff --git a/qapi/misc.json b/qapi/misc.json
index e8054f415b..f0791d3311 100644
--- a/qapi/misc.json
+++ b/qapi/misc.json
@@ -527,3 +527,26 @@
  'data': { '*option': 'str' },
  'returns': ['CommandLineOptionInfo'],
  'allow-preconfig': true }
+
+##
+# @VFU_CLIENT_HANGUP:
+#
+# Emitted when the client of a TYPE_VFIO_USER_SERVER closes the
+# communication channel
+#
+# @device: ID of attached PCI device
+#
+# @path: path of the socket
+#
+# Since: 6.3
+#
+# Example:
+#
+# <- { "event": "VFU_CLIENT_HANGUP",
+#      "data": { "device": "lsi1",
+#                "path": "/tmp/vfu1-sock" },
+#      "timestamp": { "seconds": 1265044230, "microseconds": 450486 } }
+#
+##
+{ 'event': 'VFU_CLIENT_HANGUP',
+  'data': { 'device': 'str', 'path': 'str' } }
diff --git a/hw/remote/vfio-user-obj.c b/hw/remote/vfio-user-obj.c
index 10db78eb8d..91d49a221f 100644
--- a/hw/remote/vfio-user-obj.c
+++ b/hw/remote/vfio-user-obj.c
@@ -27,6 +27,9 @@
  *
  * device - id of a device on the server, a required option. PCI devices
  *          alone are supported presently.
+ *
+ * notes - x-vfio-user-server could block IO and monitor during the
+ *         initialization phase.
  */
 
 #include "qemu/osdep.h"
@@ -41,11 +44,14 @@
 #include "hw/remote/machine.h"
 #include "qapi/error.h"
 #include "qapi/qapi-visit-sockets.h"
+#include "qapi/qapi-events-misc.h"
 #include "qemu/notify.h"
+#include "qemu/thread.h"
 #include "sysemu/sysemu.h"
 #include "libvfio-user.h"
 #include "hw/qdev-core.h"
 #include "hw/pci/pci.h"
+#include "qemu/timer.h"
 
 #define TYPE_VFU_OBJECT "x-vfio-user-server"
 OBJECT_DECLARE_TYPE(VfuObject, VfuObjectClass, VFU_OBJECT)
@@ -95,6 +101,8 @@ struct VfuObject {
     PCIDevice *pci_dev;
 
     Error *unplug_blocker;
+
+    int vfu_poll_fd;
 };
 
 static void vfu_object_init_ctx(VfuObject *o, Error **errp);
@@ -144,6 +152,68 @@ static void vfu_object_set_device(Object *obj, const char *str, Error **errp)
     vfu_object_init_ctx(o, errp);
 }
 
+static void vfu_object_ctx_run(void *opaque)
+{
+    VfuObject *o = opaque;
+    int ret = -1;
+
+    while (ret != 0) {
+        ret = vfu_run_ctx(o->vfu_ctx);
+        if (ret < 0) {
+            if (errno == EINTR) {
+                continue;
+            } else if (errno == ENOTCONN) {
+                qapi_event_send_vfu_client_hangup(o->device,
+                                                  o->socket->u.q_unix.path);
+                qemu_set_fd_handler(o->vfu_poll_fd, NULL, NULL, NULL);
+                o->vfu_poll_fd = -1;
+                object_unparent(OBJECT(o));
+                break;
+            } else {
+                VFU_OBJECT_ERROR(o, "vfu: Failed to run device %s - %s",
+                                 o->device, strerror(errno));
+                break;
+            }
+        }
+    }
+}
+
+static void vfu_object_attach_ctx(void *opaque)
+{
+    VfuObject *o = opaque;
+    GPollFD pfds[1];
+    int ret;
+
+    qemu_set_fd_handler(o->vfu_poll_fd, NULL, NULL, NULL);
+
+    pfds[0].fd = o->vfu_poll_fd;
+    pfds[0].events = G_IO_IN | G_IO_HUP | G_IO_ERR;
+
+retry_attach:
+    ret = vfu_attach_ctx(o->vfu_ctx);
+    if (ret < 0 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
+        /**
+         * vfu_object_attach_ctx can block QEMU's main loop
+         * during attach - the monitor and other IO
+         * could be unresponsive during this time.
+         */
+        qemu_poll_ns(pfds, 1, 500 * (int64_t)SCALE_MS);
+        goto retry_attach;
+    } else if (ret < 0) {
+        VFU_OBJECT_ERROR(o, "vfu: Failed to attach device %s to context - %s",
+                         o->device, strerror(errno));
+        return;
+    }
+
+    o->vfu_poll_fd = vfu_get_poll_fd(o->vfu_ctx);
+    if (o->vfu_poll_fd < 0) {
+        VFU_OBJECT_ERROR(o, "vfu: Failed to get poll fd %s", o->device);
+        return;
+    }
+
+    qemu_set_fd_handler(o->vfu_poll_fd, vfu_object_ctx_run, NULL, o);
+}
+
 /*
  * TYPE_VFU_OBJECT depends on the availability of the 'socket' and 'device'
  * properties. It also depends on devices instantiated in QEMU. These
@@ -182,7 +252,8 @@ static void vfu_object_init_ctx(VfuObject *o, Error **errp)
         return;
     }
 
-    o->vfu_ctx = vfu_create_ctx(VFU_TRANS_SOCK, o->socket->u.q_unix.path, 0,
+    o->vfu_ctx = vfu_create_ctx(VFU_TRANS_SOCK, o->socket->u.q_unix.path,
+                                LIBVFIO_USER_FLAG_ATTACH_NB,
                                 o, VFU_DEV_TYPE_PCI);
     if (o->vfu_ctx == NULL) {
         error_setg(errp, "vfu: Failed to create context - %s", strerror(errno));
@@ -220,6 +291,21 @@ static void vfu_object_init_ctx(VfuObject *o, Error **errp)
         goto fail;
     }
 
+    ret = vfu_realize_ctx(o->vfu_ctx);
+    if (ret < 0) {
+        error_setg(errp, "vfu: Failed to realize device %s- %s",
+                   o->device, strerror(errno));
+        goto fail;
+    }
+
+    o->vfu_poll_fd = vfu_get_poll_fd(o->vfu_ctx);
+    if (o->vfu_poll_fd < 0) {
+        error_setg(errp, "vfu: Failed to get poll fd %s", o->device);
+        goto fail;
+    }
+
+    qemu_set_fd_handler(o->vfu_poll_fd, vfu_object_attach_ctx, NULL, o);
+
     return;
 
 fail:
@@ -250,6 +336,8 @@ static void vfu_object_init(Object *obj)
                    TYPE_VFU_OBJECT, TYPE_REMOTE_MACHINE);
         return;
     }
+
+    o->vfu_poll_fd = -1;
 }
 
 static void vfu_object_finalize(Object *obj)
-- 
2.20.1



^ permalink raw reply related	[flat|nested] 99+ messages in thread

* [PATCH v5 13/18] vfio-user: handle PCI config space accesses
  2022-01-19 21:41 [PATCH v5 00/18] vfio-user server in QEMU Jagannathan Raman
                   ` (11 preceding siblings ...)
  2022-01-19 21:42 ` [PATCH v5 12/18] vfio-user: run vfio-user context Jagannathan Raman
@ 2022-01-19 21:42 ` Jagannathan Raman
  2022-01-25 15:13   ` Stefan Hajnoczi
  2022-01-19 21:42 ` [PATCH v5 14/18] vfio-user: handle DMA mappings Jagannathan Raman
                   ` (5 subsequent siblings)
  18 siblings, 1 reply; 99+ messages in thread
From: Jagannathan Raman @ 2022-01-19 21:42 UTC (permalink / raw)
  To: qemu-devel
  Cc: eduardo, elena.ufimtseva, john.g.johnson, berrange, bleal,
	john.levon, mst, armbru, quintela, f4bug, marcandre.lureau,
	stefanha, thanos.makatos, pbonzini, jag.raman, eblake, dgilbert

Define and register handlers for PCI config space accesses

Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
---
 hw/remote/vfio-user-obj.c | 45 +++++++++++++++++++++++++++++++++++++++
 hw/remote/trace-events    |  2 ++
 2 files changed, 47 insertions(+)

diff --git a/hw/remote/vfio-user-obj.c b/hw/remote/vfio-user-obj.c
index 91d49a221f..8951617545 100644
--- a/hw/remote/vfio-user-obj.c
+++ b/hw/remote/vfio-user-obj.c
@@ -47,6 +47,7 @@
 #include "qapi/qapi-events-misc.h"
 #include "qemu/notify.h"
 #include "qemu/thread.h"
+#include "qemu/main-loop.h"
 #include "sysemu/sysemu.h"
 #include "libvfio-user.h"
 #include "hw/qdev-core.h"
@@ -214,6 +215,39 @@ retry_attach:
     qemu_set_fd_handler(o->vfu_poll_fd, vfu_object_ctx_run, NULL, o);
 }
 
+static ssize_t vfu_object_cfg_access(vfu_ctx_t *vfu_ctx, char * const buf,
+                                     size_t count, loff_t offset,
+                                     const bool is_write)
+{
+    VfuObject *o = vfu_get_private(vfu_ctx);
+    uint32_t pci_access_width = sizeof(uint32_t);
+    size_t bytes = count;
+    uint32_t val = 0;
+    char *ptr = buf;
+    int len;
+
+    while (bytes > 0) {
+        len = (bytes > pci_access_width) ? pci_access_width : bytes;
+        if (is_write) {
+            memcpy(&val, ptr, len);
+            pci_host_config_write_common(o->pci_dev, offset,
+                                         pci_config_size(o->pci_dev),
+                                         val, len);
+            trace_vfu_cfg_write(offset, val);
+        } else {
+            val = pci_host_config_read_common(o->pci_dev, offset,
+                                              pci_config_size(o->pci_dev), len);
+            memcpy(ptr, &val, len);
+            trace_vfu_cfg_read(offset, val);
+        }
+        offset += len;
+        ptr += len;
+        bytes -= len;
+    }
+
+    return count;
+}
+
 /*
  * TYPE_VFU_OBJECT depends on the availability of the 'socket' and 'device'
  * properties. It also depends on devices instantiated in QEMU. These
@@ -291,6 +325,17 @@ static void vfu_object_init_ctx(VfuObject *o, Error **errp)
         goto fail;
     }
 
+    ret = vfu_setup_region(o->vfu_ctx, VFU_PCI_DEV_CFG_REGION_IDX,
+                           pci_config_size(o->pci_dev), &vfu_object_cfg_access,
+                           VFU_REGION_FLAG_RW | VFU_REGION_FLAG_ALWAYS_CB,
+                           NULL, 0, -1, 0);
+    if (ret < 0) {
+        error_setg(errp,
+                   "vfu: Failed to setup config space handlers for %s- %s",
+                   o->device, strerror(errno));
+        goto fail;
+    }
+
     ret = vfu_realize_ctx(o->vfu_ctx);
     if (ret < 0) {
         error_setg(errp, "vfu: Failed to realize device %s- %s",
diff --git a/hw/remote/trace-events b/hw/remote/trace-events
index 7da12f0d96..2ef7884346 100644
--- a/hw/remote/trace-events
+++ b/hw/remote/trace-events
@@ -5,3 +5,5 @@ mpqemu_recv_io_error(int cmd, int size, int nfds) "failed to receive %d size %d,
 
 # vfio-user-obj.c
 vfu_prop(const char *prop, const char *val) "vfu: setting %s as %s"
+vfu_cfg_read(uint32_t offset, uint32_t val) "vfu: cfg: 0x%u -> 0x%x"
+vfu_cfg_write(uint32_t offset, uint32_t val) "vfu: cfg: 0x%u <- 0x%x"
-- 
2.20.1



^ permalink raw reply related	[flat|nested] 99+ messages in thread

* [PATCH v5 14/18] vfio-user: handle DMA mappings
  2022-01-19 21:41 [PATCH v5 00/18] vfio-user server in QEMU Jagannathan Raman
                   ` (12 preceding siblings ...)
  2022-01-19 21:42 ` [PATCH v5 13/18] vfio-user: handle PCI config space accesses Jagannathan Raman
@ 2022-01-19 21:42 ` Jagannathan Raman
  2022-01-19 21:42 ` [PATCH v5 15/18] vfio-user: handle PCI BAR accesses Jagannathan Raman
                   ` (4 subsequent siblings)
  18 siblings, 0 replies; 99+ messages in thread
From: Jagannathan Raman @ 2022-01-19 21:42 UTC (permalink / raw)
  To: qemu-devel
  Cc: eduardo, elena.ufimtseva, john.g.johnson, berrange, bleal,
	john.levon, mst, armbru, quintela, f4bug, marcandre.lureau,
	stefanha, thanos.makatos, pbonzini, jag.raman, eblake, dgilbert

Define and register callbacks to manage the RAM regions used for
device DMA

Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 hw/remote/vfio-user-obj.c | 50 +++++++++++++++++++++++++++++++++++++++
 hw/remote/trace-events    |  2 ++
 2 files changed, 52 insertions(+)

diff --git a/hw/remote/vfio-user-obj.c b/hw/remote/vfio-user-obj.c
index 8951617545..e690f1eaae 100644
--- a/hw/remote/vfio-user-obj.c
+++ b/hw/remote/vfio-user-obj.c
@@ -248,6 +248,49 @@ static ssize_t vfu_object_cfg_access(vfu_ctx_t *vfu_ctx, char * const buf,
     return count;
 }
 
+static void dma_register(vfu_ctx_t *vfu_ctx, vfu_dma_info_t *info)
+{
+    VfuObject *o = vfu_get_private(vfu_ctx);
+    MemoryRegion *subregion = NULL;
+    g_autofree char *name = NULL;
+    struct iovec *iov = &info->iova;
+
+    if (!info->vaddr) {
+        return;
+    }
+
+    name = g_strdup_printf("mem-%s-%"PRIx64"", o->device,
+                           (uint64_t)info->vaddr);
+
+    subregion = g_new0(MemoryRegion, 1);
+
+    memory_region_init_ram_ptr(subregion, NULL, name,
+                               iov->iov_len, info->vaddr);
+
+    memory_region_add_subregion(pci_address_space(o->pci_dev),
+                                (hwaddr)iov->iov_base, subregion);
+
+    trace_vfu_dma_register((uint64_t)iov->iov_base, iov->iov_len);
+}
+
+static void dma_unregister(vfu_ctx_t *vfu_ctx, vfu_dma_info_t *info)
+{
+    VfuObject *o = vfu_get_private(vfu_ctx);
+    MemoryRegion *mr = NULL;
+    ram_addr_t offset;
+
+    mr = memory_region_from_host(info->vaddr, &offset);
+    if (!mr) {
+        return;
+    }
+
+    memory_region_del_subregion(pci_address_space(o->pci_dev), mr);
+
+    object_unparent((OBJECT(mr)));
+
+    trace_vfu_dma_unregister((uint64_t)info->iova.iov_base);
+}
+
 /*
  * TYPE_VFU_OBJECT depends on the availability of the 'socket' and 'device'
  * properties. It also depends on devices instantiated in QEMU. These
@@ -336,6 +379,13 @@ static void vfu_object_init_ctx(VfuObject *o, Error **errp)
         goto fail;
     }
 
+    ret = vfu_setup_device_dma(o->vfu_ctx, &dma_register, &dma_unregister);
+    if (ret < 0) {
+        error_setg(errp, "vfu: Failed to setup DMA handlers for %s",
+                   o->device);
+        goto fail;
+    }
+
     ret = vfu_realize_ctx(o->vfu_ctx);
     if (ret < 0) {
         error_setg(errp, "vfu: Failed to realize device %s- %s",
diff --git a/hw/remote/trace-events b/hw/remote/trace-events
index 2ef7884346..f945c7e33b 100644
--- a/hw/remote/trace-events
+++ b/hw/remote/trace-events
@@ -7,3 +7,5 @@ mpqemu_recv_io_error(int cmd, int size, int nfds) "failed to receive %d size %d,
 vfu_prop(const char *prop, const char *val) "vfu: setting %s as %s"
 vfu_cfg_read(uint32_t offset, uint32_t val) "vfu: cfg: 0x%u -> 0x%x"
 vfu_cfg_write(uint32_t offset, uint32_t val) "vfu: cfg: 0x%u <- 0x%x"
+vfu_dma_register(uint64_t gpa, size_t len) "vfu: registering GPA 0x%"PRIx64", %zu bytes"
+vfu_dma_unregister(uint64_t gpa) "vfu: unregistering GPA 0x%"PRIx64""
-- 
2.20.1



^ permalink raw reply related	[flat|nested] 99+ messages in thread

* [PATCH v5 15/18] vfio-user: handle PCI BAR accesses
  2022-01-19 21:41 [PATCH v5 00/18] vfio-user server in QEMU Jagannathan Raman
                   ` (13 preceding siblings ...)
  2022-01-19 21:42 ` [PATCH v5 14/18] vfio-user: handle DMA mappings Jagannathan Raman
@ 2022-01-19 21:42 ` Jagannathan Raman
  2022-01-19 21:42 ` [PATCH v5 16/18] vfio-user: handle device interrupts Jagannathan Raman
                   ` (3 subsequent siblings)
  18 siblings, 0 replies; 99+ messages in thread
From: Jagannathan Raman @ 2022-01-19 21:42 UTC (permalink / raw)
  To: qemu-devel
  Cc: eduardo, elena.ufimtseva, john.g.johnson, berrange, bleal,
	john.levon, mst, armbru, quintela, f4bug, marcandre.lureau,
	stefanha, thanos.makatos, pbonzini, jag.raman, eblake, dgilbert

Determine the BARs used by the PCI device and register handlers to
manage the access to the same.

Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 hw/remote/vfio-user-obj.c | 92 +++++++++++++++++++++++++++++++++++++++
 hw/remote/trace-events    |  3 ++
 2 files changed, 95 insertions(+)

diff --git a/hw/remote/vfio-user-obj.c b/hw/remote/vfio-user-obj.c
index e690f1eaae..bf88eac8f1 100644
--- a/hw/remote/vfio-user-obj.c
+++ b/hw/remote/vfio-user-obj.c
@@ -291,6 +291,96 @@ static void dma_unregister(vfu_ctx_t *vfu_ctx, vfu_dma_info_t *info)
     trace_vfu_dma_unregister((uint64_t)info->iova.iov_base);
 }
 
+static ssize_t vfu_object_bar_rw(PCIDevice *pci_dev, hwaddr addr, size_t count,
+                                 char * const buf, const bool is_write,
+                                 bool is_io)
+{
+    AddressSpace *as = NULL;
+    MemTxResult res;
+
+    if (is_io) {
+        as = pci_isol_as_io(pci_dev);
+        as = as ? as : &address_space_io;
+    } else {
+        as = pci_isol_as_mem(pci_dev);
+        as = as ? as : &address_space_memory;
+    }
+
+    trace_vfu_bar_rw_enter(is_write ? "Write" : "Read", (uint64_t)addr);
+
+    res = address_space_rw(as, addr, MEMTXATTRS_UNSPECIFIED, (void *)buf,
+                           (hwaddr)count, is_write);
+    if (res != MEMTX_OK) {
+        warn_report("vfu: failed to %s 0x%"PRIx64"",
+                    is_write ? "write to" : "read from",
+                    addr);
+        return -1;
+    }
+
+    trace_vfu_bar_rw_exit(is_write ? "Write" : "Read", (uint64_t)addr);
+
+    return count;
+}
+
+/**
+ * VFU_OBJECT_BAR_HANDLER - macro for defining handlers for PCI BARs.
+ *
+ * To create handler for BAR number 2, VFU_OBJECT_BAR_HANDLER(2) would
+ * define vfu_object_bar2_handler
+ */
+#define VFU_OBJECT_BAR_HANDLER(BAR_NO)                                         \
+    static ssize_t vfu_object_bar##BAR_NO##_handler(vfu_ctx_t *vfu_ctx,        \
+                                        char * const buf, size_t count,        \
+                                        loff_t offset, const bool is_write)    \
+    {                                                                          \
+        VfuObject *o = vfu_get_private(vfu_ctx);                               \
+        PCIDevice *pci_dev = o->pci_dev;                                       \
+        hwaddr addr = (hwaddr)(pci_get_bar_addr(pci_dev, BAR_NO) + offset);    \
+        bool is_io = !!(pci_dev->io_regions[BAR_NO].type &                     \
+                        PCI_BASE_ADDRESS_SPACE);                               \
+                                                                               \
+        return vfu_object_bar_rw(pci_dev, addr, count, buf, is_write, is_io);  \
+    }                                                                          \
+
+VFU_OBJECT_BAR_HANDLER(0)
+VFU_OBJECT_BAR_HANDLER(1)
+VFU_OBJECT_BAR_HANDLER(2)
+VFU_OBJECT_BAR_HANDLER(3)
+VFU_OBJECT_BAR_HANDLER(4)
+VFU_OBJECT_BAR_HANDLER(5)
+
+static vfu_region_access_cb_t *vfu_object_bar_handlers[PCI_NUM_REGIONS] = {
+    &vfu_object_bar0_handler,
+    &vfu_object_bar1_handler,
+    &vfu_object_bar2_handler,
+    &vfu_object_bar3_handler,
+    &vfu_object_bar4_handler,
+    &vfu_object_bar5_handler,
+};
+
+/**
+ * vfu_object_register_bars - Identify active BAR regions of pdev and setup
+ *                            callbacks to handle read/write accesses
+ */
+static void vfu_object_register_bars(vfu_ctx_t *vfu_ctx, PCIDevice *pdev)
+{
+    int i;
+
+    for (i = 0; i < PCI_NUM_REGIONS; i++) {
+        if (!pdev->io_regions[i].size) {
+            continue;
+        }
+
+        vfu_setup_region(vfu_ctx, VFU_PCI_DEV_BAR0_REGION_IDX + i,
+                         (size_t)pdev->io_regions[i].size,
+                         vfu_object_bar_handlers[i],
+                         VFU_REGION_FLAG_RW, NULL, 0, -1, 0);
+
+        trace_vfu_bar_register(i, pdev->io_regions[i].addr,
+                               pdev->io_regions[i].size);
+    }
+}
+
 /*
  * TYPE_VFU_OBJECT depends on the availability of the 'socket' and 'device'
  * properties. It also depends on devices instantiated in QEMU. These
@@ -386,6 +476,8 @@ static void vfu_object_init_ctx(VfuObject *o, Error **errp)
         goto fail;
     }
 
+    vfu_object_register_bars(o->vfu_ctx, o->pci_dev);
+
     ret = vfu_realize_ctx(o->vfu_ctx);
     if (ret < 0) {
         error_setg(errp, "vfu: Failed to realize device %s- %s",
diff --git a/hw/remote/trace-events b/hw/remote/trace-events
index f945c7e33b..847d50d88f 100644
--- a/hw/remote/trace-events
+++ b/hw/remote/trace-events
@@ -9,3 +9,6 @@ vfu_cfg_read(uint32_t offset, uint32_t val) "vfu: cfg: 0x%u -> 0x%x"
 vfu_cfg_write(uint32_t offset, uint32_t val) "vfu: cfg: 0x%u <- 0x%x"
 vfu_dma_register(uint64_t gpa, size_t len) "vfu: registering GPA 0x%"PRIx64", %zu bytes"
 vfu_dma_unregister(uint64_t gpa) "vfu: unregistering GPA 0x%"PRIx64""
+vfu_bar_register(int i, uint64_t addr, uint64_t size) "vfu: BAR %d: addr 0x%"PRIx64" size 0x%"PRIx64""
+vfu_bar_rw_enter(const char *op, uint64_t addr) "vfu: %s request for BAR address 0x%"PRIx64""
+vfu_bar_rw_exit(const char *op, uint64_t addr) "vfu: Finished %s of BAR address 0x%"PRIx64""
-- 
2.20.1



^ permalink raw reply related	[flat|nested] 99+ messages in thread

* [PATCH v5 16/18] vfio-user: handle device interrupts
  2022-01-19 21:41 [PATCH v5 00/18] vfio-user server in QEMU Jagannathan Raman
                   ` (14 preceding siblings ...)
  2022-01-19 21:42 ` [PATCH v5 15/18] vfio-user: handle PCI BAR accesses Jagannathan Raman
@ 2022-01-19 21:42 ` Jagannathan Raman
  2022-01-25 15:25   ` Stefan Hajnoczi
  2022-01-19 21:42 ` [PATCH v5 17/18] vfio-user: register handlers to facilitate migration Jagannathan Raman
                   ` (2 subsequent siblings)
  18 siblings, 1 reply; 99+ messages in thread
From: Jagannathan Raman @ 2022-01-19 21:42 UTC (permalink / raw)
  To: qemu-devel
  Cc: eduardo, elena.ufimtseva, john.g.johnson, berrange, bleal,
	john.levon, mst, armbru, quintela, f4bug, marcandre.lureau,
	stefanha, thanos.makatos, pbonzini, jag.raman, eblake, dgilbert

Forward remote device's interrupts to the guest

Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
---
 include/hw/pci/pci.h      |  6 +++
 hw/pci/msi.c              | 13 +++++-
 hw/pci/msix.c             | 12 +++++-
 hw/remote/vfio-user-obj.c | 89 +++++++++++++++++++++++++++++++++++++++
 hw/remote/trace-events    |  1 +
 5 files changed, 119 insertions(+), 2 deletions(-)

diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
index 8c18f10d9d..092334d2af 100644
--- a/include/hw/pci/pci.h
+++ b/include/hw/pci/pci.h
@@ -128,6 +128,8 @@ typedef uint32_t PCIConfigReadFunc(PCIDevice *pci_dev,
 typedef void PCIMapIORegionFunc(PCIDevice *pci_dev, int region_num,
                                 pcibus_t addr, pcibus_t size, int type);
 typedef void PCIUnregisterFunc(PCIDevice *pci_dev);
+typedef void PCIMSINotify(PCIDevice *pci_dev, unsigned vector);
+typedef void PCIMSIxNotify(PCIDevice *pci_dev, unsigned vector);
 
 typedef struct PCIIORegion {
     pcibus_t addr; /* current PCI mapping address. -1 means not mapped */
@@ -322,6 +324,10 @@ struct PCIDevice {
     /* Space to store MSIX table & pending bit array */
     uint8_t *msix_table;
     uint8_t *msix_pba;
+
+    PCIMSINotify *msi_notify;
+    PCIMSIxNotify *msix_notify;
+
     /* MemoryRegion container for msix exclusive BAR setup */
     MemoryRegion msix_exclusive_bar;
     /* Memory Regions for MSIX table and pending bit entries. */
diff --git a/hw/pci/msi.c b/hw/pci/msi.c
index 47d2b0f33c..93f5e400cc 100644
--- a/hw/pci/msi.c
+++ b/hw/pci/msi.c
@@ -51,6 +51,8 @@
  */
 bool msi_nonbroken;
 
+static void pci_msi_notify(PCIDevice *dev, unsigned int vector);
+
 /* If we get rid of cap allocator, we won't need this. */
 static inline uint8_t msi_cap_sizeof(uint16_t flags)
 {
@@ -225,6 +227,8 @@ int msi_init(struct PCIDevice *dev, uint8_t offset,
     dev->msi_cap = config_offset;
     dev->cap_present |= QEMU_PCI_CAP_MSI;
 
+    dev->msi_notify = pci_msi_notify;
+
     pci_set_word(dev->config + msi_flags_off(dev), flags);
     pci_set_word(dev->wmask + msi_flags_off(dev),
                  PCI_MSI_FLAGS_QSIZE | PCI_MSI_FLAGS_ENABLE);
@@ -307,7 +311,7 @@ bool msi_is_masked(const PCIDevice *dev, unsigned int vector)
     return mask & (1U << vector);
 }
 
-void msi_notify(PCIDevice *dev, unsigned int vector)
+static void pci_msi_notify(PCIDevice *dev, unsigned int vector)
 {
     uint16_t flags = pci_get_word(dev->config + msi_flags_off(dev));
     bool msi64bit = flags & PCI_MSI_FLAGS_64BIT;
@@ -332,6 +336,13 @@ void msi_notify(PCIDevice *dev, unsigned int vector)
     msi_send_message(dev, msg);
 }
 
+void msi_notify(PCIDevice *dev, unsigned int vector)
+{
+    if (dev->msi_notify) {
+        dev->msi_notify(dev, vector);
+    }
+}
+
 void msi_send_message(PCIDevice *dev, MSIMessage msg)
 {
     MemTxAttrs attrs = {};
diff --git a/hw/pci/msix.c b/hw/pci/msix.c
index ae9331cd0b..1c71e67f53 100644
--- a/hw/pci/msix.c
+++ b/hw/pci/msix.c
@@ -31,6 +31,8 @@
 #define MSIX_ENABLE_MASK (PCI_MSIX_FLAGS_ENABLE >> 8)
 #define MSIX_MASKALL_MASK (PCI_MSIX_FLAGS_MASKALL >> 8)
 
+static void pci_msix_notify(PCIDevice *dev, unsigned vector);
+
 MSIMessage msix_get_message(PCIDevice *dev, unsigned vector)
 {
     uint8_t *table_entry = dev->msix_table + vector * PCI_MSIX_ENTRY_SIZE;
@@ -334,6 +336,7 @@ int msix_init(struct PCIDevice *dev, unsigned short nentries,
     dev->msix_table = g_malloc0(table_size);
     dev->msix_pba = g_malloc0(pba_size);
     dev->msix_entry_used = g_malloc0(nentries * sizeof *dev->msix_entry_used);
+    dev->msix_notify = pci_msix_notify;
 
     msix_mask_all(dev, nentries);
 
@@ -485,7 +488,7 @@ int msix_enabled(PCIDevice *dev)
 }
 
 /* Send an MSI-X message */
-void msix_notify(PCIDevice *dev, unsigned vector)
+static void pci_msix_notify(PCIDevice *dev, unsigned vector)
 {
     MSIMessage msg;
 
@@ -503,6 +506,13 @@ void msix_notify(PCIDevice *dev, unsigned vector)
     msi_send_message(dev, msg);
 }
 
+void msix_notify(PCIDevice *dev, unsigned vector)
+{
+    if (dev->msix_notify) {
+        dev->msix_notify(dev, vector);
+    }
+}
+
 void msix_reset(PCIDevice *dev)
 {
     if (!msix_present(dev)) {
diff --git a/hw/remote/vfio-user-obj.c b/hw/remote/vfio-user-obj.c
index bf88eac8f1..1771dba1bf 100644
--- a/hw/remote/vfio-user-obj.c
+++ b/hw/remote/vfio-user-obj.c
@@ -53,6 +53,8 @@
 #include "hw/qdev-core.h"
 #include "hw/pci/pci.h"
 #include "qemu/timer.h"
+#include "hw/pci/msi.h"
+#include "hw/pci/msix.h"
 
 #define TYPE_VFU_OBJECT "x-vfio-user-server"
 OBJECT_DECLARE_TYPE(VfuObject, VfuObjectClass, VFU_OBJECT)
@@ -106,6 +108,8 @@ struct VfuObject {
     int vfu_poll_fd;
 };
 
+static GHashTable *vfu_object_dev_to_ctx_table;
+
 static void vfu_object_init_ctx(VfuObject *o, Error **errp);
 
 static void vfu_object_set_socket(Object *obj, Visitor *v, const char *name,
@@ -381,6 +385,72 @@ static void vfu_object_register_bars(vfu_ctx_t *vfu_ctx, PCIDevice *pdev)
     }
 }
 
+static int vfu_object_map_irq(PCIDevice *pci_dev, int intx)
+{
+    /*
+     * We only register one INTx interrupt with the server. map_irq
+     * callback is required for PCIBus.
+     */
+    return 0;
+}
+
+static void vfu_object_set_irq(void *opaque, int pirq, int level)
+{
+    vfu_ctx_t *vfu_ctx = opaque;
+
+    if (vfu_ctx && level) {
+        vfu_irq_trigger(vfu_ctx, 0);
+    }
+}
+
+static void vfu_object_msi_notify(PCIDevice *pci_dev, unsigned vector)
+{
+    vfu_ctx_t *vfu_ctx = NULL;
+
+    if (!vfu_object_dev_to_ctx_table) {
+        return;
+    }
+
+    vfu_ctx = g_hash_table_lookup(vfu_object_dev_to_ctx_table, pci_dev);
+
+    if (vfu_ctx) {
+        vfu_irq_trigger(vfu_ctx, vector);
+    }
+}
+
+static int vfu_object_setup_irqs(VfuObject *o, PCIDevice *pci_dev)
+{
+    vfu_ctx_t *vfu_ctx = o->vfu_ctx;
+    int ret;
+
+    ret = vfu_setup_device_nr_irqs(vfu_ctx, VFU_DEV_INTX_IRQ, 1);
+    if (ret < 0) {
+        return ret;
+    }
+
+    pci_bus_irqs(pci_get_bus(o->pci_dev), vfu_object_set_irq,
+                 vfu_object_map_irq, o->vfu_ctx, 1);
+
+    ret = 0;
+    if (msix_nr_vectors_allocated(pci_dev)) {
+        ret = vfu_setup_device_nr_irqs(vfu_ctx, VFU_DEV_MSIX_IRQ,
+                                       msix_nr_vectors_allocated(pci_dev));
+
+        pci_dev->msix_notify = vfu_object_msi_notify;
+    } else if (msi_nr_vectors_allocated(pci_dev)) {
+        ret = vfu_setup_device_nr_irqs(vfu_ctx, VFU_DEV_MSI_IRQ,
+                                       msi_nr_vectors_allocated(pci_dev));
+
+        pci_dev->msi_notify = vfu_object_msi_notify;
+    }
+
+    if (ret < 0) {
+        return ret;
+    }
+
+    return 0;
+}
+
 /*
  * TYPE_VFU_OBJECT depends on the availability of the 'socket' and 'device'
  * properties. It also depends on devices instantiated in QEMU. These
@@ -478,6 +548,13 @@ static void vfu_object_init_ctx(VfuObject *o, Error **errp)
 
     vfu_object_register_bars(o->vfu_ctx, o->pci_dev);
 
+    ret = vfu_object_setup_irqs(o, o->pci_dev);
+    if (ret < 0) {
+        error_setg(errp, "vfu: Failed to setup interrupts for %s",
+                   o->device);
+        goto fail;
+    }
+
     ret = vfu_realize_ctx(o->vfu_ctx);
     if (ret < 0) {
         error_setg(errp, "vfu: Failed to realize device %s- %s",
@@ -491,6 +568,8 @@ static void vfu_object_init_ctx(VfuObject *o, Error **errp)
         goto fail;
     }
 
+    g_hash_table_insert(vfu_object_dev_to_ctx_table, o->pci_dev, o->vfu_ctx);
+
     qemu_set_fd_handler(o->vfu_poll_fd, vfu_object_attach_ctx, NULL, o);
 
     return;
@@ -552,9 +631,15 @@ static void vfu_object_finalize(Object *obj)
         o->unplug_blocker = NULL;
     }
 
+    if (o->pci_dev) {
+        g_hash_table_remove(vfu_object_dev_to_ctx_table, o->pci_dev);
+    }
+
     o->pci_dev = NULL;
 
     if (!k->nr_devs && k->auto_shutdown) {
+        g_hash_table_destroy(vfu_object_dev_to_ctx_table);
+        vfu_object_dev_to_ctx_table = NULL;
         qemu_system_shutdown_request(SHUTDOWN_CAUSE_GUEST_SHUTDOWN);
     }
 
@@ -572,6 +657,10 @@ static void vfu_object_class_init(ObjectClass *klass, void *data)
 
     k->auto_shutdown = true;
 
+    msi_nonbroken = true;
+
+    vfu_object_dev_to_ctx_table = g_hash_table_new_full(NULL, NULL, NULL, NULL);
+
     object_class_property_add(klass, "socket", "SocketAddress", NULL,
                               vfu_object_set_socket, NULL, NULL);
     object_class_property_set_description(klass, "socket",
diff --git a/hw/remote/trace-events b/hw/remote/trace-events
index 847d50d88f..c167b3c7a5 100644
--- a/hw/remote/trace-events
+++ b/hw/remote/trace-events
@@ -12,3 +12,4 @@ vfu_dma_unregister(uint64_t gpa) "vfu: unregistering GPA 0x%"PRIx64""
 vfu_bar_register(int i, uint64_t addr, uint64_t size) "vfu: BAR %d: addr 0x%"PRIx64" size 0x%"PRIx64""
 vfu_bar_rw_enter(const char *op, uint64_t addr) "vfu: %s request for BAR address 0x%"PRIx64""
 vfu_bar_rw_exit(const char *op, uint64_t addr) "vfu: Finished %s of BAR address 0x%"PRIx64""
+vfu_interrupt(int pirq) "vfu: sending interrupt to device - PIRQ %d"
-- 
2.20.1



^ permalink raw reply related	[flat|nested] 99+ messages in thread

* [PATCH v5 17/18] vfio-user: register handlers to facilitate migration
  2022-01-19 21:41 [PATCH v5 00/18] vfio-user server in QEMU Jagannathan Raman
                   ` (15 preceding siblings ...)
  2022-01-19 21:42 ` [PATCH v5 16/18] vfio-user: handle device interrupts Jagannathan Raman
@ 2022-01-19 21:42 ` Jagannathan Raman
  2022-01-25 15:48   ` Stefan Hajnoczi
  2022-01-19 21:42 ` [PATCH v5 18/18] vfio-user: avocado tests for vfio-user Jagannathan Raman
  2022-01-25 16:00 ` [PATCH v5 00/18] vfio-user server in QEMU Stefan Hajnoczi
  18 siblings, 1 reply; 99+ messages in thread
From: Jagannathan Raman @ 2022-01-19 21:42 UTC (permalink / raw)
  To: qemu-devel
  Cc: eduardo, elena.ufimtseva, john.g.johnson, berrange, bleal,
	john.levon, mst, armbru, quintela, f4bug, marcandre.lureau,
	stefanha, thanos.makatos, pbonzini, jag.raman, eblake, dgilbert

Store and load the device's state during migration. use libvfio-user's
handlers for this purpose

Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
---
 include/migration/vmstate.h |   2 +
 migration/savevm.h          |   2 +
 hw/remote/vfio-user-obj.c   | 323 ++++++++++++++++++++++++++++++++++++
 migration/savevm.c          |  73 ++++++++
 migration/vmstate.c         |  19 +++
 5 files changed, 419 insertions(+)

diff --git a/include/migration/vmstate.h b/include/migration/vmstate.h
index 017c03675c..68bea576ea 100644
--- a/include/migration/vmstate.h
+++ b/include/migration/vmstate.h
@@ -1165,6 +1165,8 @@ extern const VMStateInfo vmstate_info_qlist;
 #define VMSTATE_END_OF_LIST()                                         \
     {}
 
+uint64_t vmstate_vmsd_size(PCIDevice *pci_dev);
+
 int vmstate_load_state(QEMUFile *f, const VMStateDescription *vmsd,
                        void *opaque, int version_id);
 int vmstate_save_state(QEMUFile *f, const VMStateDescription *vmsd,
diff --git a/migration/savevm.h b/migration/savevm.h
index 6461342cb4..8007064ff2 100644
--- a/migration/savevm.h
+++ b/migration/savevm.h
@@ -67,5 +67,7 @@ int qemu_loadvm_state_main(QEMUFile *f, MigrationIncomingState *mis);
 int qemu_load_device_state(QEMUFile *f);
 int qemu_savevm_state_complete_precopy_non_iterable(QEMUFile *f,
         bool in_postcopy, bool inactivate_disks);
+int qemu_remote_savevm(QEMUFile *f, DeviceState *dev);
+int qemu_remote_loadvm(QEMUFile *f);
 
 #endif
diff --git a/hw/remote/vfio-user-obj.c b/hw/remote/vfio-user-obj.c
index 1771dba1bf..d3c51577bd 100644
--- a/hw/remote/vfio-user-obj.c
+++ b/hw/remote/vfio-user-obj.c
@@ -55,6 +55,11 @@
 #include "qemu/timer.h"
 #include "hw/pci/msi.h"
 #include "hw/pci/msix.h"
+#include "migration/qemu-file.h"
+#include "migration/savevm.h"
+#include "migration/vmstate.h"
+#include "migration/global_state.h"
+#include "block/block.h"
 
 #define TYPE_VFU_OBJECT "x-vfio-user-server"
 OBJECT_DECLARE_TYPE(VfuObject, VfuObjectClass, VFU_OBJECT)
@@ -106,6 +111,35 @@ struct VfuObject {
     Error *unplug_blocker;
 
     int vfu_poll_fd;
+
+    /*
+     * vfu_mig_buf holds the migration data. In the remote server, this
+     * buffer replaces the role of an IO channel which links the source
+     * and the destination.
+     *
+     * Whenever the client QEMU process initiates migration, the remote
+     * server gets notified via libvfio-user callbacks. The remote server
+     * sets up a QEMUFile object using this buffer as backend. The remote
+     * server passes this object to its migration subsystem, which slurps
+     * the VMSD of the device ('devid' above) referenced by this object
+     * and stores the VMSD in this buffer.
+     *
+     * The client subsequetly asks the remote server for any data that
+     * needs to be moved over to the destination via libvfio-user
+     * library's vfu_migration_callbacks_t callbacks. The remote hands
+     * over this buffer as data at this time.
+     *
+     * A reverse of this process happens at the destination.
+     */
+    uint8_t *vfu_mig_buf;
+
+    uint64_t vfu_mig_buf_size;
+
+    uint64_t vfu_mig_buf_pending;
+
+    QEMUFile *vfu_mig_file;
+
+    vfu_migr_state_t vfu_state;
 };
 
 static GHashTable *vfu_object_dev_to_ctx_table;
@@ -157,6 +191,272 @@ static void vfu_object_set_device(Object *obj, const char *str, Error **errp)
     vfu_object_init_ctx(o, errp);
 }
 
+/**
+ * Migration helper functions
+ *
+ * vfu_mig_buf_read & vfu_mig_buf_write are used by QEMU's migration
+ * subsystem - qemu_remote_loadvm & qemu_remote_savevm. loadvm/savevm
+ * call these functions via QEMUFileOps to load/save the VMSD of a
+ * device into vfu_mig_buf
+ *
+ */
+static ssize_t vfu_mig_buf_read(void *opaque, uint8_t *buf, int64_t pos,
+                                size_t size, Error **errp)
+{
+    VfuObject *o = opaque;
+
+    if (pos > o->vfu_mig_buf_size) {
+        size = 0;
+    } else if ((pos + size) > o->vfu_mig_buf_size) {
+        size = o->vfu_mig_buf_size - pos;
+    }
+
+    memcpy(buf, (o->vfu_mig_buf + pos), size);
+
+    return size;
+}
+
+static ssize_t vfu_mig_buf_write(void *opaque, struct iovec *iov, int iovcnt,
+                                 int64_t pos, Error **errp)
+{
+    VfuObject *o = opaque;
+    uint64_t end = pos + iov_size(iov, iovcnt);
+    int i;
+
+    if (end > o->vfu_mig_buf_size) {
+        o->vfu_mig_buf = g_realloc(o->vfu_mig_buf, end);
+    }
+
+    for (i = 0; i < iovcnt; i++) {
+        memcpy((o->vfu_mig_buf + o->vfu_mig_buf_size), iov[i].iov_base,
+               iov[i].iov_len);
+        o->vfu_mig_buf_size += iov[i].iov_len;
+        o->vfu_mig_buf_pending += iov[i].iov_len;
+    }
+
+    return iov_size(iov, iovcnt);
+}
+
+static int vfu_mig_buf_shutdown(void *opaque, bool rd, bool wr, Error **errp)
+{
+    VfuObject *o = opaque;
+
+    o->vfu_mig_buf_size = 0;
+
+    g_free(o->vfu_mig_buf);
+
+    o->vfu_mig_buf = NULL;
+
+    o->vfu_mig_buf_pending = 0;
+
+    return 0;
+}
+
+static const QEMUFileOps vfu_mig_fops_save = {
+    .writev_buffer  = vfu_mig_buf_write,
+    .shut_down      = vfu_mig_buf_shutdown,
+};
+
+static const QEMUFileOps vfu_mig_fops_load = {
+    .get_buffer     = vfu_mig_buf_read,
+    .shut_down      = vfu_mig_buf_shutdown,
+};
+
+/**
+ * handlers for vfu_migration_callbacks_t
+ *
+ * The libvfio-user library accesses these handlers to drive the migration
+ * at the remote end, and also to transport the data stored in vfu_mig_buf
+ *
+ */
+static void vfu_mig_state_stop_and_copy(vfu_ctx_t *vfu_ctx)
+{
+    VfuObject *o = vfu_get_private(vfu_ctx);
+    int ret;
+
+    if (!o->vfu_mig_file) {
+        o->vfu_mig_file = qemu_fopen_ops(o, &vfu_mig_fops_save, false);
+    }
+
+    ret = qemu_remote_savevm(o->vfu_mig_file, DEVICE(o->pci_dev));
+    if (ret) {
+        qemu_file_shutdown(o->vfu_mig_file);
+        o->vfu_mig_file = NULL;
+        return;
+    }
+
+    qemu_fflush(o->vfu_mig_file);
+}
+
+static void vfu_mig_state_running(vfu_ctx_t *vfu_ctx)
+{
+    VfuObject *o = vfu_get_private(vfu_ctx);
+    VfuObjectClass *k = VFU_OBJECT_GET_CLASS(OBJECT(o));
+    static int migrated_devs;
+    Error *local_err = NULL;
+    int ret;
+
+    /**
+     * TODO: move to VFU_MIGR_STATE_RESUME handler. Presently, the
+     * VMSD data from source is not available at RESUME state.
+     * Working on a fix for this.
+     */
+    if (!o->vfu_mig_file) {
+        o->vfu_mig_file = qemu_fopen_ops(o, &vfu_mig_fops_load, false);
+    }
+
+    ret = qemu_remote_loadvm(o->vfu_mig_file);
+    if (ret) {
+        VFU_OBJECT_ERROR(o, "vfu: failed to restore device state");
+        return;
+    }
+
+    qemu_file_shutdown(o->vfu_mig_file);
+    o->vfu_mig_file = NULL;
+
+    /* VFU_MIGR_STATE_RUNNING begins here */
+    if (++migrated_devs == k->nr_devs) {
+        bdrv_invalidate_cache_all(&local_err);
+        if (local_err) {
+            error_report_err(local_err);
+            return;
+        }
+
+        vm_start();
+    }
+}
+
+static void vfu_mig_state_stop(vfu_ctx_t *vfu_ctx)
+{
+    VfuObject *o = vfu_get_private(vfu_ctx);
+    VfuObjectClass *k = VFU_OBJECT_GET_CLASS(OBJECT(o));
+    static int migrated_devs;
+
+    /**
+     * note: calling bdrv_inactivate_all() is not the best approach.
+     *
+     *  Ideally, we would identify the block devices (if any) indirectly
+     *  linked (such as via a scsi-hd device) to each of the migrated devices,
+     *  and inactivate them individually. This is essential while operating
+     *  the server in a storage daemon mode, with devices from different VMs.
+     *
+     *  However, we currently don't have this capability. As such, we need to
+     *  inactivate all devices at the same time when migration is completed.
+     */
+    if (++migrated_devs == k->nr_devs) {
+        vm_stop(RUN_STATE_PAUSED);
+        bdrv_inactivate_all();
+    }
+}
+
+static int vfu_mig_transition(vfu_ctx_t *vfu_ctx, vfu_migr_state_t state)
+{
+    VfuObject *o = vfu_get_private(vfu_ctx);
+
+    if (o->vfu_state == state) {
+        return 0;
+    }
+
+    switch (state) {
+    case VFU_MIGR_STATE_RESUME:
+        break;
+    case VFU_MIGR_STATE_STOP_AND_COPY:
+        vfu_mig_state_stop_and_copy(vfu_ctx);
+        break;
+    case VFU_MIGR_STATE_STOP:
+        vfu_mig_state_stop(vfu_ctx);
+        break;
+    case VFU_MIGR_STATE_PRE_COPY:
+        break;
+    case VFU_MIGR_STATE_RUNNING:
+        if (!runstate_is_running()) {
+            vfu_mig_state_running(vfu_ctx);
+        }
+        break;
+    default:
+        warn_report("vfu: Unknown migration state %d", state);
+    }
+
+    o->vfu_state = state;
+
+    return 0;
+}
+
+static uint64_t vfu_mig_get_pending_bytes(vfu_ctx_t *vfu_ctx)
+{
+    VfuObject *o = vfu_get_private(vfu_ctx);
+
+    return o->vfu_mig_buf_pending;
+}
+
+static int vfu_mig_prepare_data(vfu_ctx_t *vfu_ctx, uint64_t *offset,
+                                uint64_t *size)
+{
+    VfuObject *o = vfu_get_private(vfu_ctx);
+
+    if (offset) {
+        *offset = 0;
+    }
+
+    if (size) {
+        *size = o->vfu_mig_buf_size;
+    }
+
+    return 0;
+}
+
+static ssize_t vfu_mig_read_data(vfu_ctx_t *vfu_ctx, void *buf,
+                                 uint64_t size, uint64_t offset)
+{
+    VfuObject *o = vfu_get_private(vfu_ctx);
+
+    if (offset > o->vfu_mig_buf_size) {
+        return -1;
+    }
+
+    if ((offset + size) > o->vfu_mig_buf_size) {
+        warn_report("vfu: buffer overflow - check pending_bytes");
+        size = o->vfu_mig_buf_size - offset;
+    }
+
+    memcpy(buf, (o->vfu_mig_buf + offset), size);
+
+    o->vfu_mig_buf_pending -= size;
+
+    return size;
+}
+
+static ssize_t vfu_mig_write_data(vfu_ctx_t *vfu_ctx, void *data,
+                                  uint64_t size, uint64_t offset)
+{
+    VfuObject *o = vfu_get_private(vfu_ctx);
+    uint64_t end = offset + size;
+
+    if (end > o->vfu_mig_buf_size) {
+        o->vfu_mig_buf = g_realloc(o->vfu_mig_buf, end);
+        o->vfu_mig_buf_size = end;
+    }
+
+    memcpy((o->vfu_mig_buf + offset), data, size);
+
+    return size;
+}
+
+static int vfu_mig_data_written(vfu_ctx_t *vfu_ctx, uint64_t count)
+{
+    return 0;
+}
+
+static const vfu_migration_callbacks_t vfu_mig_cbs = {
+    .version = VFU_MIGR_CALLBACKS_VERS,
+    .transition = &vfu_mig_transition,
+    .get_pending_bytes = &vfu_mig_get_pending_bytes,
+    .prepare_data = &vfu_mig_prepare_data,
+    .read_data = &vfu_mig_read_data,
+    .data_written = &vfu_mig_data_written,
+    .write_data = &vfu_mig_write_data,
+};
+
 static void vfu_object_ctx_run(void *opaque)
 {
     VfuObject *o = opaque;
@@ -476,6 +776,7 @@ static void vfu_object_init_ctx(VfuObject *o, Error **errp)
     ERRP_GUARD();
     DeviceState *dev = NULL;
     vfu_pci_type_t pci_type = VFU_PCI_TYPE_CONVENTIONAL;
+    uint64_t migr_regs_size, migr_size;
     int ret;
 
     if (o->vfu_ctx || !o->socket || !o->device ||
@@ -555,6 +856,26 @@ static void vfu_object_init_ctx(VfuObject *o, Error **errp)
         goto fail;
     }
 
+    migr_regs_size = vfu_get_migr_register_area_size();
+    migr_size = migr_regs_size + vmstate_vmsd_size(o->pci_dev);
+
+    ret = vfu_setup_region(o->vfu_ctx, VFU_PCI_DEV_MIGR_REGION_IDX,
+                           migr_size, NULL,
+                           VFU_REGION_FLAG_RW, NULL, 0, -1, 0);
+    if (ret < 0) {
+        error_setg(errp, "vfu: Failed to register migration BAR %s- %s",
+                   o->device, strerror(errno));
+        goto fail;
+    }
+
+    ret = vfu_setup_device_migration_callbacks(o->vfu_ctx, &vfu_mig_cbs,
+                                               migr_regs_size);
+    if (ret < 0) {
+        error_setg(errp, "vfu: Failed to setup migration %s- %s",
+                   o->device, strerror(errno));
+        goto fail;
+    }
+
     ret = vfu_realize_ctx(o->vfu_ctx);
     if (ret < 0) {
         error_setg(errp, "vfu: Failed to realize device %s- %s",
@@ -604,6 +925,8 @@ static void vfu_object_init(Object *obj)
     }
 
     o->vfu_poll_fd = -1;
+
+    o->vfu_state = VFU_MIGR_STATE_STOP;
 }
 
 static void vfu_object_finalize(Object *obj)
diff --git a/migration/savevm.c b/migration/savevm.c
index 0bef031acb..be119e2e59 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -1605,6 +1605,49 @@ static int qemu_savevm_state(QEMUFile *f, Error **errp)
     return ret;
 }
 
+static SaveStateEntry *find_se_from_dev(DeviceState *dev)
+{
+    SaveStateEntry *se;
+
+    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
+        if (se->opaque == dev) {
+            return se;
+        }
+    }
+
+    return NULL;
+}
+
+int qemu_remote_savevm(QEMUFile *f, DeviceState *dev)
+{
+    SaveStateEntry *se;
+    int ret = 0;
+
+    se = find_se_from_dev(dev);
+    if (!se) {
+        return -ENODEV;
+    }
+
+    if (!se->vmsd || !vmstate_save_needed(se->vmsd, se->opaque)) {
+        return ret;
+    }
+
+    save_section_header(f, se, QEMU_VM_SECTION_FULL);
+
+    ret = vmstate_save(f, se, NULL);
+    if (ret) {
+        qemu_file_set_error(f, ret);
+        return ret;
+    }
+
+    save_section_footer(f, se);
+
+    qemu_put_byte(f, QEMU_VM_EOF);
+    qemu_fflush(f);
+
+    return 0;
+}
+
 void qemu_savevm_live_state(QEMUFile *f)
 {
     /* save QEMU_VM_SECTION_END section */
@@ -2446,6 +2489,36 @@ qemu_loadvm_section_start_full(QEMUFile *f, MigrationIncomingState *mis)
     return 0;
 }
 
+int qemu_remote_loadvm(QEMUFile *f)
+{
+    uint8_t section_type;
+    int ret = 0;
+
+    while (true) {
+        section_type = qemu_get_byte(f);
+
+        ret = qemu_file_get_error(f);
+        if (ret) {
+            break;
+        }
+
+        switch (section_type) {
+        case QEMU_VM_SECTION_FULL:
+            ret = qemu_loadvm_section_start_full(f, NULL);
+            if (ret < 0) {
+                break;
+            }
+            break;
+        case QEMU_VM_EOF:
+            return ret;
+        default:
+            return -EINVAL;
+        }
+    }
+
+    return ret;
+}
+
 static int
 qemu_loadvm_section_part_end(QEMUFile *f, MigrationIncomingState *mis)
 {
diff --git a/migration/vmstate.c b/migration/vmstate.c
index 05f87cdddc..83f8562792 100644
--- a/migration/vmstate.c
+++ b/migration/vmstate.c
@@ -63,6 +63,25 @@ static int vmstate_size(void *opaque, const VMStateField *field)
     return size;
 }
 
+uint64_t vmstate_vmsd_size(PCIDevice *pci_dev)
+{
+    DeviceClass *dc = DEVICE_GET_CLASS(DEVICE(pci_dev));
+    const VMStateField *field = NULL;
+    uint64_t size = 0;
+
+    if (!dc->vmsd) {
+        return 0;
+    }
+
+    field = dc->vmsd->fields;
+    while (field && field->name) {
+        size += vmstate_size(pci_dev, field);
+        field++;
+    }
+
+    return size;
+}
+
 static void vmstate_handle_alloc(void *ptr, const VMStateField *field,
                                  void *opaque)
 {
-- 
2.20.1



^ permalink raw reply related	[flat|nested] 99+ messages in thread

* [PATCH v5 18/18] vfio-user: avocado tests for vfio-user
  2022-01-19 21:41 [PATCH v5 00/18] vfio-user server in QEMU Jagannathan Raman
                   ` (16 preceding siblings ...)
  2022-01-19 21:42 ` [PATCH v5 17/18] vfio-user: register handlers to facilitate migration Jagannathan Raman
@ 2022-01-19 21:42 ` Jagannathan Raman
  2022-01-26  4:25   ` Philippe Mathieu-Daudé via
  2022-01-25 16:00 ` [PATCH v5 00/18] vfio-user server in QEMU Stefan Hajnoczi
  18 siblings, 1 reply; 99+ messages in thread
From: Jagannathan Raman @ 2022-01-19 21:42 UTC (permalink / raw)
  To: qemu-devel
  Cc: eduardo, elena.ufimtseva, john.g.johnson, berrange, bleal,
	john.levon, mst, armbru, quintela, f4bug, marcandre.lureau,
	stefanha, thanos.makatos, pbonzini, jag.raman, eblake, dgilbert

Avocado tests for libvfio-user in QEMU - tests startup,
hotplug and migration of the server object

Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
---
 MAINTAINERS                |   1 +
 tests/avocado/vfio-user.py | 225 +++++++++++++++++++++++++++++++++++++
 2 files changed, 226 insertions(+)
 create mode 100644 tests/avocado/vfio-user.py

diff --git a/MAINTAINERS b/MAINTAINERS
index 93bce3fa62..9ef9e1f75a 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -3488,6 +3488,7 @@ F: hw/remote/iohub.c
 F: include/hw/remote/iohub.h
 F: subprojects/libvfio-user
 F: hw/remote/vfio-user-obj.c
+F: tests/avocado/vfio-user.py
 
 EBPF:
 M: Jason Wang <jasowang@redhat.com>
diff --git a/tests/avocado/vfio-user.py b/tests/avocado/vfio-user.py
new file mode 100644
index 0000000000..376c02c41f
--- /dev/null
+++ b/tests/avocado/vfio-user.py
@@ -0,0 +1,225 @@
+# vfio-user protocol sanity test
+#
+# This work is licensed under the terms of the GNU GPL, version 2 or
+# later.  See the COPYING file in the top-level directory.
+
+
+import os
+import socket
+import uuid
+
+from avocado_qemu import QemuSystemTest
+from avocado_qemu import wait_for_console_pattern
+from avocado_qemu import exec_command
+from avocado_qemu import exec_command_and_wait_for_pattern
+
+from avocado.utils import network
+from avocado.utils import wait
+
+class VfioUser(QemuSystemTest):
+    """
+    :avocado: tags=vfiouser
+    """
+    KERNEL_COMMON_COMMAND_LINE = 'printk.time=0 '
+    timeout = 20
+
+    @staticmethod
+    def migration_finished(vm):
+        res = vm.command('query-migrate')
+        if 'status' in res:
+            return res['status'] in ('completed', 'failed')
+        else:
+            return False
+
+    def _get_free_port(self):
+        port = network.find_free_port()
+        if port is None:
+            self.cancel('Failed to find a free port')
+        return port
+
+    def validate_vm_launch(self, vm):
+        wait_for_console_pattern(self, 'as init process',
+                                 'Kernel panic - not syncing', vm=vm)
+        exec_command(self, 'mount -t sysfs sysfs /sys', vm=vm)
+        exec_command_and_wait_for_pattern(self,
+                                          'cat /sys/bus/pci/devices/*/uevent',
+                                          'PCI_ID=1000:0012', vm=vm)
+
+    def launch_server_startup(self, socket, *opts):
+        server_vm = self.get_vm()
+        server_vm.add_args('-machine', 'x-remote')
+        server_vm.add_args('-nodefaults')
+        server_vm.add_args('-device', 'lsi53c895a,id=lsi1')
+        server_vm.add_args('-object', 'x-vfio-user-server,id=vfioobj1,'
+                           'type=unix,path='+socket+',device=lsi1')
+        for opt in opts:
+            server_vm.add_args(opt)
+        server_vm.launch()
+        return server_vm
+
+    def launch_server_hotplug(self, socket):
+        server_vm = self.get_vm()
+        server_vm.add_args('-machine', 'x-remote')
+        server_vm.add_args('-nodefaults')
+        server_vm.add_args('-device', 'lsi53c895a,id=lsi1')
+        server_vm.launch()
+        server_vm.command('human-monitor-command',
+                          command_line='object_add x-vfio-user-server,'
+                                       'id=vfioobj,socket.type=unix,'
+                                       'socket.path='+socket+',device=lsi1')
+        return server_vm
+
+    def launch_client(self, kernel_path, initrd_path, kernel_command_line,
+                      machine_type, socket, *opts):
+        client_vm = self.get_vm()
+        client_vm.set_console()
+        client_vm.add_args('-machine', machine_type)
+        client_vm.add_args('-accel', 'kvm')
+        client_vm.add_args('-cpu', 'host')
+        client_vm.add_args('-object',
+                           'memory-backend-memfd,id=sysmem-file,size=2G')
+        client_vm.add_args('--numa', 'node,memdev=sysmem-file')
+        client_vm.add_args('-m', '2048')
+        client_vm.add_args('-kernel', kernel_path,
+                           '-initrd', initrd_path,
+                           '-append', kernel_command_line)
+        client_vm.add_args('-device',
+                           'vfio-user-pci,x-enable-migration=true,'
+                           'socket='+socket)
+        for opt in opts:
+            client_vm.add_args(opt)
+        client_vm.launch()
+        return client_vm
+
+    def do_test_startup(self, kernel_url, initrd_url, kernel_command_line,
+                machine_type):
+        self.require_accelerator('kvm')
+
+        kernel_path = self.fetch_asset(kernel_url)
+        initrd_path = self.fetch_asset(initrd_url)
+        socket = os.path.join('/tmp', str(uuid.uuid4()))
+        if os.path.exists(socket):
+            os.remove(socket)
+        self.launch_server_startup(socket)
+        client = self.launch_client(kernel_path, initrd_path,
+                                    kernel_command_line, machine_type, socket)
+        self.validate_vm_launch(client)
+
+    def do_test_hotplug(self, kernel_url, initrd_url, kernel_command_line,
+                machine_type):
+        self.require_accelerator('kvm')
+
+        kernel_path = self.fetch_asset(kernel_url)
+        initrd_path = self.fetch_asset(initrd_url)
+        socket = os.path.join('/tmp', str(uuid.uuid4()))
+        if os.path.exists(socket):
+            os.remove(socket)
+        self.launch_server_hotplug(socket)
+        client = self.launch_client(kernel_path, initrd_path,
+                                    kernel_command_line, machine_type, socket)
+        self.validate_vm_launch(client)
+
+    def do_test_migrate(self, kernel_url, initrd_url, kernel_command_line,
+                machine_type):
+        self.require_accelerator('kvm')
+
+        kernel_path = self.fetch_asset(kernel_url)
+        initrd_path = self.fetch_asset(initrd_url)
+        srv_socket = os.path.join('/tmp', str(uuid.uuid4()))
+        if os.path.exists(srv_socket):
+            os.remove(srv_socket)
+        dst_socket = os.path.join('/tmp', str(uuid.uuid4()))
+        if os.path.exists(dst_socket):
+            os.remove(dst_socket)
+        client_uri = 'tcp:localhost:%u' % self._get_free_port()
+        server_uri = 'tcp:localhost:%u' % self._get_free_port()
+
+        """ Launch destination VM """
+        self.launch_server_startup(dst_socket, '-incoming', server_uri)
+        dst_client = self.launch_client(kernel_path, initrd_path,
+                                        kernel_command_line, machine_type,
+                                        dst_socket, '-incoming', client_uri)
+
+        """ Launch source VM """
+        self.launch_server_startup(srv_socket)
+        src_client = self.launch_client(kernel_path, initrd_path,
+                                        kernel_command_line, machine_type,
+                                        srv_socket)
+        self.validate_vm_launch(src_client)
+
+        """ Kick off migration """
+        src_client.qmp('migrate', uri=client_uri)
+
+        wait.wait_for(self.migration_finished,
+                      timeout=self.timeout,
+                      step=0.1,
+                      args=(dst_client,))
+
+    def test_vfio_user_x86_64(self):
+        """
+        :avocado: tags=arch:x86_64
+        :avocado: tags=distro:centos
+        """
+        kernel_url = ('https://archives.fedoraproject.org/pub/archive/fedora'
+                      '/linux/releases/31/Everything/x86_64/os/images'
+                      '/pxeboot/vmlinuz')
+        initrd_url = ('https://archives.fedoraproject.org/pub/archive/fedora'
+                      '/linux/releases/31/Everything/x86_64/os/images'
+                      '/pxeboot/initrd.img')
+        kernel_command_line = (self.KERNEL_COMMON_COMMAND_LINE +
+                               'console=ttyS0 rdinit=/bin/bash')
+        machine_type = 'pc'
+        self.do_test_startup(kernel_url, initrd_url, kernel_command_line,
+                             machine_type)
+
+    def test_vfio_user_aarch64(self):
+        """
+        :avocado: tags=arch:aarch64
+        :avocado: tags=distro:ubuntu
+        """
+        kernel_url = ('https://archives.fedoraproject.org/pub/archive/fedora'
+                      '/linux/releases/31/Everything/aarch64/os/images'
+                      '/pxeboot/vmlinuz')
+        initrd_url = ('https://archives.fedoraproject.org/pub/archive/fedora'
+                      '/linux/releases/31/Everything/aarch64/os/images'
+                      '/pxeboot/initrd.img')
+        kernel_command_line = (self.KERNEL_COMMON_COMMAND_LINE +
+                               'rdinit=/bin/bash console=ttyAMA0')
+        machine_type = 'virt,gic-version=3'
+        self.do_test_startup(kernel_url, initrd_url, kernel_command_line,
+                             machine_type)
+
+    def test_vfio_user_hotplug_x86_64(self):
+        """
+        :avocado: tags=arch:x86_64
+        :avocado: tags=distro:centos
+        """
+        kernel_url = ('https://archives.fedoraproject.org/pub/archive/fedora'
+                      '/linux/releases/31/Everything/x86_64/os/images'
+                      '/pxeboot/vmlinuz')
+        initrd_url = ('https://archives.fedoraproject.org/pub/archive/fedora'
+                      '/linux/releases/31/Everything/x86_64/os/images'
+                      '/pxeboot/initrd.img')
+        kernel_command_line = (self.KERNEL_COMMON_COMMAND_LINE +
+                               'console=ttyS0 rdinit=/bin/bash')
+        machine_type = 'pc'
+        self.do_test_hotplug(kernel_url, initrd_url, kernel_command_line,
+                             machine_type)
+
+    def test_vfio_user_migrate_x86_64(self):
+        """
+        :avocado: tags=arch:x86_64
+        :avocado: tags=distro:centos
+        """
+        kernel_url = ('https://archives.fedoraproject.org/pub/archive/fedora'
+                      '/linux/releases/31/Everything/x86_64/os/images'
+                      '/pxeboot/vmlinuz')
+        initrd_url = ('https://archives.fedoraproject.org/pub/archive/fedora'
+                      '/linux/releases/31/Everything/x86_64/os/images'
+                      '/pxeboot/initrd.img')
+        kernel_command_line = (self.KERNEL_COMMON_COMMAND_LINE +
+                               'console=ttyS0 rdinit=/bin/bash')
+        machine_type = 'pc'
+        self.do_test_migrate(kernel_url, initrd_url, kernel_command_line,
+                             machine_type)
+
-- 
2.20.1



^ permalink raw reply related	[flat|nested] 99+ messages in thread

* Re: [PATCH v5 03/18] pci: isolated address space for PCI bus
  2022-01-19 21:41 ` [PATCH v5 03/18] pci: isolated address space for PCI bus Jagannathan Raman
@ 2022-01-20  0:12   ` Michael S. Tsirkin
  2022-01-20 15:20     ` Jag Raman
  2022-01-25  9:56   ` Stefan Hajnoczi
  1 sibling, 1 reply; 99+ messages in thread
From: Michael S. Tsirkin @ 2022-01-20  0:12 UTC (permalink / raw)
  To: Jagannathan Raman
  Cc: eduardo, elena.ufimtseva, john.g.johnson, berrange, bleal,
	quintela, armbru, john.levon, qemu-devel, f4bug,
	marcandre.lureau, dgilbert, stefanha, thanos.makatos, pbonzini,
	eblake

On Wed, Jan 19, 2022 at 04:41:52PM -0500, Jagannathan Raman wrote:
> Allow PCI buses to be part of isolated CPU address spaces. This has a
> niche usage.
> 
> TYPE_REMOTE_MACHINE allows multiple VMs to house their PCI devices in
> the same machine/server. This would cause address space collision as
> well as be a security vulnerability. Having separate address spaces for
> each PCI bus would solve this problem.

Fascinating, but I am not sure I understand. any examples?

I also wonder whether this special type could be modelled like a special
kind of iommu internally.

> Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
> Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
> Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
> ---
>  include/hw/pci/pci.h     |  2 ++
>  include/hw/pci/pci_bus.h | 17 +++++++++++++++++
>  hw/pci/pci.c             | 17 +++++++++++++++++
>  hw/pci/pci_bridge.c      |  5 +++++
>  4 files changed, 41 insertions(+)
> 
> diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
> index 023abc0f79..9bb4472abc 100644
> --- a/include/hw/pci/pci.h
> +++ b/include/hw/pci/pci.h
> @@ -387,6 +387,8 @@ void pci_device_save(PCIDevice *s, QEMUFile *f);
>  int pci_device_load(PCIDevice *s, QEMUFile *f);
>  MemoryRegion *pci_address_space(PCIDevice *dev);
>  MemoryRegion *pci_address_space_io(PCIDevice *dev);
> +AddressSpace *pci_isol_as_mem(PCIDevice *dev);
> +AddressSpace *pci_isol_as_io(PCIDevice *dev);
>  
>  /*
>   * Should not normally be used by devices. For use by sPAPR target
> diff --git a/include/hw/pci/pci_bus.h b/include/hw/pci/pci_bus.h
> index 347440d42c..d78258e79e 100644
> --- a/include/hw/pci/pci_bus.h
> +++ b/include/hw/pci/pci_bus.h
> @@ -39,9 +39,26 @@ struct PCIBus {
>      void *irq_opaque;
>      PCIDevice *devices[PCI_SLOT_MAX * PCI_FUNC_MAX];
>      PCIDevice *parent_dev;
> +
>      MemoryRegion *address_space_mem;
>      MemoryRegion *address_space_io;
>  
> +    /**
> +     * Isolated address spaces - these allow the PCI bus to be part
> +     * of an isolated address space as opposed to the global
> +     * address_space_memory & address_space_io.

Are you sure address_space_memory & address_space_io are
always global? even in the case of an iommu?

> This allows the
> +     * bus to be attached to CPUs from different machines. The
> +     * following is not used used commonly.
> +     *
> +     * TYPE_REMOTE_MACHINE allows emulating devices from multiple
> +     * VM clients,

what are VM clients?

> as such it needs the PCI buses in the same machine
> +     * to be part of different CPU address spaces. The following is
> +     * useful in that scenario.
> +     *
> +     */
> +    AddressSpace *isol_as_mem;
> +    AddressSpace *isol_as_io;
> +
>      QLIST_HEAD(, PCIBus) child; /* this will be replaced by qdev later */
>      QLIST_ENTRY(PCIBus) sibling;/* this will be replaced by qdev later */
>  
> diff --git a/hw/pci/pci.c b/hw/pci/pci.c
> index 5d30f9ca60..d5f1c6c421 100644
> --- a/hw/pci/pci.c
> +++ b/hw/pci/pci.c
> @@ -442,6 +442,8 @@ static void pci_root_bus_internal_init(PCIBus *bus, DeviceState *parent,
>      bus->slot_reserved_mask = 0x0;
>      bus->address_space_mem = address_space_mem;
>      bus->address_space_io = address_space_io;
> +    bus->isol_as_mem = NULL;
> +    bus->isol_as_io = NULL;
>      bus->flags |= PCI_BUS_IS_ROOT;
>  
>      /* host bridge */
> @@ -2676,6 +2678,16 @@ MemoryRegion *pci_address_space_io(PCIDevice *dev)
>      return pci_get_bus(dev)->address_space_io;
>  }
>  
> +AddressSpace *pci_isol_as_mem(PCIDevice *dev)
> +{
> +    return pci_get_bus(dev)->isol_as_mem;
> +}
> +
> +AddressSpace *pci_isol_as_io(PCIDevice *dev)
> +{
> +    return pci_get_bus(dev)->isol_as_io;
> +}
> +
>  static void pci_device_class_init(ObjectClass *klass, void *data)
>  {
>      DeviceClass *k = DEVICE_CLASS(klass);
> @@ -2699,6 +2711,7 @@ static void pci_device_class_base_init(ObjectClass *klass, void *data)
>  
>  AddressSpace *pci_device_iommu_address_space(PCIDevice *dev)
>  {
> +    AddressSpace *iommu_as = NULL;
>      PCIBus *bus = pci_get_bus(dev);
>      PCIBus *iommu_bus = bus;
>      uint8_t devfn = dev->devfn;
> @@ -2745,6 +2758,10 @@ AddressSpace *pci_device_iommu_address_space(PCIDevice *dev)
>      if (!pci_bus_bypass_iommu(bus) && iommu_bus && iommu_bus->iommu_fn) {
>          return iommu_bus->iommu_fn(bus, iommu_bus->iommu_opaque, devfn);
>      }
> +    iommu_as = pci_isol_as_mem(dev);
> +    if (iommu_as) {
> +        return iommu_as;
> +    }
>      return &address_space_memory;
>  }
>  
> diff --git a/hw/pci/pci_bridge.c b/hw/pci/pci_bridge.c
> index da34c8ebcd..98366768d2 100644
> --- a/hw/pci/pci_bridge.c
> +++ b/hw/pci/pci_bridge.c
> @@ -383,6 +383,11 @@ void pci_bridge_initfn(PCIDevice *dev, const char *typename)
>      sec_bus->address_space_io = &br->address_space_io;
>      memory_region_init(&br->address_space_io, OBJECT(br), "pci_bridge_io",
>                         4 * GiB);
> +
> +    /* This PCI bridge puts the sec_bus in its parent's address space */
> +    sec_bus->isol_as_mem = pci_isol_as_mem(dev);
> +    sec_bus->isol_as_io = pci_isol_as_io(dev);
> +
>      br->windows = pci_bridge_region_init(br);
>      QLIST_INIT(&sec_bus->child);
>      QLIST_INSERT_HEAD(&parent->child, sec_bus, sibling);
> -- 
> 2.20.1



^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH v5 01/18] configure, meson: override C compiler for cmake
  2022-01-19 21:41 ` [PATCH v5 01/18] configure, meson: override C compiler for cmake Jagannathan Raman
@ 2022-01-20 13:27   ` Paolo Bonzini
  2022-01-20 15:21     ` Jag Raman
  2022-02-17  6:10     ` Jag Raman
  0 siblings, 2 replies; 99+ messages in thread
From: Paolo Bonzini @ 2022-01-20 13:27 UTC (permalink / raw)
  To: Jagannathan Raman, qemu-devel
  Cc: eduardo, elena.ufimtseva, john.g.johnson, berrange, bleal,
	john.levon, mst, armbru, f4bug, quintela, marcandre.lureau,
	stefanha, thanos.makatos, eblake, dgilbert

On 1/19/22 22:41, Jagannathan Raman wrote:
> The compiler path that cmake gets from meson is corrupted. It results in
> the following error:
> | -- The C compiler identification is unknown
> | CMake Error at CMakeLists.txt:35 (project):
> | The CMAKE_C_COMPILER:
> | /opt/rh/devtoolset-9/root/bin/cc;-m64;-mcx16
> | is not a full path to an existing compiler tool.
> 
> Explicitly specify the C compiler for cmake to avoid this error
> 
> Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
> Acked-by: Paolo Bonzini <pbonzini@redhat.com>

This should not be needed anymore, as the bug in Meson has been fixed.

Paolo

>   configure | 2 ++
>   1 file changed, 2 insertions(+)
> 
> diff --git a/configure b/configure
> index e1a31fb332..6a865f8713 100755
> --- a/configure
> +++ b/configure
> @@ -3747,6 +3747,8 @@ if test "$skip_meson" = no; then
>     echo "cpp_args = [$(meson_quote $CXXFLAGS $EXTRA_CXXFLAGS)]" >> $cross
>     echo "c_link_args = [$(meson_quote $CFLAGS $LDFLAGS $EXTRA_CFLAGS $EXTRA_LDFLAGS)]" >> $cross
>     echo "cpp_link_args = [$(meson_quote $CXXFLAGS $LDFLAGS $EXTRA_CXXFLAGS $EXTRA_LDFLAGS)]" >> $cross
> +  echo "[cmake]" >> $cross
> +  echo "CMAKE_C_COMPILER = [$(meson_quote $cc $CPU_CFLAGS)]" >> $cross
>     echo "[binaries]" >> $cross
>     echo "c = [$(meson_quote $cc $CPU_CFLAGS)]" >> $cross
>     test -n "$cxx" && echo "cpp = [$(meson_quote $cxx $CPU_CFLAGS)]" >> $cross



^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH v5 03/18] pci: isolated address space for PCI bus
  2022-01-20  0:12   ` Michael S. Tsirkin
@ 2022-01-20 15:20     ` Jag Raman
  2022-01-25 18:38       ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 99+ messages in thread
From: Jag Raman @ 2022-01-20 15:20 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: eduardo, Elena Ufimtseva, John Johnson, Daniel P. Berrangé,
	Beraldo Leal, quintela, armbru, john.levon, qemu-devel,
	Philippe Mathieu-Daudé,
	Marc-André Lureau, dgilbert, Stefan Hajnoczi,
	thanos.makatos, Paolo Bonzini, Eric Blake



> On Jan 19, 2022, at 7:12 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> 
> On Wed, Jan 19, 2022 at 04:41:52PM -0500, Jagannathan Raman wrote:
>> Allow PCI buses to be part of isolated CPU address spaces. This has a
>> niche usage.
>> 
>> TYPE_REMOTE_MACHINE allows multiple VMs to house their PCI devices in
>> the same machine/server. This would cause address space collision as
>> well as be a security vulnerability. Having separate address spaces for
>> each PCI bus would solve this problem.
> 
> Fascinating, but I am not sure I understand. any examples?

Hi Michael!

multiprocess QEMU and vfio-user implement a client-server model to allow
out-of-process emulation of devices. The client QEMU, which makes ioctls
to the kernel and runs VCPUs, could attach devices running in a server
QEMU. The server QEMU needs access to parts of the client’s RAM to
perform DMA.

In the case where multiple clients attach devices that are running on the
same server, we need to ensure that each devices has isolated memory
ranges. This ensures that the memory space of one device is not visible
to other devices in the same server.
 
> 
> I also wonder whether this special type could be modelled like a special
> kind of iommu internally.

Could you please provide some more details on the design?

> 
>> Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
>> Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
>> Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
>> ---
>> include/hw/pci/pci.h     |  2 ++
>> include/hw/pci/pci_bus.h | 17 +++++++++++++++++
>> hw/pci/pci.c             | 17 +++++++++++++++++
>> hw/pci/pci_bridge.c      |  5 +++++
>> 4 files changed, 41 insertions(+)
>> 
>> diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
>> index 023abc0f79..9bb4472abc 100644
>> --- a/include/hw/pci/pci.h
>> +++ b/include/hw/pci/pci.h
>> @@ -387,6 +387,8 @@ void pci_device_save(PCIDevice *s, QEMUFile *f);
>> int pci_device_load(PCIDevice *s, QEMUFile *f);
>> MemoryRegion *pci_address_space(PCIDevice *dev);
>> MemoryRegion *pci_address_space_io(PCIDevice *dev);
>> +AddressSpace *pci_isol_as_mem(PCIDevice *dev);
>> +AddressSpace *pci_isol_as_io(PCIDevice *dev);
>> 
>> /*
>>  * Should not normally be used by devices. For use by sPAPR target
>> diff --git a/include/hw/pci/pci_bus.h b/include/hw/pci/pci_bus.h
>> index 347440d42c..d78258e79e 100644
>> --- a/include/hw/pci/pci_bus.h
>> +++ b/include/hw/pci/pci_bus.h
>> @@ -39,9 +39,26 @@ struct PCIBus {
>>     void *irq_opaque;
>>     PCIDevice *devices[PCI_SLOT_MAX * PCI_FUNC_MAX];
>>     PCIDevice *parent_dev;
>> +
>>     MemoryRegion *address_space_mem;
>>     MemoryRegion *address_space_io;
>> 
>> +    /**
>> +     * Isolated address spaces - these allow the PCI bus to be part
>> +     * of an isolated address space as opposed to the global
>> +     * address_space_memory & address_space_io.
> 
> Are you sure address_space_memory & address_space_io are
> always global? even in the case of an iommu?

On the CPU side of the Root Complex, I believe address_space_memory
& address_space_io are global.

In the vfio-user case, devices on the same machine (TYPE_REMOTE_MACHINE)
could be attached to different clients VMs. Each client would have their own address
space for their CPUs. With isolated address spaces, we ensure that the devices
see the address space of the CPUs they’re attached to.

Not sure if it’s OK to share weblinks in this mailing list, please let me know if that’s
not preferred. But I’m referring to the terminology used in the following block diagram:
https://en.wikipedia.org/wiki/Root_complex#/media/File:Example_PCI_Express_Topology.svg

> 
>> This allows the
>> +     * bus to be attached to CPUs from different machines. The
>> +     * following is not used used commonly.
>> +     *
>> +     * TYPE_REMOTE_MACHINE allows emulating devices from multiple
>> +     * VM clients,
> 
> what are VM clients?

It’s the client in the client - server model explained above.

Thank you!
--
Jag

> 
>> as such it needs the PCI buses in the same machine
>> +     * to be part of different CPU address spaces. The following is
>> +     * useful in that scenario.
>> +     *
>> +     */
>> +    AddressSpace *isol_as_mem;
>> +    AddressSpace *isol_as_io;
>> +
>>     QLIST_HEAD(, PCIBus) child; /* this will be replaced by qdev later */
>>     QLIST_ENTRY(PCIBus) sibling;/* this will be replaced by qdev later */
>> 
>> diff --git a/hw/pci/pci.c b/hw/pci/pci.c
>> index 5d30f9ca60..d5f1c6c421 100644
>> --- a/hw/pci/pci.c
>> +++ b/hw/pci/pci.c
>> @@ -442,6 +442,8 @@ static void pci_root_bus_internal_init(PCIBus *bus, DeviceState *parent,
>>     bus->slot_reserved_mask = 0x0;
>>     bus->address_space_mem = address_space_mem;
>>     bus->address_space_io = address_space_io;
>> +    bus->isol_as_mem = NULL;
>> +    bus->isol_as_io = NULL;
>>     bus->flags |= PCI_BUS_IS_ROOT;
>> 
>>     /* host bridge */
>> @@ -2676,6 +2678,16 @@ MemoryRegion *pci_address_space_io(PCIDevice *dev)
>>     return pci_get_bus(dev)->address_space_io;
>> }
>> 
>> +AddressSpace *pci_isol_as_mem(PCIDevice *dev)
>> +{
>> +    return pci_get_bus(dev)->isol_as_mem;
>> +}
>> +
>> +AddressSpace *pci_isol_as_io(PCIDevice *dev)
>> +{
>> +    return pci_get_bus(dev)->isol_as_io;
>> +}
>> +
>> static void pci_device_class_init(ObjectClass *klass, void *data)
>> {
>>     DeviceClass *k = DEVICE_CLASS(klass);
>> @@ -2699,6 +2711,7 @@ static void pci_device_class_base_init(ObjectClass *klass, void *data)
>> 
>> AddressSpace *pci_device_iommu_address_space(PCIDevice *dev)
>> {
>> +    AddressSpace *iommu_as = NULL;
>>     PCIBus *bus = pci_get_bus(dev);
>>     PCIBus *iommu_bus = bus;
>>     uint8_t devfn = dev->devfn;
>> @@ -2745,6 +2758,10 @@ AddressSpace *pci_device_iommu_address_space(PCIDevice *dev)
>>     if (!pci_bus_bypass_iommu(bus) && iommu_bus && iommu_bus->iommu_fn) {
>>         return iommu_bus->iommu_fn(bus, iommu_bus->iommu_opaque, devfn);
>>     }
>> +    iommu_as = pci_isol_as_mem(dev);
>> +    if (iommu_as) {
>> +        return iommu_as;
>> +    }
>>     return &address_space_memory;
>> }
>> 
>> diff --git a/hw/pci/pci_bridge.c b/hw/pci/pci_bridge.c
>> index da34c8ebcd..98366768d2 100644
>> --- a/hw/pci/pci_bridge.c
>> +++ b/hw/pci/pci_bridge.c
>> @@ -383,6 +383,11 @@ void pci_bridge_initfn(PCIDevice *dev, const char *typename)
>>     sec_bus->address_space_io = &br->address_space_io;
>>     memory_region_init(&br->address_space_io, OBJECT(br), "pci_bridge_io",
>>                        4 * GiB);
>> +
>> +    /* This PCI bridge puts the sec_bus in its parent's address space */
>> +    sec_bus->isol_as_mem = pci_isol_as_mem(dev);
>> +    sec_bus->isol_as_io = pci_isol_as_io(dev);
>> +
>>     br->windows = pci_bridge_region_init(br);
>>     QLIST_INIT(&sec_bus->child);
>>     QLIST_INSERT_HEAD(&parent->child, sec_bus, sibling);
>> -- 
>> 2.20.1


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH v5 01/18] configure, meson: override C compiler for cmake
  2022-01-20 13:27   ` Paolo Bonzini
@ 2022-01-20 15:21     ` Jag Raman
  2022-02-17  6:10     ` Jag Raman
  1 sibling, 0 replies; 99+ messages in thread
From: Jag Raman @ 2022-01-20 15:21 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: eduardo, Elena Ufimtseva, berrange, bleal, John Johnson,
	john.levon, qemu-devel, armbru, quintela, mst, stefanha,
	thanos.makatos, marcandre.lureau, eblake, dgilbert, f4bug



> On Jan 20, 2022, at 8:27 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> 
> On 1/19/22 22:41, Jagannathan Raman wrote:
>> The compiler path that cmake gets from meson is corrupted. It results in
>> the following error:
>> | -- The C compiler identification is unknown
>> | CMake Error at CMakeLists.txt:35 (project):
>> | The CMAKE_C_COMPILER:
>> | /opt/rh/devtoolset-9/root/bin/cc;-m64;-mcx16
>> | is not a full path to an existing compiler tool.
>> Explicitly specify the C compiler for cmake to avoid this error
>> Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
>> Acked-by: Paolo Bonzini <pbonzini@redhat.com>
> 
> This should not be needed anymore, as the bug in Meson has been fixed.

OK, will drop this patch.

Thank you!

> 
> Paolo
> 
>>  configure | 2 ++
>>  1 file changed, 2 insertions(+)
>> diff --git a/configure b/configure
>> index e1a31fb332..6a865f8713 100755
>> --- a/configure
>> +++ b/configure
>> @@ -3747,6 +3747,8 @@ if test "$skip_meson" = no; then
>>    echo "cpp_args = [$(meson_quote $CXXFLAGS $EXTRA_CXXFLAGS)]" >> $cross
>>    echo "c_link_args = [$(meson_quote $CFLAGS $LDFLAGS $EXTRA_CFLAGS $EXTRA_LDFLAGS)]" >> $cross
>>    echo "cpp_link_args = [$(meson_quote $CXXFLAGS $LDFLAGS $EXTRA_CXXFLAGS $EXTRA_LDFLAGS)]" >> $cross
>> +  echo "[cmake]" >> $cross
>> +  echo "CMAKE_C_COMPILER = [$(meson_quote $cc $CPU_CFLAGS)]" >> $cross
>>    echo "[binaries]" >> $cross
>>    echo "c = [$(meson_quote $cc $CPU_CFLAGS)]" >> $cross
>>    test -n "$cxx" && echo "cpp = [$(meson_quote $cxx $CPU_CFLAGS)]" >> $cross
> 



^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH v5 02/18] tests/avocado: Specify target VM argument to helper routines
  2022-01-19 21:41 ` [PATCH v5 02/18] tests/avocado: Specify target VM argument to helper routines Jagannathan Raman
@ 2022-01-25  9:40   ` Stefan Hajnoczi
  0 siblings, 0 replies; 99+ messages in thread
From: Stefan Hajnoczi @ 2022-01-25  9:40 UTC (permalink / raw)
  To: Jagannathan Raman
  Cc: eduardo, elena.ufimtseva, john.g.johnson, berrange, bleal,
	john.levon, mst, armbru, quintela, qemu-devel, f4bug,
	marcandre.lureau, thanos.makatos, pbonzini, eblake, dgilbert

[-- Attachment #1: Type: text/plain, Size: 627 bytes --]

On Wed, Jan 19, 2022 at 04:41:51PM -0500, Jagannathan Raman wrote:
> Specify target VM for exec_command and
> exec_command_and_wait_for_pattern routines
> 
> Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
> Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
> Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
> Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
> Reviewed-by: Beraldo Leal <bleal@redhat.com>
> ---
>  tests/avocado/avocado_qemu/__init__.py | 14 ++++++++++----
>  1 file changed, 10 insertions(+), 4 deletions(-)

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH v5 03/18] pci: isolated address space for PCI bus
  2022-01-19 21:41 ` [PATCH v5 03/18] pci: isolated address space for PCI bus Jagannathan Raman
  2022-01-20  0:12   ` Michael S. Tsirkin
@ 2022-01-25  9:56   ` Stefan Hajnoczi
  2022-01-25 13:49     ` Jag Raman
  1 sibling, 1 reply; 99+ messages in thread
From: Stefan Hajnoczi @ 2022-01-25  9:56 UTC (permalink / raw)
  To: Jagannathan Raman
  Cc: eduardo, elena.ufimtseva, john.g.johnson, berrange, bleal,
	john.levon, mst, armbru, quintela, qemu-devel, f4bug,
	marcandre.lureau, thanos.makatos, pbonzini, eblake, dgilbert

[-- Attachment #1: Type: text/plain, Size: 5752 bytes --]

On Wed, Jan 19, 2022 at 04:41:52PM -0500, Jagannathan Raman wrote:
> Allow PCI buses to be part of isolated CPU address spaces. This has a
> niche usage.
> 
> TYPE_REMOTE_MACHINE allows multiple VMs to house their PCI devices in
> the same machine/server. This would cause address space collision as
> well as be a security vulnerability. Having separate address spaces for
> each PCI bus would solve this problem.
> 
> Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
> Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
> Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
> ---
>  include/hw/pci/pci.h     |  2 ++
>  include/hw/pci/pci_bus.h | 17 +++++++++++++++++
>  hw/pci/pci.c             | 17 +++++++++++++++++
>  hw/pci/pci_bridge.c      |  5 +++++
>  4 files changed, 41 insertions(+)
> 
> diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
> index 023abc0f79..9bb4472abc 100644
> --- a/include/hw/pci/pci.h
> +++ b/include/hw/pci/pci.h
> @@ -387,6 +387,8 @@ void pci_device_save(PCIDevice *s, QEMUFile *f);
>  int pci_device_load(PCIDevice *s, QEMUFile *f);
>  MemoryRegion *pci_address_space(PCIDevice *dev);
>  MemoryRegion *pci_address_space_io(PCIDevice *dev);
> +AddressSpace *pci_isol_as_mem(PCIDevice *dev);
> +AddressSpace *pci_isol_as_io(PCIDevice *dev);
>  
>  /*
>   * Should not normally be used by devices. For use by sPAPR target
> diff --git a/include/hw/pci/pci_bus.h b/include/hw/pci/pci_bus.h
> index 347440d42c..d78258e79e 100644
> --- a/include/hw/pci/pci_bus.h
> +++ b/include/hw/pci/pci_bus.h
> @@ -39,9 +39,26 @@ struct PCIBus {
>      void *irq_opaque;
>      PCIDevice *devices[PCI_SLOT_MAX * PCI_FUNC_MAX];
>      PCIDevice *parent_dev;
> +
>      MemoryRegion *address_space_mem;
>      MemoryRegion *address_space_io;

This seems like a good point to rename address_space_mem,
address_space_io, as well as PCIIORegion->address_space since they are
all MemoryRegions and not AddressSpaces. Names could be
mem_space_mr/io_space_mr and PCIIORegion->container_mr. This avoids
confusion with the actual AddressSpaces that are introduced in this
patch.

>  
> +    /**
> +     * Isolated address spaces - these allow the PCI bus to be part
> +     * of an isolated address space as opposed to the global
> +     * address_space_memory & address_space_io. This allows the
> +     * bus to be attached to CPUs from different machines. The
> +     * following is not used used commonly.
> +     *
> +     * TYPE_REMOTE_MACHINE allows emulating devices from multiple
> +     * VM clients, as such it needs the PCI buses in the same machine
> +     * to be part of different CPU address spaces. The following is
> +     * useful in that scenario.
> +     *
> +     */
> +    AddressSpace *isol_as_mem;
> +    AddressSpace *isol_as_io;

Or use the pointers unconditionally and initialize them to the global
address_space_memory/address_space_io? That might simplify the code so
isolated address spaces is no longer a special case.

isol_as_io isn't used by this patch?

> +
>      QLIST_HEAD(, PCIBus) child; /* this will be replaced by qdev later */
>      QLIST_ENTRY(PCIBus) sibling;/* this will be replaced by qdev later */
>  
> diff --git a/hw/pci/pci.c b/hw/pci/pci.c
> index 5d30f9ca60..d5f1c6c421 100644
> --- a/hw/pci/pci.c
> +++ b/hw/pci/pci.c
> @@ -442,6 +442,8 @@ static void pci_root_bus_internal_init(PCIBus *bus, DeviceState *parent,
>      bus->slot_reserved_mask = 0x0;
>      bus->address_space_mem = address_space_mem;
>      bus->address_space_io = address_space_io;
> +    bus->isol_as_mem = NULL;
> +    bus->isol_as_io = NULL;
>      bus->flags |= PCI_BUS_IS_ROOT;
>  
>      /* host bridge */
> @@ -2676,6 +2678,16 @@ MemoryRegion *pci_address_space_io(PCIDevice *dev)
>      return pci_get_bus(dev)->address_space_io;
>  }
>  
> +AddressSpace *pci_isol_as_mem(PCIDevice *dev)
> +{
> +    return pci_get_bus(dev)->isol_as_mem;
> +}
> +
> +AddressSpace *pci_isol_as_io(PCIDevice *dev)
> +{
> +    return pci_get_bus(dev)->isol_as_io;
> +}
> +
>  static void pci_device_class_init(ObjectClass *klass, void *data)
>  {
>      DeviceClass *k = DEVICE_CLASS(klass);
> @@ -2699,6 +2711,7 @@ static void pci_device_class_base_init(ObjectClass *klass, void *data)
>  
>  AddressSpace *pci_device_iommu_address_space(PCIDevice *dev)
>  {
> +    AddressSpace *iommu_as = NULL;
>      PCIBus *bus = pci_get_bus(dev);
>      PCIBus *iommu_bus = bus;
>      uint8_t devfn = dev->devfn;
> @@ -2745,6 +2758,10 @@ AddressSpace *pci_device_iommu_address_space(PCIDevice *dev)
>      if (!pci_bus_bypass_iommu(bus) && iommu_bus && iommu_bus->iommu_fn) {
>          return iommu_bus->iommu_fn(bus, iommu_bus->iommu_opaque, devfn);
>      }
> +    iommu_as = pci_isol_as_mem(dev);
> +    if (iommu_as) {
> +        return iommu_as;
> +    }
>      return &address_space_memory;
>  }
>  

> diff --git a/hw/pci/pci_bridge.c b/hw/pci/pci_bridge.c
> index da34c8ebcd..98366768d2 100644
> --- a/hw/pci/pci_bridge.c
> +++ b/hw/pci/pci_bridge.c
> @@ -383,6 +383,11 @@ void pci_bridge_initfn(PCIDevice *dev, const char *typename)
>      sec_bus->address_space_io = &br->address_space_io;
>      memory_region_init(&br->address_space_io, OBJECT(br), "pci_bridge_io",
>                         4 * GiB);
> +
> +    /* This PCI bridge puts the sec_bus in its parent's address space */
> +    sec_bus->isol_as_mem = pci_isol_as_mem(dev);
> +    sec_bus->isol_as_io = pci_isol_as_io(dev);
> +
>      br->windows = pci_bridge_region_init(br);
>      QLIST_INIT(&sec_bus->child);
>      QLIST_INSERT_HEAD(&parent->child, sec_bus, sibling);
> -- 
> 2.20.1
> 

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH v5 04/18] pci: create and free isolated PCI buses
  2022-01-19 21:41 ` [PATCH v5 04/18] pci: create and free isolated PCI buses Jagannathan Raman
@ 2022-01-25 10:25   ` Stefan Hajnoczi
  2022-01-25 14:10     ` Jag Raman
  0 siblings, 1 reply; 99+ messages in thread
From: Stefan Hajnoczi @ 2022-01-25 10:25 UTC (permalink / raw)
  To: Jagannathan Raman
  Cc: eduardo, elena.ufimtseva, john.g.johnson, berrange, bleal,
	john.levon, mst, armbru, quintela, qemu-devel, f4bug,
	marcandre.lureau, thanos.makatos, pbonzini, eblake, dgilbert

[-- Attachment #1: Type: text/plain, Size: 5223 bytes --]

On Wed, Jan 19, 2022 at 04:41:53PM -0500, Jagannathan Raman wrote:
> Adds pci_isol_bus_new() and pci_isol_bus_free() functions to manage
> creation and destruction of isolated PCI buses. Also adds qdev_get_bus
> and qdev_put_bus callbacks to allow the choice of parent bus.
> 
> Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
> Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
> Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
> ---
>  include/hw/pci/pci.h   |   4 +
>  include/hw/qdev-core.h |  16 ++++
>  hw/pci/pci.c           | 169 +++++++++++++++++++++++++++++++++++++++++
>  softmmu/qdev-monitor.c |  39 +++++++++-
>  4 files changed, 225 insertions(+), 3 deletions(-)
> 
> diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
> index 9bb4472abc..8c18f10d9d 100644
> --- a/include/hw/pci/pci.h
> +++ b/include/hw/pci/pci.h
> @@ -452,6 +452,10 @@ PCIDevice *pci_nic_init_nofail(NICInfo *nd, PCIBus *rootbus,
>  
>  PCIDevice *pci_vga_init(PCIBus *bus);
>  
> +PCIBus *pci_isol_bus_new(BusState *parent_bus, const char *new_bus_type,
> +                         Error **errp);
> +bool pci_isol_bus_free(PCIBus *pci_bus, Error **errp);
> +
>  static inline PCIBus *pci_get_bus(const PCIDevice *dev)
>  {
>      return PCI_BUS(qdev_get_parent_bus(DEVICE(dev)));
> diff --git a/include/hw/qdev-core.h b/include/hw/qdev-core.h
> index 92c3d65208..eed2983072 100644
> --- a/include/hw/qdev-core.h
> +++ b/include/hw/qdev-core.h
> @@ -419,6 +419,20 @@ void qdev_simple_device_unplug_cb(HotplugHandler *hotplug_dev,
>  void qdev_machine_creation_done(void);
>  bool qdev_machine_modified(void);
>  
> +/**
> + * Find parent bus - these callbacks are used during device addition
> + * and deletion.
> + *
> + * During addition, if no parent bus is specified in the options,
> + * these callbacks provide a way to figure it out based on the
> + * bus type. If these callbacks are not defined, defaults to
> + * finding the parent bus starting from default system bus
> + */
> +typedef bool (QDevGetBusFunc)(const char *type, BusState **bus, Error **errp);
> +typedef bool (QDevPutBusFunc)(BusState *bus, Error **errp);
> +bool qdev_set_bus_cbs(QDevGetBusFunc *get_bus, QDevPutBusFunc *put_bus,
> +                      Error **errp);

Where is this used, it doesn't seem related to pci_isol_bus_new()?

> +
>  /**
>   * GpioPolarity: Polarity of a GPIO line
>   *
> @@ -691,6 +705,8 @@ BusState *qdev_get_parent_bus(DeviceState *dev);
>  /*** BUS API. ***/
>  
>  DeviceState *qdev_find_recursive(BusState *bus, const char *id);
> +BusState *qbus_find_recursive(BusState *bus, const char *name,
> +                              const char *bus_typename);
>  
>  /* Returns 0 to walk children, > 0 to skip walk, < 0 to terminate walk. */
>  typedef int (qbus_walkerfn)(BusState *bus, void *opaque);
> diff --git a/hw/pci/pci.c b/hw/pci/pci.c
> index d5f1c6c421..63ec1e47b5 100644
> --- a/hw/pci/pci.c
> +++ b/hw/pci/pci.c
> @@ -493,6 +493,175 @@ void pci_root_bus_cleanup(PCIBus *bus)
>      qbus_unrealize(BUS(bus));
>  }
>  
> +static void pci_bus_free_isol_mem(PCIBus *pci_bus)
> +{
> +    if (pci_bus->address_space_mem) {
> +        memory_region_unref(pci_bus->address_space_mem);

memory_region_unref() already does a NULL pointer check so the if
statements in this function aren't needed.

> +        pci_bus->address_space_mem = NULL;
> +    }
> +
> +    if (pci_bus->isol_as_mem) {
> +        address_space_destroy(pci_bus->isol_as_mem);
> +        pci_bus->isol_as_mem = NULL;
> +    }
> +
> +    if (pci_bus->address_space_io) {
> +        memory_region_unref(pci_bus->address_space_io);
> +        pci_bus->address_space_io = NULL;
> +    }
> +
> +    if (pci_bus->isol_as_io) {
> +        address_space_destroy(pci_bus->isol_as_io);
> +        pci_bus->isol_as_io = NULL;
> +    }
> +}
> +
> +static void pci_bus_init_isol_mem(PCIBus *pci_bus, uint32_t unique_id)
> +{
> +    g_autofree char *mem_mr_name = NULL;
> +    g_autofree char *mem_as_name = NULL;
> +    g_autofree char *io_mr_name = NULL;
> +    g_autofree char *io_as_name = NULL;
> +
> +    if (!pci_bus) {
> +        return;
> +    }
> +
> +    mem_mr_name = g_strdup_printf("mem-mr-%u", unique_id);
> +    mem_as_name = g_strdup_printf("mem-as-%u", unique_id);
> +    io_mr_name = g_strdup_printf("io-mr-%u", unique_id);
> +    io_as_name = g_strdup_printf("io-as-%u", unique_id);
> +
> +    pci_bus->address_space_mem = g_malloc0(sizeof(MemoryRegion));
> +    pci_bus->isol_as_mem = g_malloc0(sizeof(AddressSpace));
> +    memory_region_init(pci_bus->address_space_mem, NULL,
> +                       mem_mr_name, UINT64_MAX);
> +    address_space_init(pci_bus->isol_as_mem,
> +                       pci_bus->address_space_mem, mem_as_name);
> +
> +    pci_bus->address_space_io = g_malloc0(sizeof(MemoryRegion));
> +    pci_bus->isol_as_io = g_malloc0(sizeof(AddressSpace));

Where are address_space_mem, isol_as_mem, address_space_io, and
isol_as_io freed? I think the unref calls won't free them because the
objects were created with object_initialize() instead of object_new().

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH v5 05/18] qdev: unplug blocker for devices
  2022-01-19 21:41 ` [PATCH v5 05/18] qdev: unplug blocker for devices Jagannathan Raman
@ 2022-01-25 10:27   ` Stefan Hajnoczi
  2022-01-25 14:43     ` Jag Raman
  0 siblings, 1 reply; 99+ messages in thread
From: Stefan Hajnoczi @ 2022-01-25 10:27 UTC (permalink / raw)
  To: Jagannathan Raman
  Cc: eduardo, elena.ufimtseva, john.g.johnson, berrange, bleal,
	john.levon, mst, armbru, quintela, qemu-devel, f4bug,
	marcandre.lureau, thanos.makatos, pbonzini, eblake, dgilbert

[-- Attachment #1: Type: text/plain, Size: 2676 bytes --]

On Wed, Jan 19, 2022 at 04:41:54PM -0500, Jagannathan Raman wrote:
> Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
> Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
> Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
> ---
>  include/hw/qdev-core.h |  5 +++++
>  softmmu/qdev-monitor.c | 35 +++++++++++++++++++++++++++++++++++
>  2 files changed, 40 insertions(+)
> 
> diff --git a/include/hw/qdev-core.h b/include/hw/qdev-core.h
> index eed2983072..67df5e0081 100644
> --- a/include/hw/qdev-core.h
> +++ b/include/hw/qdev-core.h
> @@ -193,6 +193,7 @@ struct DeviceState {
>      int instance_id_alias;
>      int alias_required_for_version;
>      ResettableState reset;
> +    GSList *unplug_blockers;
>  };
>  
>  struct DeviceListener {
> @@ -433,6 +434,10 @@ typedef bool (QDevPutBusFunc)(BusState *bus, Error **errp);
>  bool qdev_set_bus_cbs(QDevGetBusFunc *get_bus, QDevPutBusFunc *put_bus,
>                        Error **errp);
>  
> +int qdev_add_unplug_blocker(DeviceState *dev, Error *reason, Error **errp);
> +void qdev_del_unplug_blocker(DeviceState *dev, Error *reason);
> +bool qdev_unplug_blocked(DeviceState *dev, Error **errp);
> +
>  /**
>   * GpioPolarity: Polarity of a GPIO line
>   *
> diff --git a/softmmu/qdev-monitor.c b/softmmu/qdev-monitor.c
> index 7306074019..1a169f89a2 100644
> --- a/softmmu/qdev-monitor.c
> +++ b/softmmu/qdev-monitor.c
> @@ -978,10 +978,45 @@ void qmp_device_del(const char *id, Error **errp)
>              return;
>          }
>  
> +        if (qdev_unplug_blocked(dev, errp)) {
> +            return;
> +        }
> +
>          qdev_unplug(dev, errp);
>      }
>  }
>  
> +int qdev_add_unplug_blocker(DeviceState *dev, Error *reason, Error **errp)
> +{
> +    ERRP_GUARD();
> +
> +    if (!migration_is_idle()) {
> +        error_setg(errp, "migration is in progress");
> +        return -EBUSY;
> +    }

Why can this function not be called during migration?

> +
> +    dev->unplug_blockers = g_slist_prepend(dev->unplug_blockers, reason);
> +
> +    return 0;
> +}
> +
> +void qdev_del_unplug_blocker(DeviceState *dev, Error *reason)
> +{
> +    dev->unplug_blockers = g_slist_remove(dev->unplug_blockers, reason);
> +}
> +
> +bool qdev_unplug_blocked(DeviceState *dev, Error **errp)
> +{
> +    ERRP_GUARD();
> +
> +    if (dev->unplug_blockers) {
> +        error_propagate(errp, error_copy(dev->unplug_blockers->data));
> +        return true;
> +    }
> +
> +    return false;
> +}
> +
>  void hmp_device_add(Monitor *mon, const QDict *qdict)
>  {
>      Error *err = NULL;
> -- 
> 2.20.1
> 

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH v5 06/18] vfio-user: add HotplugHandler for remote machine
  2022-01-19 21:41 ` [PATCH v5 06/18] vfio-user: add HotplugHandler for remote machine Jagannathan Raman
@ 2022-01-25 10:32   ` Stefan Hajnoczi
  2022-01-25 18:12     ` Jag Raman
  0 siblings, 1 reply; 99+ messages in thread
From: Stefan Hajnoczi @ 2022-01-25 10:32 UTC (permalink / raw)
  To: Jagannathan Raman
  Cc: eduardo, elena.ufimtseva, john.g.johnson, berrange, bleal,
	john.levon, mst, armbru, quintela, qemu-devel, f4bug,
	marcandre.lureau, thanos.makatos, pbonzini, eblake, dgilbert

[-- Attachment #1: Type: text/plain, Size: 2584 bytes --]

On Wed, Jan 19, 2022 at 04:41:55PM -0500, Jagannathan Raman wrote:
> Allow hotplugging of PCI(e) devices to remote machine
> 
> Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
> Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
> Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
> ---
>  hw/remote/machine.c | 29 +++++++++++++++++++++++++++++
>  1 file changed, 29 insertions(+)

Why is this code necessary? I expected the default hotplug behavior to
pretty much handle this case - hotplugging device types that the bus
doesn't support should fail and unplug should already unparent/unrealize
the device.

> 
> diff --git a/hw/remote/machine.c b/hw/remote/machine.c
> index 952105eab5..220ff01aa9 100644
> --- a/hw/remote/machine.c
> +++ b/hw/remote/machine.c
> @@ -54,14 +54,39 @@ static void remote_machine_init(MachineState *machine)
>  
>      pci_bus_irqs(pci_host->bus, remote_iohub_set_irq, remote_iohub_map_irq,
>                   &s->iohub, REMOTE_IOHUB_NB_PIRQS);
> +
> +    qbus_set_hotplug_handler(BUS(pci_host->bus), OBJECT(s));
> +}
> +
> +static void remote_machine_pre_plug_cb(HotplugHandler *hotplug_dev,
> +                                       DeviceState *dev, Error **errp)
> +{
> +    if (!object_dynamic_cast(OBJECT(dev), TYPE_PCI_DEVICE)) {
> +        error_setg(errp, "Only allowing PCI hotplug");
> +    }
> +}
> +
> +static void remote_machine_unplug_cb(HotplugHandler *hotplug_dev,
> +                                     DeviceState *dev, Error **errp)
> +{
> +    if (!object_dynamic_cast(OBJECT(dev), TYPE_PCI_DEVICE)) {
> +        error_setg(errp, "Only allowing PCI hot-unplug");
> +        return;
> +    }
> +
> +    qdev_unrealize(dev);
>  }
>  
>  static void remote_machine_class_init(ObjectClass *oc, void *data)
>  {
>      MachineClass *mc = MACHINE_CLASS(oc);
> +    HotplugHandlerClass *hc = HOTPLUG_HANDLER_CLASS(oc);
>  
>      mc->init = remote_machine_init;
>      mc->desc = "Experimental remote machine";
> +
> +    hc->pre_plug = remote_machine_pre_plug_cb;
> +    hc->unplug = remote_machine_unplug_cb;
>  }
>  
>  static const TypeInfo remote_machine = {
> @@ -69,6 +94,10 @@ static const TypeInfo remote_machine = {
>      .parent = TYPE_MACHINE,
>      .instance_size = sizeof(RemoteMachineState),
>      .class_init = remote_machine_class_init,
> +    .interfaces = (InterfaceInfo[]) {
> +        { TYPE_HOTPLUG_HANDLER },
> +        { }
> +    }
>  };
>  
>  static void remote_machine_register_types(void)
> -- 
> 2.20.1
> 

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH v5 07/18] vfio-user: set qdev bus callbacks for remote machine
  2022-01-19 21:41 ` [PATCH v5 07/18] vfio-user: set qdev bus callbacks " Jagannathan Raman
@ 2022-01-25 10:44   ` Stefan Hajnoczi
  2022-01-25 21:12     ` Jag Raman
  0 siblings, 1 reply; 99+ messages in thread
From: Stefan Hajnoczi @ 2022-01-25 10:44 UTC (permalink / raw)
  To: Jagannathan Raman
  Cc: eduardo, elena.ufimtseva, john.g.johnson, berrange, bleal,
	john.levon, mst, armbru, quintela, qemu-devel, f4bug,
	marcandre.lureau, thanos.makatos, pbonzini, eblake, dgilbert

[-- Attachment #1: Type: text/plain, Size: 3684 bytes --]

On Wed, Jan 19, 2022 at 04:41:56PM -0500, Jagannathan Raman wrote:
> Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
> Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
> Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
> ---
>  hw/remote/machine.c | 57 +++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 57 insertions(+)
> 
> diff --git a/hw/remote/machine.c b/hw/remote/machine.c
> index 220ff01aa9..221a8430c1 100644
> --- a/hw/remote/machine.c
> +++ b/hw/remote/machine.c
> @@ -22,6 +22,60 @@
>  #include "hw/pci/pci_host.h"
>  #include "hw/remote/iohub.h"
>  
> +static bool remote_machine_get_bus(const char *type, BusState **bus,
> +                                   Error **errp)
> +{
> +    ERRP_GUARD();
> +    RemoteMachineState *s = REMOTE_MACHINE(current_machine);
> +    BusState *root_bus = NULL;
> +    PCIBus *new_pci_bus = NULL;
> +
> +    if (!bus) {
> +        error_setg(errp, "Invalid argument");
> +        return false;
> +    }
> +
> +    if (strcmp(type, TYPE_PCI_BUS) && strcmp(type, TYPE_PCI_BUS)) {
> +        return true;
> +    }
> +
> +    root_bus = qbus_find_recursive(sysbus_get_default(), NULL, TYPE_PCIE_BUS);
> +    if (!root_bus) {
> +        error_setg(errp, "Unable to find root PCI device");
> +        return false;
> +    }
> +
> +    new_pci_bus = pci_isol_bus_new(root_bus, type, errp);
> +    if (!new_pci_bus) {
> +        return false;
> +    }
> +
> +    *bus = BUS(new_pci_bus);
> +
> +    pci_bus_irqs(new_pci_bus, remote_iohub_set_irq, remote_iohub_map_irq,
> +                 &s->iohub, REMOTE_IOHUB_NB_PIRQS);
> +
> +    return true;
> +}

Can the user create the same PCI bus via QMP commands? If so, then this
is just a convenience that saves the extra step. Or is there some magic
that cannot be done via QMP device_add?

I'm asking because there are 3 objects involved and I'd like to
understand the lifecycle/dependencies:
1. The PCIDevice we wish to export.
2. The PCIBus with isolated address spaces that contains the PCIDevice.
3. The vfio-user server that exports a given PCIDevice.

Users can already create the PCIDevice via hotplug and the vfio-user
server via object-add. So if there's no magic they could also create the
PCI bus:
1. device_add ...some PCI bus stuff here...,id=isol-pci-bus0
2. device_add ...the PCIDevice...,bus=isol-pci-bus0,id=mydev
3. object-add x-vfio-user-server,device=mydev

Unplug would work in the reverse order.

It may be more convenient to automatically create a PCIBus when the
PCIDevice is hotplugged, but this kind of magic also has drawbacks
(hidden devices, naming collisions, etc).

> +
> +static bool remote_machine_put_bus(BusState *bus, Error **errp)
> +{
> +    PCIBus *pci_bus = NULL;
> +
> +    if (!bus) {
> +        error_setg(errp, "Invalid argument");
> +        return false;
> +    }
> +
> +    if (!object_dynamic_cast(OBJECT(bus), TYPE_PCI_BUS)) {
> +        return true;
> +    }
> +
> +    pci_bus = PCI_BUS(bus);
> +
> +    return pci_isol_bus_free(pci_bus, errp);
> +}
> +
>  static void remote_machine_init(MachineState *machine)
>  {
>      MemoryRegion *system_memory, *system_io, *pci_memory;
> @@ -56,6 +110,9 @@ static void remote_machine_init(MachineState *machine)
>                   &s->iohub, REMOTE_IOHUB_NB_PIRQS);
>  
>      qbus_set_hotplug_handler(BUS(pci_host->bus), OBJECT(s));
> +
> +    qdev_set_bus_cbs(remote_machine_get_bus, remote_machine_put_bus,
> +                     &error_fatal);
>  }
>  
>  static void remote_machine_pre_plug_cb(HotplugHandler *hotplug_dev,
> -- 
> 2.20.1
> 

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH v5 03/18] pci: isolated address space for PCI bus
  2022-01-25  9:56   ` Stefan Hajnoczi
@ 2022-01-25 13:49     ` Jag Raman
  2022-01-25 14:19       ` Stefan Hajnoczi
  0 siblings, 1 reply; 99+ messages in thread
From: Jag Raman @ 2022-01-25 13:49 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: eduardo, Elena Ufimtseva, John Johnson, berrange, bleal,
	john.levon, mst, armbru, quintela, qemu-devel,
	Philippe Mathieu-Daudé,
	Marc-André Lureau, thanos.makatos, Paolo Bonzini, eblake,
	dgilbert



> On Jan 25, 2022, at 4:56 AM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
> 
> On Wed, Jan 19, 2022 at 04:41:52PM -0500, Jagannathan Raman wrote:
>> Allow PCI buses to be part of isolated CPU address spaces. This has a
>> niche usage.
>> 
>> TYPE_REMOTE_MACHINE allows multiple VMs to house their PCI devices in
>> the same machine/server. This would cause address space collision as
>> well as be a security vulnerability. Having separate address spaces for
>> each PCI bus would solve this problem.
>> 
>> Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
>> Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
>> Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
>> ---
>> include/hw/pci/pci.h     |  2 ++
>> include/hw/pci/pci_bus.h | 17 +++++++++++++++++
>> hw/pci/pci.c             | 17 +++++++++++++++++
>> hw/pci/pci_bridge.c      |  5 +++++
>> 4 files changed, 41 insertions(+)
>> 
>> diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
>> index 023abc0f79..9bb4472abc 100644
>> --- a/include/hw/pci/pci.h
>> +++ b/include/hw/pci/pci.h
>> @@ -387,6 +387,8 @@ void pci_device_save(PCIDevice *s, QEMUFile *f);
>> int pci_device_load(PCIDevice *s, QEMUFile *f);
>> MemoryRegion *pci_address_space(PCIDevice *dev);
>> MemoryRegion *pci_address_space_io(PCIDevice *dev);
>> +AddressSpace *pci_isol_as_mem(PCIDevice *dev);
>> +AddressSpace *pci_isol_as_io(PCIDevice *dev);
>> 
>> /*
>>  * Should not normally be used by devices. For use by sPAPR target
>> diff --git a/include/hw/pci/pci_bus.h b/include/hw/pci/pci_bus.h
>> index 347440d42c..d78258e79e 100644
>> --- a/include/hw/pci/pci_bus.h
>> +++ b/include/hw/pci/pci_bus.h
>> @@ -39,9 +39,26 @@ struct PCIBus {
>>     void *irq_opaque;
>>     PCIDevice *devices[PCI_SLOT_MAX * PCI_FUNC_MAX];
>>     PCIDevice *parent_dev;
>> +
>>     MemoryRegion *address_space_mem;
>>     MemoryRegion *address_space_io;
> 
> This seems like a good point to rename address_space_mem,
> address_space_io, as well as PCIIORegion->address_space since they are
> all MemoryRegions and not AddressSpaces. Names could be
> mem_space_mr/io_space_mr and PCIIORegion->container_mr. This avoids
> confusion with the actual AddressSpaces that are introduced in this
> patch.

Are you referring to renaming address_space_mem, address_space_io and
PCIIORegion->address_space alone? I’m asking because there are many
other symbols in the code which are named similarly.

> 
>> 
>> +    /**
>> +     * Isolated address spaces - these allow the PCI bus to be part
>> +     * of an isolated address space as opposed to the global
>> +     * address_space_memory & address_space_io. This allows the
>> +     * bus to be attached to CPUs from different machines. The
>> +     * following is not used used commonly.
>> +     *
>> +     * TYPE_REMOTE_MACHINE allows emulating devices from multiple
>> +     * VM clients, as such it needs the PCI buses in the same machine
>> +     * to be part of different CPU address spaces. The following is
>> +     * useful in that scenario.
>> +     *
>> +     */
>> +    AddressSpace *isol_as_mem;
>> +    AddressSpace *isol_as_io;
> 
> Or use the pointers unconditionally and initialize them to the global
> address_space_memory/address_space_io? That might simplify the code so
> isolated address spaces is no longer a special case.

I did start off with using these pointers unconditionally - but adopted an optional
isolated address space for the following reasons:
  - There is a potential for regression
  - CPU address space per bus is not a common scenario. In most case, all PCI
    buses are attached to CPU sharing the same address space. Therefore, an
    optional address space made sense for this special scenario

We can also set it unconditionally if you prefer, kindly confirm.

> 
> isol_as_io isn't used by this patch?

This patch introduces these variables, defines its getters and sets them to NULL in
places where new PCI buses are presently created. The following patch creates a
separate isolated address space:
[PATCH v5 04/18] pci: create and free isolated PCI buses

I could merge these patches if you prefer.

--
Jag

> 
>> +
>>     QLIST_HEAD(, PCIBus) child; /* this will be replaced by qdev later */
>>     QLIST_ENTRY(PCIBus) sibling;/* this will be replaced by qdev later */
>> 
>> diff --git a/hw/pci/pci.c b/hw/pci/pci.c
>> index 5d30f9ca60..d5f1c6c421 100644
>> --- a/hw/pci/pci.c
>> +++ b/hw/pci/pci.c
>> @@ -442,6 +442,8 @@ static void pci_root_bus_internal_init(PCIBus *bus, DeviceState *parent,
>>     bus->slot_reserved_mask = 0x0;
>>     bus->address_space_mem = address_space_mem;
>>     bus->address_space_io = address_space_io;
>> +    bus->isol_as_mem = NULL;
>> +    bus->isol_as_io = NULL;
>>     bus->flags |= PCI_BUS_IS_ROOT;
>> 
>>     /* host bridge */
>> @@ -2676,6 +2678,16 @@ MemoryRegion *pci_address_space_io(PCIDevice *dev)
>>     return pci_get_bus(dev)->address_space_io;
>> }
>> 
>> +AddressSpace *pci_isol_as_mem(PCIDevice *dev)
>> +{
>> +    return pci_get_bus(dev)->isol_as_mem;
>> +}
>> +
>> +AddressSpace *pci_isol_as_io(PCIDevice *dev)
>> +{
>> +    return pci_get_bus(dev)->isol_as_io;
>> +}
>> +
>> static void pci_device_class_init(ObjectClass *klass, void *data)
>> {
>>     DeviceClass *k = DEVICE_CLASS(klass);
>> @@ -2699,6 +2711,7 @@ static void pci_device_class_base_init(ObjectClass *klass, void *data)
>> 
>> AddressSpace *pci_device_iommu_address_space(PCIDevice *dev)
>> {
>> +    AddressSpace *iommu_as = NULL;
>>     PCIBus *bus = pci_get_bus(dev);
>>     PCIBus *iommu_bus = bus;
>>     uint8_t devfn = dev->devfn;
>> @@ -2745,6 +2758,10 @@ AddressSpace *pci_device_iommu_address_space(PCIDevice *dev)
>>     if (!pci_bus_bypass_iommu(bus) && iommu_bus && iommu_bus->iommu_fn) {
>>         return iommu_bus->iommu_fn(bus, iommu_bus->iommu_opaque, devfn);
>>     }
>> +    iommu_as = pci_isol_as_mem(dev);
>> +    if (iommu_as) {
>> +        return iommu_as;
>> +    }
>>     return &address_space_memory;
>> }
>> 
> 
>> diff --git a/hw/pci/pci_bridge.c b/hw/pci/pci_bridge.c
>> index da34c8ebcd..98366768d2 100644
>> --- a/hw/pci/pci_bridge.c
>> +++ b/hw/pci/pci_bridge.c
>> @@ -383,6 +383,11 @@ void pci_bridge_initfn(PCIDevice *dev, const char *typename)
>>     sec_bus->address_space_io = &br->address_space_io;
>>     memory_region_init(&br->address_space_io, OBJECT(br), "pci_bridge_io",
>>                        4 * GiB);
>> +
>> +    /* This PCI bridge puts the sec_bus in its parent's address space */
>> +    sec_bus->isol_as_mem = pci_isol_as_mem(dev);
>> +    sec_bus->isol_as_io = pci_isol_as_io(dev);
>> +
>>     br->windows = pci_bridge_region_init(br);
>>     QLIST_INIT(&sec_bus->child);
>>     QLIST_INSERT_HEAD(&parent->child, sec_bus, sibling);
>> -- 
>> 2.20.1
>> 


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH v5 04/18] pci: create and free isolated PCI buses
  2022-01-25 10:25   ` Stefan Hajnoczi
@ 2022-01-25 14:10     ` Jag Raman
  0 siblings, 0 replies; 99+ messages in thread
From: Jag Raman @ 2022-01-25 14:10 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: eduardo, Elena Ufimtseva, John Johnson, berrange, bleal,
	john.levon, mst, armbru, quintela, qemu-devel,
	Philippe Mathieu-Daudé,
	Marc-André Lureau, thanos.makatos, Paolo Bonzini, eblake,
	dgilbert



> On Jan 25, 2022, at 5:25 AM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
> 
> On Wed, Jan 19, 2022 at 04:41:53PM -0500, Jagannathan Raman wrote:
>> Adds pci_isol_bus_new() and pci_isol_bus_free() functions to manage
>> creation and destruction of isolated PCI buses. Also adds qdev_get_bus
>> and qdev_put_bus callbacks to allow the choice of parent bus.
>> 
>> Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
>> Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
>> Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
>> ---
>> include/hw/pci/pci.h   |   4 +
>> include/hw/qdev-core.h |  16 ++++
>> hw/pci/pci.c           | 169 +++++++++++++++++++++++++++++++++++++++++
>> softmmu/qdev-monitor.c |  39 +++++++++-
>> 4 files changed, 225 insertions(+), 3 deletions(-)
>> 
>> diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
>> index 9bb4472abc..8c18f10d9d 100644
>> --- a/include/hw/pci/pci.h
>> +++ b/include/hw/pci/pci.h
>> @@ -452,6 +452,10 @@ PCIDevice *pci_nic_init_nofail(NICInfo *nd, PCIBus *rootbus,
>> 
>> PCIDevice *pci_vga_init(PCIBus *bus);
>> 
>> +PCIBus *pci_isol_bus_new(BusState *parent_bus, const char *new_bus_type,
>> +                         Error **errp);
>> +bool pci_isol_bus_free(PCIBus *pci_bus, Error **errp);
>> +
>> static inline PCIBus *pci_get_bus(const PCIDevice *dev)
>> {
>>     return PCI_BUS(qdev_get_parent_bus(DEVICE(dev)));
>> diff --git a/include/hw/qdev-core.h b/include/hw/qdev-core.h
>> index 92c3d65208..eed2983072 100644
>> --- a/include/hw/qdev-core.h
>> +++ b/include/hw/qdev-core.h
>> @@ -419,6 +419,20 @@ void qdev_simple_device_unplug_cb(HotplugHandler *hotplug_dev,
>> void qdev_machine_creation_done(void);
>> bool qdev_machine_modified(void);
>> 
>> +/**
>> + * Find parent bus - these callbacks are used during device addition
>> + * and deletion.
>> + *
>> + * During addition, if no parent bus is specified in the options,
>> + * these callbacks provide a way to figure it out based on the
>> + * bus type. If these callbacks are not defined, defaults to
>> + * finding the parent bus starting from default system bus
>> + */
>> +typedef bool (QDevGetBusFunc)(const char *type, BusState **bus, Error **errp);
>> +typedef bool (QDevPutBusFunc)(BusState *bus, Error **errp);
>> +bool qdev_set_bus_cbs(QDevGetBusFunc *get_bus, QDevPutBusFunc *put_bus,
>> +                      Error **errp);
> 
> Where is this used, it doesn't seem related to pci_isol_bus_new()?

Yes, this is no directly related to pci_isol_bus_new() - will move it to a separate patch.

> 
>> +
>> /**
>>  * GpioPolarity: Polarity of a GPIO line
>>  *
>> @@ -691,6 +705,8 @@ BusState *qdev_get_parent_bus(DeviceState *dev);
>> /*** BUS API. ***/
>> 
>> DeviceState *qdev_find_recursive(BusState *bus, const char *id);
>> +BusState *qbus_find_recursive(BusState *bus, const char *name,
>> +                              const char *bus_typename);
>> 
>> /* Returns 0 to walk children, > 0 to skip walk, < 0 to terminate walk. */
>> typedef int (qbus_walkerfn)(BusState *bus, void *opaque);
>> diff --git a/hw/pci/pci.c b/hw/pci/pci.c
>> index d5f1c6c421..63ec1e47b5 100644
>> --- a/hw/pci/pci.c
>> +++ b/hw/pci/pci.c
>> @@ -493,6 +493,175 @@ void pci_root_bus_cleanup(PCIBus *bus)
>>     qbus_unrealize(BUS(bus));
>> }
>> 
>> +static void pci_bus_free_isol_mem(PCIBus *pci_bus)
>> +{
>> +    if (pci_bus->address_space_mem) {
>> +        memory_region_unref(pci_bus->address_space_mem);
> 
> memory_region_unref() already does a NULL pointer check so the if
> statements in this function aren't needed.

Got it, thank you!

> 
>> +        pci_bus->address_space_mem = NULL;
>> +    }
>> +
>> +    if (pci_bus->isol_as_mem) {
>> +        address_space_destroy(pci_bus->isol_as_mem);
>> +        pci_bus->isol_as_mem = NULL;
>> +    }
>> +
>> +    if (pci_bus->address_space_io) {
>> +        memory_region_unref(pci_bus->address_space_io);
>> +        pci_bus->address_space_io = NULL;
>> +    }
>> +
>> +    if (pci_bus->isol_as_io) {
>> +        address_space_destroy(pci_bus->isol_as_io);
>> +        pci_bus->isol_as_io = NULL;
>> +    }
>> +}
>> +
>> +static void pci_bus_init_isol_mem(PCIBus *pci_bus, uint32_t unique_id)
>> +{
>> +    g_autofree char *mem_mr_name = NULL;
>> +    g_autofree char *mem_as_name = NULL;
>> +    g_autofree char *io_mr_name = NULL;
>> +    g_autofree char *io_as_name = NULL;
>> +
>> +    if (!pci_bus) {
>> +        return;
>> +    }
>> +
>> +    mem_mr_name = g_strdup_printf("mem-mr-%u", unique_id);
>> +    mem_as_name = g_strdup_printf("mem-as-%u", unique_id);
>> +    io_mr_name = g_strdup_printf("io-mr-%u", unique_id);
>> +    io_as_name = g_strdup_printf("io-as-%u", unique_id);
>> +
>> +    pci_bus->address_space_mem = g_malloc0(sizeof(MemoryRegion));
>> +    pci_bus->isol_as_mem = g_malloc0(sizeof(AddressSpace));
>> +    memory_region_init(pci_bus->address_space_mem, NULL,
>> +                       mem_mr_name, UINT64_MAX);
>> +    address_space_init(pci_bus->isol_as_mem,
>> +                       pci_bus->address_space_mem, mem_as_name);
>> +
>> +    pci_bus->address_space_io = g_malloc0(sizeof(MemoryRegion));
>> +    pci_bus->isol_as_io = g_malloc0(sizeof(AddressSpace));
> 
> Where are address_space_mem, isol_as_mem, address_space_io, and
> isol_as_io freed? I think the unref calls won't free them because the
> objects were created with object_initialize() instead of object_new().

Ah OK got it, thank you! Will fix it.

I think we could set the owner of the the memory regions to the PCI bus,
as opposed to NULL. We could also add an ‘instance_finalize’ function to
PCI bus which would free them.

--
Jag

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH v5 03/18] pci: isolated address space for PCI bus
  2022-01-25 13:49     ` Jag Raman
@ 2022-01-25 14:19       ` Stefan Hajnoczi
  0 siblings, 0 replies; 99+ messages in thread
From: Stefan Hajnoczi @ 2022-01-25 14:19 UTC (permalink / raw)
  To: Jag Raman
  Cc: eduardo, Elena Ufimtseva, John Johnson, berrange, bleal,
	john.levon, mst, armbru, quintela, qemu-devel,
	Philippe Mathieu-Daudé,
	Marc-André Lureau, thanos.makatos, Paolo Bonzini, eblake,
	dgilbert

[-- Attachment #1: Type: text/plain, Size: 5058 bytes --]

On Tue, Jan 25, 2022 at 01:49:23PM +0000, Jag Raman wrote:
> 
> 
> > On Jan 25, 2022, at 4:56 AM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > 
> > On Wed, Jan 19, 2022 at 04:41:52PM -0500, Jagannathan Raman wrote:
> >> Allow PCI buses to be part of isolated CPU address spaces. This has a
> >> niche usage.
> >> 
> >> TYPE_REMOTE_MACHINE allows multiple VMs to house their PCI devices in
> >> the same machine/server. This would cause address space collision as
> >> well as be a security vulnerability. Having separate address spaces for
> >> each PCI bus would solve this problem.
> >> 
> >> Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
> >> Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
> >> Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
> >> ---
> >> include/hw/pci/pci.h     |  2 ++
> >> include/hw/pci/pci_bus.h | 17 +++++++++++++++++
> >> hw/pci/pci.c             | 17 +++++++++++++++++
> >> hw/pci/pci_bridge.c      |  5 +++++
> >> 4 files changed, 41 insertions(+)
> >> 
> >> diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
> >> index 023abc0f79..9bb4472abc 100644
> >> --- a/include/hw/pci/pci.h
> >> +++ b/include/hw/pci/pci.h
> >> @@ -387,6 +387,8 @@ void pci_device_save(PCIDevice *s, QEMUFile *f);
> >> int pci_device_load(PCIDevice *s, QEMUFile *f);
> >> MemoryRegion *pci_address_space(PCIDevice *dev);
> >> MemoryRegion *pci_address_space_io(PCIDevice *dev);
> >> +AddressSpace *pci_isol_as_mem(PCIDevice *dev);
> >> +AddressSpace *pci_isol_as_io(PCIDevice *dev);
> >> 
> >> /*
> >>  * Should not normally be used by devices. For use by sPAPR target
> >> diff --git a/include/hw/pci/pci_bus.h b/include/hw/pci/pci_bus.h
> >> index 347440d42c..d78258e79e 100644
> >> --- a/include/hw/pci/pci_bus.h
> >> +++ b/include/hw/pci/pci_bus.h
> >> @@ -39,9 +39,26 @@ struct PCIBus {
> >>     void *irq_opaque;
> >>     PCIDevice *devices[PCI_SLOT_MAX * PCI_FUNC_MAX];
> >>     PCIDevice *parent_dev;
> >> +
> >>     MemoryRegion *address_space_mem;
> >>     MemoryRegion *address_space_io;
> > 
> > This seems like a good point to rename address_space_mem,
> > address_space_io, as well as PCIIORegion->address_space since they are
> > all MemoryRegions and not AddressSpaces. Names could be
> > mem_space_mr/io_space_mr and PCIIORegion->container_mr. This avoids
> > confusion with the actual AddressSpaces that are introduced in this
> > patch.
> 
> Are you referring to renaming address_space_mem, address_space_io and
> PCIIORegion->address_space alone? I’m asking because there are many
> other symbols in the code which are named similarly.

I only see those symbols in hw/pci/pci.c. Which ones were you thinking
about?

> > 
> >> 
> >> +    /**
> >> +     * Isolated address spaces - these allow the PCI bus to be part
> >> +     * of an isolated address space as opposed to the global
> >> +     * address_space_memory & address_space_io. This allows the
> >> +     * bus to be attached to CPUs from different machines. The
> >> +     * following is not used used commonly.
> >> +     *
> >> +     * TYPE_REMOTE_MACHINE allows emulating devices from multiple
> >> +     * VM clients, as such it needs the PCI buses in the same machine
> >> +     * to be part of different CPU address spaces. The following is
> >> +     * useful in that scenario.
> >> +     *
> >> +     */
> >> +    AddressSpace *isol_as_mem;
> >> +    AddressSpace *isol_as_io;
> > 
> > Or use the pointers unconditionally and initialize them to the global
> > address_space_memory/address_space_io? That might simplify the code so
> > isolated address spaces is no longer a special case.
> 
> I did start off with using these pointers unconditionally - but adopted an optional
> isolated address space for the following reasons:
>   - There is a potential for regression
>   - CPU address space per bus is not a common scenario. In most case, all PCI
>     buses are attached to CPU sharing the same address space. Therefore, an
>     optional address space made sense for this special scenario
> 
> We can also set it unconditionally if you prefer, kindly confirm.

It's a matter of taste. I don't have a strong opinion on it but
personally I would try to make it unconditional. I think the risk of
regressions is low and the code complexity will be lower than making it
a special case. If you wanted to keep it as is, that's fine.

> 
> > 
> > isol_as_io isn't used by this patch?
> 
> This patch introduces these variables, defines its getters and sets them to NULL in
> places where new PCI buses are presently created. The following patch creates a
> separate isolated address space:
> [PATCH v5 04/18] pci: create and free isolated PCI buses
> 
> I could merge these patches if you prefer.

The only place I saw that reads isol_as_io is "[PATCH v5 15/18]
vfio-user: handle PCI BAR accesses", but that's for PCI I/O Space
accesses. Did I miss where I/O Space BARs are mapped into isol_as_io?

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH v5 09/18] vfio-user: define vfio-user-server object
  2022-01-19 21:41 ` [PATCH v5 09/18] vfio-user: define vfio-user-server object Jagannathan Raman
@ 2022-01-25 14:40   ` Stefan Hajnoczi
  0 siblings, 0 replies; 99+ messages in thread
From: Stefan Hajnoczi @ 2022-01-25 14:40 UTC (permalink / raw)
  To: Jagannathan Raman
  Cc: eduardo, elena.ufimtseva, john.g.johnson, berrange, bleal,
	john.levon, mst, armbru, quintela, qemu-devel, f4bug,
	marcandre.lureau, thanos.makatos, pbonzini, eblake, dgilbert

[-- Attachment #1: Type: text/plain, Size: 278 bytes --]

On Wed, Jan 19, 2022 at 04:41:58PM -0500, Jagannathan Raman wrote:
> +/**
> + * VFU_OBJECT_ERROR - reports an error message. If auto_shutdown
> + * is set, it abort the machine on error. Otherwise, it logs an

s/abort/aborts/

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH v5 05/18] qdev: unplug blocker for devices
  2022-01-25 10:27   ` Stefan Hajnoczi
@ 2022-01-25 14:43     ` Jag Raman
  2022-01-26  9:32       ` Stefan Hajnoczi
  0 siblings, 1 reply; 99+ messages in thread
From: Jag Raman @ 2022-01-25 14:43 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: eduardo, Elena Ufimtseva, John Johnson, berrange, bleal,
	john.levon, mst, armbru, quintela, qemu-devel,
	Philippe Mathieu-Daudé,
	Marc-André Lureau, thanos.makatos, Paolo Bonzini, eblake,
	dgilbert



> On Jan 25, 2022, at 5:27 AM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
> 
> On Wed, Jan 19, 2022 at 04:41:54PM -0500, Jagannathan Raman wrote:
>> Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
>> Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
>> Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
>> ---
>> include/hw/qdev-core.h |  5 +++++
>> softmmu/qdev-monitor.c | 35 +++++++++++++++++++++++++++++++++++
>> 2 files changed, 40 insertions(+)
>> 
>> diff --git a/include/hw/qdev-core.h b/include/hw/qdev-core.h
>> index eed2983072..67df5e0081 100644
>> --- a/include/hw/qdev-core.h
>> +++ b/include/hw/qdev-core.h
>> @@ -193,6 +193,7 @@ struct DeviceState {
>>     int instance_id_alias;
>>     int alias_required_for_version;
>>     ResettableState reset;
>> +    GSList *unplug_blockers;
>> };
>> 
>> struct DeviceListener {
>> @@ -433,6 +434,10 @@ typedef bool (QDevPutBusFunc)(BusState *bus, Error **errp);
>> bool qdev_set_bus_cbs(QDevGetBusFunc *get_bus, QDevPutBusFunc *put_bus,
>>                       Error **errp);
>> 
>> +int qdev_add_unplug_blocker(DeviceState *dev, Error *reason, Error **errp);
>> +void qdev_del_unplug_blocker(DeviceState *dev, Error *reason);
>> +bool qdev_unplug_blocked(DeviceState *dev, Error **errp);
>> +
>> /**
>>  * GpioPolarity: Polarity of a GPIO line
>>  *
>> diff --git a/softmmu/qdev-monitor.c b/softmmu/qdev-monitor.c
>> index 7306074019..1a169f89a2 100644
>> --- a/softmmu/qdev-monitor.c
>> +++ b/softmmu/qdev-monitor.c
>> @@ -978,10 +978,45 @@ void qmp_device_del(const char *id, Error **errp)
>>             return;
>>         }
>> 
>> +        if (qdev_unplug_blocked(dev, errp)) {
>> +            return;
>> +        }
>> +
>>         qdev_unplug(dev, errp);
>>     }
>> }
>> 
>> +int qdev_add_unplug_blocker(DeviceState *dev, Error *reason, Error **errp)
>> +{
>> +    ERRP_GUARD();
>> +
>> +    if (!migration_is_idle()) {
>> +        error_setg(errp, "migration is in progress");
>> +        return -EBUSY;
>> +    }
> 
> Why can this function not be called during migration?

Since ‘unplug_blockers' is a member of the device, I thought it wouldn’t be correct to
allow changes to the device's state during migration.

I did weigh the following reasons against adding this check:
  - unplug_blockers is not migrated to the destination anyway, so it doesn’t matter if
    it changes after migration starts
  - whichever code/object that needs to add the blocker could add it at the destination
    if needed

However, unlike qmp_device_add(), qmp_object_add() doesn’t reject during
migration. As such, an object could add a blocker for the device when migration is
in progress.

Would you prefer to throw a warning, or fully remove this test?

--
Jag 

> 
>> +
>> +    dev->unplug_blockers = g_slist_prepend(dev->unplug_blockers, reason);
>> +
>> +    return 0;
>> +}
>> +
>> +void qdev_del_unplug_blocker(DeviceState *dev, Error *reason)
>> +{
>> +    dev->unplug_blockers = g_slist_remove(dev->unplug_blockers, reason);
>> +}
>> +
>> +bool qdev_unplug_blocked(DeviceState *dev, Error **errp)
>> +{
>> +    ERRP_GUARD();
>> +
>> +    if (dev->unplug_blockers) {
>> +        error_propagate(errp, error_copy(dev->unplug_blockers->data));
>> +        return true;
>> +    }
>> +
>> +    return false;
>> +}
>> +
>> void hmp_device_add(Monitor *mon, const QDict *qdict)
>> {
>>     Error *err = NULL;
>> -- 
>> 2.20.1
>> 


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH v5 10/18] vfio-user: instantiate vfio-user context
  2022-01-19 21:41 ` [PATCH v5 10/18] vfio-user: instantiate vfio-user context Jagannathan Raman
@ 2022-01-25 14:44   ` Stefan Hajnoczi
  0 siblings, 0 replies; 99+ messages in thread
From: Stefan Hajnoczi @ 2022-01-25 14:44 UTC (permalink / raw)
  To: Jagannathan Raman
  Cc: eduardo, elena.ufimtseva, john.g.johnson, berrange, bleal,
	john.levon, mst, armbru, quintela, qemu-devel, f4bug,
	marcandre.lureau, thanos.makatos, pbonzini, eblake, dgilbert

[-- Attachment #1: Type: text/plain, Size: 4860 bytes --]

On Wed, Jan 19, 2022 at 04:41:59PM -0500, Jagannathan Raman wrote:
> create a context with the vfio-user library to run a PCI device
> 
> Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
> Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
> Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
> ---
>  hw/remote/vfio-user-obj.c | 78 +++++++++++++++++++++++++++++++++++++++
>  1 file changed, 78 insertions(+)
> 
> diff --git a/hw/remote/vfio-user-obj.c b/hw/remote/vfio-user-obj.c
> index 80757b0029..810a7c3943 100644
> --- a/hw/remote/vfio-user-obj.c
> +++ b/hw/remote/vfio-user-obj.c
> @@ -41,6 +41,9 @@
>  #include "hw/remote/machine.h"
>  #include "qapi/error.h"
>  #include "qapi/qapi-visit-sockets.h"
> +#include "qemu/notify.h"
> +#include "sysemu/sysemu.h"
> +#include "libvfio-user.h"
>  
>  #define TYPE_VFU_OBJECT "x-vfio-user-server"
>  OBJECT_DECLARE_TYPE(VfuObject, VfuObjectClass, VFU_OBJECT)
> @@ -82,13 +85,23 @@ struct VfuObject {
>      char *device;
>  
>      Error *err;
> +
> +    Notifier machine_done;
> +
> +    vfu_ctx_t *vfu_ctx;
>  };
>  
> +static void vfu_object_init_ctx(VfuObject *o, Error **errp);
> +
>  static void vfu_object_set_socket(Object *obj, Visitor *v, const char *name,
>                                    void *opaque, Error **errp)
>  {
>      VfuObject *o = VFU_OBJECT(obj);
>  
> +    if (o->vfu_ctx) {
> +        return;
> +    }

No error?

> +
>      qapi_free_SocketAddress(o->socket);
>  
>      o->socket = NULL;
> @@ -104,17 +117,68 @@ static void vfu_object_set_socket(Object *obj, Visitor *v, const char *name,
>      }
>  
>      trace_vfu_prop("socket", o->socket->u.q_unix.path);
> +
> +    vfu_object_init_ctx(o, errp);
>  }
>  
>  static void vfu_object_set_device(Object *obj, const char *str, Error **errp)
>  {
>      VfuObject *o = VFU_OBJECT(obj);
>  
> +    if (o->vfu_ctx) {
> +        return;
> +    }

No error?

> +
>      g_free(o->device);
>  
>      o->device = g_strdup(str);
>  
>      trace_vfu_prop("device", str);
> +
> +    vfu_object_init_ctx(o, errp);
> +}
> +
> +/*
> + * TYPE_VFU_OBJECT depends on the availability of the 'socket' and 'device'
> + * properties. It also depends on devices instantiated in QEMU. These
> + * dependencies are not available during the instance_init phase of this
> + * object's life-cycle. As such, the server is initialized after the
> + * machine is setup. machine_init_done_notifier notifies TYPE_VFU_OBJECT
> + * when the machine is setup, and the dependencies are available.
> + */
> +static void vfu_object_machine_done(Notifier *notifier, void *data)
> +{
> +    VfuObject *o = container_of(notifier, VfuObject, machine_done);
> +    Error *err = NULL;
> +
> +    vfu_object_init_ctx(o, &err);
> +
> +    if (err) {
> +        error_propagate(&error_abort, err);
> +    }
> +}
> +
> +static void vfu_object_init_ctx(VfuObject *o, Error **errp)
> +{
> +    ERRP_GUARD();
> +
> +    if (o->vfu_ctx || !o->socket || !o->device ||
> +            !phase_check(PHASE_MACHINE_READY)) {
> +        return;
> +    }
> +
> +    if (o->err) {
> +        error_propagate(errp, o->err);
> +        o->err = NULL;
> +        return;
> +    }
> +
> +    o->vfu_ctx = vfu_create_ctx(VFU_TRANS_SOCK, o->socket->u.q_unix.path, 0,
> +                                o, VFU_DEV_TYPE_PCI);
> +    if (o->vfu_ctx == NULL) {
> +        error_setg(errp, "vfu: Failed to create context - %s", strerror(errno));
> +        return;
> +    }
>  }
>  
>  static void vfu_object_init(Object *obj)
> @@ -124,6 +188,11 @@ static void vfu_object_init(Object *obj)
>  
>      k->nr_devs++;
>  
> +    if (!phase_check(PHASE_MACHINE_READY)) {
> +        o->machine_done.notify = vfu_object_machine_done;
> +        qemu_add_machine_init_done_notifier(&o->machine_done);
> +    }
> +
>      if (!object_dynamic_cast(OBJECT(current_machine), TYPE_REMOTE_MACHINE)) {
>          error_setg(&o->err, "vfu: %s only compatible with %s machine",
>                     TYPE_VFU_OBJECT, TYPE_REMOTE_MACHINE);
> @@ -142,6 +211,10 @@ static void vfu_object_finalize(Object *obj)
>  
>      o->socket = NULL;
>  
> +    if (o->vfu_ctx) {
> +        vfu_destroy_ctx(o->vfu_ctx);
> +    }
> +
>      g_free(o->device);
>  
>      o->device = NULL;
> @@ -149,6 +222,11 @@ static void vfu_object_finalize(Object *obj)
>      if (!k->nr_devs && k->auto_shutdown) {
>          qemu_system_shutdown_request(SHUTDOWN_CAUSE_GUEST_SHUTDOWN);
>      }
> +
> +    if (o->machine_done.notify) {
> +        qemu_remove_machine_init_done_notifier(&o->machine_done);
> +        o->machine_done.notify = NULL;
> +    }
>  }
>  
>  static void vfu_object_class_init(ObjectClass *klass, void *data)
> -- 
> 2.20.1
> 

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH v5 11/18] vfio-user: find and init PCI device
  2022-01-19 21:42 ` [PATCH v5 11/18] vfio-user: find and init PCI device Jagannathan Raman
@ 2022-01-25 14:48   ` Stefan Hajnoczi
  2022-01-26  3:14     ` Jag Raman
  0 siblings, 1 reply; 99+ messages in thread
From: Stefan Hajnoczi @ 2022-01-25 14:48 UTC (permalink / raw)
  To: Jagannathan Raman
  Cc: eduardo, elena.ufimtseva, john.g.johnson, berrange, bleal,
	john.levon, mst, armbru, quintela, qemu-devel, f4bug,
	marcandre.lureau, thanos.makatos, pbonzini, eblake, dgilbert

[-- Attachment #1: Type: text/plain, Size: 2837 bytes --]

On Wed, Jan 19, 2022 at 04:42:00PM -0500, Jagannathan Raman wrote:
> Find the PCI device with specified id. Initialize the device context
> with the QEMU PCI device
> 
> Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
> Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
> Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
> ---
>  hw/remote/vfio-user-obj.c | 60 +++++++++++++++++++++++++++++++++++++++
>  1 file changed, 60 insertions(+)
> 
> diff --git a/hw/remote/vfio-user-obj.c b/hw/remote/vfio-user-obj.c
> index 810a7c3943..10db78eb8d 100644
> --- a/hw/remote/vfio-user-obj.c
> +++ b/hw/remote/vfio-user-obj.c
> @@ -44,6 +44,8 @@
>  #include "qemu/notify.h"
>  #include "sysemu/sysemu.h"
>  #include "libvfio-user.h"
> +#include "hw/qdev-core.h"
> +#include "hw/pci/pci.h"
>  
>  #define TYPE_VFU_OBJECT "x-vfio-user-server"
>  OBJECT_DECLARE_TYPE(VfuObject, VfuObjectClass, VFU_OBJECT)
> @@ -89,6 +91,10 @@ struct VfuObject {
>      Notifier machine_done;
>  
>      vfu_ctx_t *vfu_ctx;
> +
> +    PCIDevice *pci_dev;
> +
> +    Error *unplug_blocker;
>  };
>  
>  static void vfu_object_init_ctx(VfuObject *o, Error **errp);
> @@ -161,6 +167,9 @@ static void vfu_object_machine_done(Notifier *notifier, void *data)
>  static void vfu_object_init_ctx(VfuObject *o, Error **errp)
>  {
>      ERRP_GUARD();
> +    DeviceState *dev = NULL;
> +    vfu_pci_type_t pci_type = VFU_PCI_TYPE_CONVENTIONAL;
> +    int ret;
>  
>      if (o->vfu_ctx || !o->socket || !o->device ||
>              !phase_check(PHASE_MACHINE_READY)) {
> @@ -179,6 +188,49 @@ static void vfu_object_init_ctx(VfuObject *o, Error **errp)
>          error_setg(errp, "vfu: Failed to create context - %s", strerror(errno));
>          return;
>      }
> +
> +    dev = qdev_find_recursive(sysbus_get_default(), o->device);
> +    if (dev == NULL) {
> +        error_setg(errp, "vfu: Device %s not found", o->device);
> +        goto fail;
> +    }
> +
> +    if (!object_dynamic_cast(OBJECT(dev), TYPE_PCI_DEVICE)) {
> +        error_setg(errp, "vfu: %s not a PCI device", o->device);
> +        goto fail;
> +    }
> +
> +    o->pci_dev = PCI_DEVICE(dev);
> +
> +    if (pci_is_express(o->pci_dev)) {
> +        pci_type = VFU_PCI_TYPE_EXPRESS;
> +    }
> +
> +    ret = vfu_pci_init(o->vfu_ctx, pci_type, PCI_HEADER_TYPE_NORMAL, 0);
> +    if (ret < 0) {
> +        error_setg(errp,
> +                   "vfu: Failed to attach PCI device %s to context - %s",
> +                   o->device, strerror(errno));
> +        goto fail;
> +    }
> +
> +    error_setg(&o->unplug_blocker, "%s is in use", o->device);

More detailed error message:
"x-vfio-user-server for %s must be deleted before unplugging"

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH v5 12/18] vfio-user: run vfio-user context
  2022-01-19 21:42 ` [PATCH v5 12/18] vfio-user: run vfio-user context Jagannathan Raman
@ 2022-01-25 15:10   ` Stefan Hajnoczi
  2022-01-26  3:26     ` Jag Raman
  0 siblings, 1 reply; 99+ messages in thread
From: Stefan Hajnoczi @ 2022-01-25 15:10 UTC (permalink / raw)
  To: Jagannathan Raman
  Cc: eduardo, elena.ufimtseva, john.g.johnson, berrange, bleal,
	john.levon, mst, armbru, quintela, qemu-devel, f4bug,
	marcandre.lureau, thanos.makatos, pbonzini, eblake, dgilbert

[-- Attachment #1: Type: text/plain, Size: 3849 bytes --]

On Wed, Jan 19, 2022 at 04:42:01PM -0500, Jagannathan Raman wrote:
> Setup a handler to run vfio-user context. The context is driven by
> messages to the file descriptor associated with it - get the fd for
> the context and hook up the handler with it
> 
> Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
> Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
> Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
> ---
>  qapi/misc.json            | 23 ++++++++++
>  hw/remote/vfio-user-obj.c | 90 ++++++++++++++++++++++++++++++++++++++-
>  2 files changed, 112 insertions(+), 1 deletion(-)
> 
> diff --git a/qapi/misc.json b/qapi/misc.json
> index e8054f415b..f0791d3311 100644
> --- a/qapi/misc.json
> +++ b/qapi/misc.json
> @@ -527,3 +527,26 @@
>   'data': { '*option': 'str' },
>   'returns': ['CommandLineOptionInfo'],
>   'allow-preconfig': true }
> +
> +##
> +# @VFU_CLIENT_HANGUP:
> +#
> +# Emitted when the client of a TYPE_VFIO_USER_SERVER closes the
> +# communication channel
> +#
> +# @device: ID of attached PCI device
> +#
> +# @path: path of the socket

This assumes a UNIX domain socket path was given. It doesn't work well
with file descriptor passing. The x-vfio-user-server is an object with
a unique QEMU Object Model path (the last path component is its id). You
can get the id like this:

  object_get_canonical_path_component(OBJECT(o))

I suggest dropping @path and including the server object's id instead.

> +#
> +# Since: 6.3
> +#
> +# Example:
> +#
> +# <- { "event": "VFU_CLIENT_HANGUP",
> +#      "data": { "device": "lsi1",
> +#                "path": "/tmp/vfu1-sock" },
> +#      "timestamp": { "seconds": 1265044230, "microseconds": 450486 } }
> +#
> +##
> +{ 'event': 'VFU_CLIENT_HANGUP',
> +  'data': { 'device': 'str', 'path': 'str' } }
> diff --git a/hw/remote/vfio-user-obj.c b/hw/remote/vfio-user-obj.c
> index 10db78eb8d..91d49a221f 100644
> --- a/hw/remote/vfio-user-obj.c
> +++ b/hw/remote/vfio-user-obj.c
> @@ -27,6 +27,9 @@
>   *
>   * device - id of a device on the server, a required option. PCI devices
>   *          alone are supported presently.
> + *
> + * notes - x-vfio-user-server could block IO and monitor during the
> + *         initialization phase.
>   */
>  
>  #include "qemu/osdep.h"
> @@ -41,11 +44,14 @@
>  #include "hw/remote/machine.h"
>  #include "qapi/error.h"
>  #include "qapi/qapi-visit-sockets.h"
> +#include "qapi/qapi-events-misc.h"
>  #include "qemu/notify.h"
> +#include "qemu/thread.h"
>  #include "sysemu/sysemu.h"
>  #include "libvfio-user.h"
>  #include "hw/qdev-core.h"
>  #include "hw/pci/pci.h"
> +#include "qemu/timer.h"
>  
>  #define TYPE_VFU_OBJECT "x-vfio-user-server"
>  OBJECT_DECLARE_TYPE(VfuObject, VfuObjectClass, VFU_OBJECT)
> @@ -95,6 +101,8 @@ struct VfuObject {
>      PCIDevice *pci_dev;
>  
>      Error *unplug_blocker;
> +
> +    int vfu_poll_fd;
>  };
>  
>  static void vfu_object_init_ctx(VfuObject *o, Error **errp);
> @@ -144,6 +152,68 @@ static void vfu_object_set_device(Object *obj, const char *str, Error **errp)
>      vfu_object_init_ctx(o, errp);
>  }
>  
> +static void vfu_object_ctx_run(void *opaque)
> +{
> +    VfuObject *o = opaque;
> +    int ret = -1;
> +
> +    while (ret != 0) {
> +        ret = vfu_run_ctx(o->vfu_ctx);
> +        if (ret < 0) {
> +            if (errno == EINTR) {
> +                continue;
> +            } else if (errno == ENOTCONN) {
> +                qapi_event_send_vfu_client_hangup(o->device,
> +                                                  o->socket->u.q_unix.path);
> +                qemu_set_fd_handler(o->vfu_poll_fd, NULL, NULL, NULL);

Do we also stop monitoring o->vfu_poll_fd when object-del is used to
delete the x-vfio-user-server object?

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH v5 13/18] vfio-user: handle PCI config space accesses
  2022-01-19 21:42 ` [PATCH v5 13/18] vfio-user: handle PCI config space accesses Jagannathan Raman
@ 2022-01-25 15:13   ` Stefan Hajnoczi
  0 siblings, 0 replies; 99+ messages in thread
From: Stefan Hajnoczi @ 2022-01-25 15:13 UTC (permalink / raw)
  To: Jagannathan Raman
  Cc: eduardo, elena.ufimtseva, john.g.johnson, berrange, bleal,
	john.levon, mst, armbru, quintela, qemu-devel, f4bug,
	marcandre.lureau, thanos.makatos, pbonzini, eblake, dgilbert

[-- Attachment #1: Type: text/plain, Size: 529 bytes --]

On Wed, Jan 19, 2022 at 04:42:02PM -0500, Jagannathan Raman wrote:
> Define and register handlers for PCI config space accesses
> 
> Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
> Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
> Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
> ---
>  hw/remote/vfio-user-obj.c | 45 +++++++++++++++++++++++++++++++++++++++
>  hw/remote/trace-events    |  2 ++
>  2 files changed, 47 insertions(+)

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH v5 16/18] vfio-user: handle device interrupts
  2022-01-19 21:42 ` [PATCH v5 16/18] vfio-user: handle device interrupts Jagannathan Raman
@ 2022-01-25 15:25   ` Stefan Hajnoczi
  0 siblings, 0 replies; 99+ messages in thread
From: Stefan Hajnoczi @ 2022-01-25 15:25 UTC (permalink / raw)
  To: Jagannathan Raman
  Cc: eduardo, elena.ufimtseva, john.g.johnson, berrange, bleal,
	john.levon, mst, armbru, quintela, qemu-devel, f4bug,
	marcandre.lureau, thanos.makatos, pbonzini, eblake, dgilbert

[-- Attachment #1: Type: text/plain, Size: 657 bytes --]

On Wed, Jan 19, 2022 at 04:42:05PM -0500, Jagannathan Raman wrote:
> Forward remote device's interrupts to the guest
> 
> Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
> Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
> Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
> ---
>  include/hw/pci/pci.h      |  6 +++
>  hw/pci/msi.c              | 13 +++++-
>  hw/pci/msix.c             | 12 +++++-
>  hw/remote/vfio-user-obj.c | 89 +++++++++++++++++++++++++++++++++++++++
>  hw/remote/trace-events    |  1 +
>  5 files changed, 119 insertions(+), 2 deletions(-)

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH v5 17/18] vfio-user: register handlers to facilitate migration
  2022-01-19 21:42 ` [PATCH v5 17/18] vfio-user: register handlers to facilitate migration Jagannathan Raman
@ 2022-01-25 15:48   ` Stefan Hajnoczi
  2022-01-27 17:04     ` Jag Raman
  0 siblings, 1 reply; 99+ messages in thread
From: Stefan Hajnoczi @ 2022-01-25 15:48 UTC (permalink / raw)
  To: Jagannathan Raman
  Cc: eduardo, elena.ufimtseva, john.g.johnson, berrange, bleal,
	john.levon, mst, armbru, quintela, qemu-devel, f4bug,
	marcandre.lureau, thanos.makatos, pbonzini, eblake, dgilbert

[-- Attachment #1: Type: text/plain, Size: 3070 bytes --]

On Wed, Jan 19, 2022 at 04:42:06PM -0500, Jagannathan Raman wrote:
> +     * The client subsequetly asks the remote server for any data that

subsequently

> +static void vfu_mig_state_running(vfu_ctx_t *vfu_ctx)
> +{
> +    VfuObject *o = vfu_get_private(vfu_ctx);
> +    VfuObjectClass *k = VFU_OBJECT_GET_CLASS(OBJECT(o));
> +    static int migrated_devs;
> +    Error *local_err = NULL;
> +    int ret;
> +
> +    /**
> +     * TODO: move to VFU_MIGR_STATE_RESUME handler. Presently, the
> +     * VMSD data from source is not available at RESUME state.
> +     * Working on a fix for this.
> +     */
> +    if (!o->vfu_mig_file) {
> +        o->vfu_mig_file = qemu_fopen_ops(o, &vfu_mig_fops_load, false);
> +    }
> +
> +    ret = qemu_remote_loadvm(o->vfu_mig_file);
> +    if (ret) {
> +        VFU_OBJECT_ERROR(o, "vfu: failed to restore device state");
> +        return;
> +    }
> +
> +    qemu_file_shutdown(o->vfu_mig_file);
> +    o->vfu_mig_file = NULL;
> +
> +    /* VFU_MIGR_STATE_RUNNING begins here */
> +    if (++migrated_devs == k->nr_devs) {

When is this counter reset so migration can be tried again if it
fails/cancels?

> +static ssize_t vfu_mig_read_data(vfu_ctx_t *vfu_ctx, void *buf,
> +                                 uint64_t size, uint64_t offset)
> +{
> +    VfuObject *o = vfu_get_private(vfu_ctx);
> +
> +    if (offset > o->vfu_mig_buf_size) {
> +        return -1;
> +    }
> +
> +    if ((offset + size) > o->vfu_mig_buf_size) {
> +        warn_report("vfu: buffer overflow - check pending_bytes");
> +        size = o->vfu_mig_buf_size - offset;
> +    }
> +
> +    memcpy(buf, (o->vfu_mig_buf + offset), size);
> +
> +    o->vfu_mig_buf_pending -= size;

This assumes that the caller increments offset by size each time. If
that assumption is okay, then we can just trust offset and don't need to
do arithmetic on vfu_mig_buf_pending. If that assumption is not correct,
then the code needs to be extended to safely update vfu_mig_buf_pending
when offset jumps around arbitrarily between calls.

> +uint64_t vmstate_vmsd_size(PCIDevice *pci_dev)
> +{
> +    DeviceClass *dc = DEVICE_GET_CLASS(DEVICE(pci_dev));
> +    const VMStateField *field = NULL;
> +    uint64_t size = 0;
> +
> +    if (!dc->vmsd) {
> +        return 0;
> +    }
> +
> +    field = dc->vmsd->fields;
> +    while (field && field->name) {
> +        size += vmstate_size(pci_dev, field);
> +        field++;
> +    }
> +
> +    return size;
> +}

This function looks incorrect because it ignores subsections as well as
runtime behavior during save(). Although VMStateDescription is partially
declarative, there is still a bunch of imperative code that can write to
the QEMUFile at save() time so there's no way of knowing the size ahead
of time.

I asked this in a previous revision of this series but I'm not sure if
it was answered: is it really necessary to know the size of the vmstate?
I thought the VFIO migration interface is designed to support
streaming reads/writes. We could choose a fixed size like 64KB and
stream the vmstate in 64KB chunks.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH v5 00/18] vfio-user server in QEMU
  2022-01-19 21:41 [PATCH v5 00/18] vfio-user server in QEMU Jagannathan Raman
                   ` (17 preceding siblings ...)
  2022-01-19 21:42 ` [PATCH v5 18/18] vfio-user: avocado tests for vfio-user Jagannathan Raman
@ 2022-01-25 16:00 ` Stefan Hajnoczi
  2022-01-26  5:04   ` Jag Raman
  18 siblings, 1 reply; 99+ messages in thread
From: Stefan Hajnoczi @ 2022-01-25 16:00 UTC (permalink / raw)
  To: Jagannathan Raman
  Cc: eduardo, elena.ufimtseva, john.g.johnson, berrange, bleal,
	john.levon, mst, peter.maydell, armbru, quintela, qemu-devel,
	f4bug, marcandre.lureau, thanos.makatos, pbonzini, kwolf, eblake,
	dgilbert

[-- Attachment #1: Type: text/plain, Size: 1644 bytes --]

Hi Jag,
Thanks for this latest revision. The biggest outstanding question I have
is about the isolated address spaces design.

This patch series needs a PCIBus with its own Memory Space, I/O Space,
and interrupts. That way a single QEMU process can host vfio-user
servers that different VMs connect to. They all need isolated address
spaces so that mapping a BAR in Device A does not conflict with mapping
a BAR in Device B.

The current approach adds special code to hw/pci/pci.c so that custom
AddressSpace can be set up. The isolated PCIBus is an automatically
created PCIe root port that's a child of the machine's main PCI bus. On
one hand it's neat because QEMU's assumption that there is only one
root SysBus isn't violated. On the other hand it seems like a special
case hack for PCI and I'm not sure in what sense these PCIBusses are
really children of the machine's main PCI bus since they don't share or
interact in any way.

Another approach that came to mind is to allow multiple root SysBusses.
Each vfio-user server would need its own SysBus and put a regular PCI
host onto that isolated SysBus without modifying hw/pci/pci.c with a
special case. The downside to this is that violating the single SysBus
assumption probably breaks monitor commands that rely on
qdev_find_recursive() and friends. It seems cleaner than adding isolated
address spaces to PCI specifically, but also raises the question if
multiple machine instances are needed (which would raise even more
questions).

I wanted to raise this to see if Peter, Kevin, Michael, and others are
happy with the current approach or have ideas for a clean solution.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH v5 06/18] vfio-user: add HotplugHandler for remote machine
  2022-01-25 10:32   ` Stefan Hajnoczi
@ 2022-01-25 18:12     ` Jag Raman
  2022-01-26  9:35       ` Stefan Hajnoczi
  0 siblings, 1 reply; 99+ messages in thread
From: Jag Raman @ 2022-01-25 18:12 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: eduardo, Elena Ufimtseva, John Johnson, berrange, bleal,
	john.levon, mst, armbru, quintela, qemu-devel,
	Philippe Mathieu-Daudé,
	Marc-André Lureau, thanos.makatos, Paolo Bonzini, eblake,
	dgilbert



> On Jan 25, 2022, at 5:32 AM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
> 
> On Wed, Jan 19, 2022 at 04:41:55PM -0500, Jagannathan Raman wrote:
>> Allow hotplugging of PCI(e) devices to remote machine
>> 
>> Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
>> Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
>> Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
>> ---
>> hw/remote/machine.c | 29 +++++++++++++++++++++++++++++
>> 1 file changed, 29 insertions(+)
> 
> Why is this code necessary? I expected the default hotplug behavior to

I just discovered that TYPE_REMOTE_MACHINE wasn't setting up a hotplug
handler for the root PCI bus.

Looks like, some of the machines don’t support hotplugging PCI devices. I see
that the ‘pc’ machine does support hotplug, whereas ‘q35’ does not.

We didn’t check hotplug in multiprocess-qemu previously because it was limited
to one device per process, and the use cases attached the devices via
command line.

> pretty much handle this case - hotplugging device types that the bus
> doesn't support should fail and unplug should already unparent/unrealize
> the device.

OK, that makes sense. We don’t need to test the device type during
plug and unplug.

Therefore, I don’t think we need a callback for the plug operation. We
could set HotplugHandlerClass->unplug callback to the default
qdev_simple_device_unplug_cb() callback.

—
Jag

> 
>> 
>> diff --git a/hw/remote/machine.c b/hw/remote/machine.c
>> index 952105eab5..220ff01aa9 100644
>> --- a/hw/remote/machine.c
>> +++ b/hw/remote/machine.c
>> @@ -54,14 +54,39 @@ static void remote_machine_init(MachineState *machine)
>> 
>>     pci_bus_irqs(pci_host->bus, remote_iohub_set_irq, remote_iohub_map_irq,
>>                  &s->iohub, REMOTE_IOHUB_NB_PIRQS);
>> +
>> +    qbus_set_hotplug_handler(BUS(pci_host->bus), OBJECT(s));
>> +}
>> +
>> +static void remote_machine_pre_plug_cb(HotplugHandler *hotplug_dev,
>> +                                       DeviceState *dev, Error **errp)
>> +{
>> +    if (!object_dynamic_cast(OBJECT(dev), TYPE_PCI_DEVICE)) {
>> +        error_setg(errp, "Only allowing PCI hotplug");
>> +    }
>> +}
>> +
>> +static void remote_machine_unplug_cb(HotplugHandler *hotplug_dev,
>> +                                     DeviceState *dev, Error **errp)
>> +{
>> +    if (!object_dynamic_cast(OBJECT(dev), TYPE_PCI_DEVICE)) {
>> +        error_setg(errp, "Only allowing PCI hot-unplug");
>> +        return;
>> +    }
>> +
>> +    qdev_unrealize(dev);
>> }
>> 
>> static void remote_machine_class_init(ObjectClass *oc, void *data)
>> {
>>     MachineClass *mc = MACHINE_CLASS(oc);
>> +    HotplugHandlerClass *hc = HOTPLUG_HANDLER_CLASS(oc);
>> 
>>     mc->init = remote_machine_init;
>>     mc->desc = "Experimental remote machine";
>> +
>> +    hc->pre_plug = remote_machine_pre_plug_cb;
>> +    hc->unplug = remote_machine_unplug_cb;
>> }
>> 
>> static const TypeInfo remote_machine = {
>> @@ -69,6 +94,10 @@ static const TypeInfo remote_machine = {
>>     .parent = TYPE_MACHINE,
>>     .instance_size = sizeof(RemoteMachineState),
>>     .class_init = remote_machine_class_init,
>> +    .interfaces = (InterfaceInfo[]) {
>> +        { TYPE_HOTPLUG_HANDLER },
>> +        { }
>> +    }
>> };
>> 
>> static void remote_machine_register_types(void)
>> -- 
>> 2.20.1
>> 


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH v5 03/18] pci: isolated address space for PCI bus
  2022-01-20 15:20     ` Jag Raman
@ 2022-01-25 18:38       ` Dr. David Alan Gilbert
  2022-01-26  5:27         ` Jag Raman
  0 siblings, 1 reply; 99+ messages in thread
From: Dr. David Alan Gilbert @ 2022-01-25 18:38 UTC (permalink / raw)
  To: Jag Raman
  Cc: eduardo, Elena Ufimtseva, John Johnson, Daniel P. Berrangé,
	Beraldo Leal, john.levon, Michael S. Tsirkin, armbru, quintela,
	Philippe Mathieu-Daudé,
	qemu-devel, Marc-André Lureau, Stefan Hajnoczi,
	thanos.makatos, Paolo Bonzini, Eric Blake

* Jag Raman (jag.raman@oracle.com) wrote:
> 
> 
> > On Jan 19, 2022, at 7:12 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> > 
> > On Wed, Jan 19, 2022 at 04:41:52PM -0500, Jagannathan Raman wrote:
> >> Allow PCI buses to be part of isolated CPU address spaces. This has a
> >> niche usage.
> >> 
> >> TYPE_REMOTE_MACHINE allows multiple VMs to house their PCI devices in
> >> the same machine/server. This would cause address space collision as
> >> well as be a security vulnerability. Having separate address spaces for
> >> each PCI bus would solve this problem.
> > 
> > Fascinating, but I am not sure I understand. any examples?
> 
> Hi Michael!
> 
> multiprocess QEMU and vfio-user implement a client-server model to allow
> out-of-process emulation of devices. The client QEMU, which makes ioctls
> to the kernel and runs VCPUs, could attach devices running in a server
> QEMU. The server QEMU needs access to parts of the client’s RAM to
> perform DMA.

Do you ever have the opposite problem? i.e. when an emulated PCI device
exposes a chunk of RAM-like space (frame buffer, or maybe a mapped file)
that the client can see.  What happens if two emulated devices need to
access each others emulated address space?

Dave

> In the case where multiple clients attach devices that are running on the
> same server, we need to ensure that each devices has isolated memory
> ranges. This ensures that the memory space of one device is not visible
> to other devices in the same server.
>  
> > 
> > I also wonder whether this special type could be modelled like a special
> > kind of iommu internally.
> 
> Could you please provide some more details on the design?
> 
> > 
> >> Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
> >> Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
> >> Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
> >> ---
> >> include/hw/pci/pci.h     |  2 ++
> >> include/hw/pci/pci_bus.h | 17 +++++++++++++++++
> >> hw/pci/pci.c             | 17 +++++++++++++++++
> >> hw/pci/pci_bridge.c      |  5 +++++
> >> 4 files changed, 41 insertions(+)
> >> 
> >> diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
> >> index 023abc0f79..9bb4472abc 100644
> >> --- a/include/hw/pci/pci.h
> >> +++ b/include/hw/pci/pci.h
> >> @@ -387,6 +387,8 @@ void pci_device_save(PCIDevice *s, QEMUFile *f);
> >> int pci_device_load(PCIDevice *s, QEMUFile *f);
> >> MemoryRegion *pci_address_space(PCIDevice *dev);
> >> MemoryRegion *pci_address_space_io(PCIDevice *dev);
> >> +AddressSpace *pci_isol_as_mem(PCIDevice *dev);
> >> +AddressSpace *pci_isol_as_io(PCIDevice *dev);
> >> 
> >> /*
> >>  * Should not normally be used by devices. For use by sPAPR target
> >> diff --git a/include/hw/pci/pci_bus.h b/include/hw/pci/pci_bus.h
> >> index 347440d42c..d78258e79e 100644
> >> --- a/include/hw/pci/pci_bus.h
> >> +++ b/include/hw/pci/pci_bus.h
> >> @@ -39,9 +39,26 @@ struct PCIBus {
> >>     void *irq_opaque;
> >>     PCIDevice *devices[PCI_SLOT_MAX * PCI_FUNC_MAX];
> >>     PCIDevice *parent_dev;
> >> +
> >>     MemoryRegion *address_space_mem;
> >>     MemoryRegion *address_space_io;
> >> 
> >> +    /**
> >> +     * Isolated address spaces - these allow the PCI bus to be part
> >> +     * of an isolated address space as opposed to the global
> >> +     * address_space_memory & address_space_io.
> > 
> > Are you sure address_space_memory & address_space_io are
> > always global? even in the case of an iommu?
> 
> On the CPU side of the Root Complex, I believe address_space_memory
> & address_space_io are global.
> 
> In the vfio-user case, devices on the same machine (TYPE_REMOTE_MACHINE)
> could be attached to different clients VMs. Each client would have their own address
> space for their CPUs. With isolated address spaces, we ensure that the devices
> see the address space of the CPUs they’re attached to.
> 
> Not sure if it’s OK to share weblinks in this mailing list, please let me know if that’s
> not preferred. But I’m referring to the terminology used in the following block diagram:
> https://en.wikipedia.org/wiki/Root_complex#/media/File:Example_PCI_Express_Topology.svg
> 
> > 
> >> This allows the
> >> +     * bus to be attached to CPUs from different machines. The
> >> +     * following is not used used commonly.
> >> +     *
> >> +     * TYPE_REMOTE_MACHINE allows emulating devices from multiple
> >> +     * VM clients,
> > 
> > what are VM clients?
> 
> It’s the client in the client - server model explained above.
> 
> Thank you!
> --
> Jag
> 
> > 
> >> as such it needs the PCI buses in the same machine
> >> +     * to be part of different CPU address spaces. The following is
> >> +     * useful in that scenario.
> >> +     *
> >> +     */
> >> +    AddressSpace *isol_as_mem;
> >> +    AddressSpace *isol_as_io;
> >> +
> >>     QLIST_HEAD(, PCIBus) child; /* this will be replaced by qdev later */
> >>     QLIST_ENTRY(PCIBus) sibling;/* this will be replaced by qdev later */
> >> 
> >> diff --git a/hw/pci/pci.c b/hw/pci/pci.c
> >> index 5d30f9ca60..d5f1c6c421 100644
> >> --- a/hw/pci/pci.c
> >> +++ b/hw/pci/pci.c
> >> @@ -442,6 +442,8 @@ static void pci_root_bus_internal_init(PCIBus *bus, DeviceState *parent,
> >>     bus->slot_reserved_mask = 0x0;
> >>     bus->address_space_mem = address_space_mem;
> >>     bus->address_space_io = address_space_io;
> >> +    bus->isol_as_mem = NULL;
> >> +    bus->isol_as_io = NULL;
> >>     bus->flags |= PCI_BUS_IS_ROOT;
> >> 
> >>     /* host bridge */
> >> @@ -2676,6 +2678,16 @@ MemoryRegion *pci_address_space_io(PCIDevice *dev)
> >>     return pci_get_bus(dev)->address_space_io;
> >> }
> >> 
> >> +AddressSpace *pci_isol_as_mem(PCIDevice *dev)
> >> +{
> >> +    return pci_get_bus(dev)->isol_as_mem;
> >> +}
> >> +
> >> +AddressSpace *pci_isol_as_io(PCIDevice *dev)
> >> +{
> >> +    return pci_get_bus(dev)->isol_as_io;
> >> +}
> >> +
> >> static void pci_device_class_init(ObjectClass *klass, void *data)
> >> {
> >>     DeviceClass *k = DEVICE_CLASS(klass);
> >> @@ -2699,6 +2711,7 @@ static void pci_device_class_base_init(ObjectClass *klass, void *data)
> >> 
> >> AddressSpace *pci_device_iommu_address_space(PCIDevice *dev)
> >> {
> >> +    AddressSpace *iommu_as = NULL;
> >>     PCIBus *bus = pci_get_bus(dev);
> >>     PCIBus *iommu_bus = bus;
> >>     uint8_t devfn = dev->devfn;
> >> @@ -2745,6 +2758,10 @@ AddressSpace *pci_device_iommu_address_space(PCIDevice *dev)
> >>     if (!pci_bus_bypass_iommu(bus) && iommu_bus && iommu_bus->iommu_fn) {
> >>         return iommu_bus->iommu_fn(bus, iommu_bus->iommu_opaque, devfn);
> >>     }
> >> +    iommu_as = pci_isol_as_mem(dev);
> >> +    if (iommu_as) {
> >> +        return iommu_as;
> >> +    }
> >>     return &address_space_memory;
> >> }
> >> 
> >> diff --git a/hw/pci/pci_bridge.c b/hw/pci/pci_bridge.c
> >> index da34c8ebcd..98366768d2 100644
> >> --- a/hw/pci/pci_bridge.c
> >> +++ b/hw/pci/pci_bridge.c
> >> @@ -383,6 +383,11 @@ void pci_bridge_initfn(PCIDevice *dev, const char *typename)
> >>     sec_bus->address_space_io = &br->address_space_io;
> >>     memory_region_init(&br->address_space_io, OBJECT(br), "pci_bridge_io",
> >>                        4 * GiB);
> >> +
> >> +    /* This PCI bridge puts the sec_bus in its parent's address space */
> >> +    sec_bus->isol_as_mem = pci_isol_as_mem(dev);
> >> +    sec_bus->isol_as_io = pci_isol_as_io(dev);
> >> +
> >>     br->windows = pci_bridge_region_init(br);
> >>     QLIST_INIT(&sec_bus->child);
> >>     QLIST_INSERT_HEAD(&parent->child, sec_bus, sibling);
> >> -- 
> >> 2.20.1
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH v5 07/18] vfio-user: set qdev bus callbacks for remote machine
  2022-01-25 10:44   ` Stefan Hajnoczi
@ 2022-01-25 21:12     ` Jag Raman
  2022-01-26  9:37       ` Stefan Hajnoczi
  0 siblings, 1 reply; 99+ messages in thread
From: Jag Raman @ 2022-01-25 21:12 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: eduardo, Elena Ufimtseva, John Johnson, berrange, Beraldo Leal,
	john.levon, mst, armbru, quintela, qemu-devel,
	Philippe Mathieu-Daudé,
	Marc-André Lureau, thanos.makatos, Paolo Bonzini, eblake,
	dgilbert



> On Jan 25, 2022, at 5:44 AM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
> 
> On Wed, Jan 19, 2022 at 04:41:56PM -0500, Jagannathan Raman wrote:
>> Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
>> Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
>> Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
>> ---
>> hw/remote/machine.c | 57 +++++++++++++++++++++++++++++++++++++++++++++
>> 1 file changed, 57 insertions(+)
>> 
>> diff --git a/hw/remote/machine.c b/hw/remote/machine.c
>> index 220ff01aa9..221a8430c1 100644
>> --- a/hw/remote/machine.c
>> +++ b/hw/remote/machine.c
>> @@ -22,6 +22,60 @@
>> #include "hw/pci/pci_host.h"
>> #include "hw/remote/iohub.h"
>> 
>> +static bool remote_machine_get_bus(const char *type, BusState **bus,
>> +                                   Error **errp)
>> +{
>> +    ERRP_GUARD();
>> +    RemoteMachineState *s = REMOTE_MACHINE(current_machine);
>> +    BusState *root_bus = NULL;
>> +    PCIBus *new_pci_bus = NULL;
>> +
>> +    if (!bus) {
>> +        error_setg(errp, "Invalid argument");
>> +        return false;
>> +    }
>> +
>> +    if (strcmp(type, TYPE_PCI_BUS) && strcmp(type, TYPE_PCI_BUS)) {
>> +        return true;
>> +    }
>> +
>> +    root_bus = qbus_find_recursive(sysbus_get_default(), NULL, TYPE_PCIE_BUS);
>> +    if (!root_bus) {
>> +        error_setg(errp, "Unable to find root PCI device");
>> +        return false;
>> +    }
>> +
>> +    new_pci_bus = pci_isol_bus_new(root_bus, type, errp);
>> +    if (!new_pci_bus) {
>> +        return false;
>> +    }
>> +
>> +    *bus = BUS(new_pci_bus);
>> +
>> +    pci_bus_irqs(new_pci_bus, remote_iohub_set_irq, remote_iohub_map_irq,
>> +                 &s->iohub, REMOTE_IOHUB_NB_PIRQS);
>> +
>> +    return true;
>> +}
> 
> Can the user create the same PCI bus via QMP commands? If so, then this

I think there is a way we could achieve it.

When I looked around, both the command line and the QMP didn’t have a direct
way to create a bus. However, there are some indirect ways. For example, the
TYPE_LSI53C895A device creates a SCSI bus to attach SCSI devices. Similarly,
there are some special PCI devices like TYPE_PCI_BRIDGE which create a
secondary PCI bus.

Similarly, we could implement a PCI device that creates a PCI bus with
isolated address spaces.

> is just a convenience that saves the extra step. Or is there some magic
> that cannot be done via QMP device_add?
> 
> I'm asking because there are 3 objects involved and I'd like to
> understand the lifecycle/dependencies:
> 1. The PCIDevice we wish to export.
> 2. The PCIBus with isolated address spaces that contains the PCIDevice.
> 3. The vfio-user server that exports a given PCIDevice.
> 
> Users can already create the PCIDevice via hotplug and the vfio-user
> server via object-add. So if there's no magic they could also create the
> PCI bus:
> 1. device_add ...some PCI bus stuff here...,id=isol-pci-bus0
> 2. device_add ...the PCIDevice...,bus=isol-pci-bus0,id=mydev
> 3. object-add x-vfio-user-server,device=mydev

We are able to do 2 & 3 already. We could introduce a PCI device that
creates an isolated PCI bus. That would cover step 1 outlined above.

> 
> Unplug would work in the reverse order.
> 
> It may be more convenient to automatically create a PCIBus when the
> PCIDevice is hotplugged, but this kind of magic also has drawbacks
> (hidden devices, naming collisions, etc).

OK, makes sense.

--
Jag

> 
>> +
>> +static bool remote_machine_put_bus(BusState *bus, Error **errp)
>> +{
>> +    PCIBus *pci_bus = NULL;
>> +
>> +    if (!bus) {
>> +        error_setg(errp, "Invalid argument");
>> +        return false;
>> +    }
>> +
>> +    if (!object_dynamic_cast(OBJECT(bus), TYPE_PCI_BUS)) {
>> +        return true;
>> +    }
>> +
>> +    pci_bus = PCI_BUS(bus);
>> +
>> +    return pci_isol_bus_free(pci_bus, errp);
>> +}
>> +
>> static void remote_machine_init(MachineState *machine)
>> {
>>     MemoryRegion *system_memory, *system_io, *pci_memory;
>> @@ -56,6 +110,9 @@ static void remote_machine_init(MachineState *machine)
>>                  &s->iohub, REMOTE_IOHUB_NB_PIRQS);
>> 
>>     qbus_set_hotplug_handler(BUS(pci_host->bus), OBJECT(s));
>> +
>> +    qdev_set_bus_cbs(remote_machine_get_bus, remote_machine_put_bus,
>> +                     &error_fatal);
>> }
>> 
>> static void remote_machine_pre_plug_cb(HotplugHandler *hotplug_dev,
>> -- 
>> 2.20.1
>> 


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH v5 11/18] vfio-user: find and init PCI device
  2022-01-25 14:48   ` Stefan Hajnoczi
@ 2022-01-26  3:14     ` Jag Raman
  0 siblings, 0 replies; 99+ messages in thread
From: Jag Raman @ 2022-01-26  3:14 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: eduardo, Elena Ufimtseva, John Johnson, berrange, bleal,
	john.levon, mst, armbru, quintela, qemu-devel,
	Philippe Mathieu-Daudé,
	Marc-André Lureau, thanos.makatos, pbonzini, eblake,
	dgilbert



> On Jan 25, 2022, at 9:48 AM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
> 
> On Wed, Jan 19, 2022 at 04:42:00PM -0500, Jagannathan Raman wrote:
>> Find the PCI device with specified id. Initialize the device context
>> with the QEMU PCI device
>> 
>> Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
>> Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
>> Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
>> ---
>> hw/remote/vfio-user-obj.c | 60 +++++++++++++++++++++++++++++++++++++++
>> 1 file changed, 60 insertions(+)
>> 
>> diff --git a/hw/remote/vfio-user-obj.c b/hw/remote/vfio-user-obj.c
>> index 810a7c3943..10db78eb8d 100644
>> --- a/hw/remote/vfio-user-obj.c
>> +++ b/hw/remote/vfio-user-obj.c
>> @@ -44,6 +44,8 @@
>> #include "qemu/notify.h"
>> #include "sysemu/sysemu.h"
>> #include "libvfio-user.h"
>> +#include "hw/qdev-core.h"
>> +#include "hw/pci/pci.h"
>> 
>> #define TYPE_VFU_OBJECT "x-vfio-user-server"
>> OBJECT_DECLARE_TYPE(VfuObject, VfuObjectClass, VFU_OBJECT)
>> @@ -89,6 +91,10 @@ struct VfuObject {
>>     Notifier machine_done;
>> 
>>     vfu_ctx_t *vfu_ctx;
>> +
>> +    PCIDevice *pci_dev;
>> +
>> +    Error *unplug_blocker;
>> };
>> 
>> static void vfu_object_init_ctx(VfuObject *o, Error **errp);
>> @@ -161,6 +167,9 @@ static void vfu_object_machine_done(Notifier *notifier, void *data)
>> static void vfu_object_init_ctx(VfuObject *o, Error **errp)
>> {
>>     ERRP_GUARD();
>> +    DeviceState *dev = NULL;
>> +    vfu_pci_type_t pci_type = VFU_PCI_TYPE_CONVENTIONAL;
>> +    int ret;
>> 
>>     if (o->vfu_ctx || !o->socket || !o->device ||
>>             !phase_check(PHASE_MACHINE_READY)) {
>> @@ -179,6 +188,49 @@ static void vfu_object_init_ctx(VfuObject *o, Error **errp)
>>         error_setg(errp, "vfu: Failed to create context - %s", strerror(errno));
>>         return;
>>     }
>> +
>> +    dev = qdev_find_recursive(sysbus_get_default(), o->device);
>> +    if (dev == NULL) {
>> +        error_setg(errp, "vfu: Device %s not found", o->device);
>> +        goto fail;
>> +    }
>> +
>> +    if (!object_dynamic_cast(OBJECT(dev), TYPE_PCI_DEVICE)) {
>> +        error_setg(errp, "vfu: %s not a PCI device", o->device);
>> +        goto fail;
>> +    }
>> +
>> +    o->pci_dev = PCI_DEVICE(dev);
>> +
>> +    if (pci_is_express(o->pci_dev)) {
>> +        pci_type = VFU_PCI_TYPE_EXPRESS;
>> +    }
>> +
>> +    ret = vfu_pci_init(o->vfu_ctx, pci_type, PCI_HEADER_TYPE_NORMAL, 0);
>> +    if (ret < 0) {
>> +        error_setg(errp,
>> +                   "vfu: Failed to attach PCI device %s to context - %s",
>> +                   o->device, strerror(errno));
>> +        goto fail;
>> +    }
>> +
>> +    error_setg(&o->unplug_blocker, "%s is in use", o->device);
> 
> More detailed error message:
> "x-vfio-user-server for %s must be deleted before unplugging"

Got it, thank you!

--
Jag

> 
> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>



^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH v5 12/18] vfio-user: run vfio-user context
  2022-01-25 15:10   ` Stefan Hajnoczi
@ 2022-01-26  3:26     ` Jag Raman
  0 siblings, 0 replies; 99+ messages in thread
From: Jag Raman @ 2022-01-26  3:26 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: eduardo, Elena Ufimtseva, John Johnson, berrange, bleal,
	john.levon, mst, armbru, quintela, qemu-devel,
	Philippe Mathieu-Daudé,
	Marc-André Lureau, thanos.makatos, Paolo Bonzini, eblake,
	dgilbert



> On Jan 25, 2022, at 10:10 AM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
> 
> On Wed, Jan 19, 2022 at 04:42:01PM -0500, Jagannathan Raman wrote:
>> Setup a handler to run vfio-user context. The context is driven by
>> messages to the file descriptor associated with it - get the fd for
>> the context and hook up the handler with it
>> 
>> Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
>> Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
>> Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
>> ---
>> qapi/misc.json            | 23 ++++++++++
>> hw/remote/vfio-user-obj.c | 90 ++++++++++++++++++++++++++++++++++++++-
>> 2 files changed, 112 insertions(+), 1 deletion(-)
>> 
>> diff --git a/qapi/misc.json b/qapi/misc.json
>> index e8054f415b..f0791d3311 100644
>> --- a/qapi/misc.json
>> +++ b/qapi/misc.json
>> @@ -527,3 +527,26 @@
>>  'data': { '*option': 'str' },
>>  'returns': ['CommandLineOptionInfo'],
>>  'allow-preconfig': true }
>> +
>> +##
>> +# @VFU_CLIENT_HANGUP:
>> +#
>> +# Emitted when the client of a TYPE_VFIO_USER_SERVER closes the
>> +# communication channel
>> +#
>> +# @device: ID of attached PCI device
>> +#
>> +# @path: path of the socket
> 
> This assumes a UNIX domain socket path was given. It doesn't work well
> with file descriptor passing. The x-vfio-user-server is an object with
> a unique QEMU Object Model path (the last path component is its id). You
> can get the id like this:
> 
>  object_get_canonical_path_component(OBJECT(o))

I was also wondering how to get the object ID. Thank you for the pointer!

> 
> I suggest dropping @path and including the server object's id instead.

OK, will do.

> 
>> +#
>> +# Since: 6.3
>> +#
>> +# Example:
>> +#
>> +# <- { "event": "VFU_CLIENT_HANGUP",
>> +#      "data": { "device": "lsi1",
>> +#                "path": "/tmp/vfu1-sock" },
>> +#      "timestamp": { "seconds": 1265044230, "microseconds": 450486 } }
>> +#
>> +##
>> +{ 'event': 'VFU_CLIENT_HANGUP',
>> +  'data': { 'device': 'str', 'path': 'str' } }
>> diff --git a/hw/remote/vfio-user-obj.c b/hw/remote/vfio-user-obj.c
>> index 10db78eb8d..91d49a221f 100644
>> --- a/hw/remote/vfio-user-obj.c
>> +++ b/hw/remote/vfio-user-obj.c
>> @@ -27,6 +27,9 @@
>>  *
>>  * device - id of a device on the server, a required option. PCI devices
>>  *          alone are supported presently.
>> + *
>> + * notes - x-vfio-user-server could block IO and monitor during the
>> + *         initialization phase.
>>  */
>> 
>> #include "qemu/osdep.h"
>> @@ -41,11 +44,14 @@
>> #include "hw/remote/machine.h"
>> #include "qapi/error.h"
>> #include "qapi/qapi-visit-sockets.h"
>> +#include "qapi/qapi-events-misc.h"
>> #include "qemu/notify.h"
>> +#include "qemu/thread.h"
>> #include "sysemu/sysemu.h"
>> #include "libvfio-user.h"
>> #include "hw/qdev-core.h"
>> #include "hw/pci/pci.h"
>> +#include "qemu/timer.h"
>> 
>> #define TYPE_VFU_OBJECT "x-vfio-user-server"
>> OBJECT_DECLARE_TYPE(VfuObject, VfuObjectClass, VFU_OBJECT)
>> @@ -95,6 +101,8 @@ struct VfuObject {
>>     PCIDevice *pci_dev;
>> 
>>     Error *unplug_blocker;
>> +
>> +    int vfu_poll_fd;
>> };
>> 
>> static void vfu_object_init_ctx(VfuObject *o, Error **errp);
>> @@ -144,6 +152,68 @@ static void vfu_object_set_device(Object *obj, const char *str, Error **errp)
>>     vfu_object_init_ctx(o, errp);
>> }
>> 
>> +static void vfu_object_ctx_run(void *opaque)
>> +{
>> +    VfuObject *o = opaque;
>> +    int ret = -1;
>> +
>> +    while (ret != 0) {
>> +        ret = vfu_run_ctx(o->vfu_ctx);
>> +        if (ret < 0) {
>> +            if (errno == EINTR) {
>> +                continue;
>> +            } else if (errno == ENOTCONN) {
>> +                qapi_event_send_vfu_client_hangup(o->device,
>> +                                                  o->socket->u.q_unix.path);
>> +                qemu_set_fd_handler(o->vfu_poll_fd, NULL, NULL, NULL);
> 
> Do we also stop monitoring o->vfu_poll_fd when object-del is used to
> delete the x-vfio-user-server object?

Yes, we should to stop monitoring the o->vfu_poll_fd during object-del. Will do so.

--
Jag



^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH v5 18/18] vfio-user: avocado tests for vfio-user
  2022-01-19 21:42 ` [PATCH v5 18/18] vfio-user: avocado tests for vfio-user Jagannathan Raman
@ 2022-01-26  4:25   ` Philippe Mathieu-Daudé via
  2022-01-26 15:12     ` Jag Raman
  0 siblings, 1 reply; 99+ messages in thread
From: Philippe Mathieu-Daudé via @ 2022-01-26  4:25 UTC (permalink / raw)
  To: Jagannathan Raman, qemu-devel
  Cc: stefanha, marcandre.lureau, pbonzini, bleal, berrange, eduardo,
	mst, marcel.apfelbaum, eblake, armbru, quintela, dgilbert,
	john.levon, thanos.makatos, elena.ufimtseva, john.g.johnson

Hi Jagannathan,

On 19/1/22 22:42, Jagannathan Raman wrote:
> Avocado tests for libvfio-user in QEMU - tests startup,
> hotplug and migration of the server object
> 
> Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
> Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
> Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
> ---
>   MAINTAINERS                |   1 +
>   tests/avocado/vfio-user.py | 225 +++++++++++++++++++++++++++++++++++++
>   2 files changed, 226 insertions(+)
>   create mode 100644 tests/avocado/vfio-user.py

> +class VfioUser(QemuSystemTest):
> +    """
> +    :avocado: tags=vfiouser
> +    """
> +    KERNEL_COMMON_COMMAND_LINE = 'printk.time=0 '
> +    timeout = 20
> +
> +    @staticmethod
> +    def migration_finished(vm):
> +        res = vm.command('query-migrate')
> +        if 'status' in res:
> +            return res['status'] in ('completed', 'failed')

Do we need to check for failed migration in do_test_migrate()?

> +        else:
> +            return False

[...]

> +    def launch_server_hotplug(self, socket):
> +        server_vm = self.get_vm()
> +        server_vm.add_args('-machine', 'x-remote')
> +        server_vm.add_args('-nodefaults')
> +        server_vm.add_args('-device', 'lsi53c895a,id=lsi1')
> +        server_vm.launch()
> +        server_vm.command('human-monitor-command',
> +                          command_line='object_add x-vfio-user-server,'

Why not use qmp('object-add', ...) directly?

> +                                       'id=vfioobj,socket.type=unix,'
> +                                       'socket.path='+socket+',device=lsi1')
> +        return server_vm

Otherwise LGTM.


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH v5 00/18] vfio-user server in QEMU
  2022-01-25 16:00 ` [PATCH v5 00/18] vfio-user server in QEMU Stefan Hajnoczi
@ 2022-01-26  5:04   ` Jag Raman
  2022-01-26  9:56     ` Stefan Hajnoczi
  0 siblings, 1 reply; 99+ messages in thread
From: Jag Raman @ 2022-01-26  5:04 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: eduardo, Elena Ufimtseva, John Johnson, berrange, Beraldo Leal,
	john.levon, mst, peter.maydell, armbru, quintela, qemu-devel,
	Philippe Mathieu-Daudé,
	Marc-André Lureau, thanos.makatos, Paolo Bonzini, kwolf,
	eblake, dgilbert



> On Jan 25, 2022, at 11:00 AM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
> 
> Hi Jag,
> Thanks for this latest revision. The biggest outstanding question I have
> is about the isolated address spaces design.

Thank you for taking the time to review the patches, Stefan!

> 
> This patch series needs a PCIBus with its own Memory Space, I/O Space,
> and interrupts. That way a single QEMU process can host vfio-user
> servers that different VMs connect to. They all need isolated address
> spaces so that mapping a BAR in Device A does not conflict with mapping
> a BAR in Device B.
> 
> The current approach adds special code to hw/pci/pci.c so that custom
> AddressSpace can be set up. The isolated PCIBus is an automatically
> created PCIe root port that's a child of the machine's main PCI bus. On
> one hand it's neat because QEMU's assumption that there is only one
> root SysBus isn't violated. On the other hand it seems like a special
> case hack for PCI and I'm not sure in what sense these PCIBusses are
> really children of the machine's main PCI bus since they don't share or
> interact in any way.

We are discussing the automatic creation part you just mentioned in
the following email:
[PATCH v5 07/18] vfio-user: set qdev bus callbacks for remote machine

I agree that automatic creation of a parent bus is not ideal - we could
specify the parent bus as a separate option in the command-line or
QMP. This change would avoid modification to hw/pci/pci.c - the new
PCI bus could be created inplace during device creation/hotplug.

The following image gives an idea of the bus/device topology in the remote
machine, as implemented in the current series. Each secondary bus and
its children have isolated memory and IO spaces.
https://gitlab.com/jraman/qemu/-/commit/2e2ebf004894075ad8044739b0b16ce875114c4c

> 
> Another approach that came to mind is to allow multiple root SysBusses.
> Each vfio-user server would need its own SysBus and put a regular PCI
> host onto that isolated SysBus without modifying hw/pci/pci.c with a
> special case. The downside to this is that violating the single SysBus
> assumption probably breaks monitor commands that rely on
> qdev_find_recursive() and friends. It seems cleaner than adding isolated
> address spaces to PCI specifically, but also raises the question if
> multiple machine instances are needed (which would raise even more
> questions).

Based on further discussion with Stefan, I got some clarity. We could consider one
more option as well - somewhere in-between multiple root SysBuses and the topology
discussed above (with secondary PCI buses). We could implement a
TYPE_SYS_BUS_DEVICE that creates a root PCI bus with isolated memory ranges.
Something along the lines in the following diagram:
https://gitlab.com/jraman/qemu/-/commit/81f6a998278a2a795be0db7acdeb1caa2d6744fb

An example set of QMP commands to attach PCI devices would be:
device_add pci-root-bus,id=rb1
device_add <driver>,id=mydev,bus=rb1
object-add x-vfio-user-server,device=mydev

where ‘pci-root-bus’ is a TYPE_SYS_BUS_DEVICE that creates its own root PCI bus.

I’m very sorry if the above two web links disturb your review workflow.

> 
> I wanted to raise this to see if Peter, Kevin, Michael, and others are
> happy with the current approach or have ideas for a clean solution.

Looking forward to your comments.

Thank you!
--
Jag

> 
> Stefan


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH v5 03/18] pci: isolated address space for PCI bus
  2022-01-25 18:38       ` Dr. David Alan Gilbert
@ 2022-01-26  5:27         ` Jag Raman
  2022-01-26  9:45           ` Stefan Hajnoczi
  2022-01-26 18:13           ` Dr. David Alan Gilbert
  0 siblings, 2 replies; 99+ messages in thread
From: Jag Raman @ 2022-01-26  5:27 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: eduardo, Elena Ufimtseva, John Johnson, Daniel P. Berrangé,
	Beraldo Leal, john.levon, Michael S. Tsirkin, armbru, quintela,
	Philippe Mathieu-Daudé,
	qemu-devel, Marc-André Lureau, Stefan Hajnoczi,
	thanos.makatos, Paolo Bonzini, Eric Blake



> On Jan 25, 2022, at 1:38 PM, Dr. David Alan Gilbert <dgilbert@redhat.com> wrote:
> 
> * Jag Raman (jag.raman@oracle.com) wrote:
>> 
>> 
>>> On Jan 19, 2022, at 7:12 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
>>> 
>>> On Wed, Jan 19, 2022 at 04:41:52PM -0500, Jagannathan Raman wrote:
>>>> Allow PCI buses to be part of isolated CPU address spaces. This has a
>>>> niche usage.
>>>> 
>>>> TYPE_REMOTE_MACHINE allows multiple VMs to house their PCI devices in
>>>> the same machine/server. This would cause address space collision as
>>>> well as be a security vulnerability. Having separate address spaces for
>>>> each PCI bus would solve this problem.
>>> 
>>> Fascinating, but I am not sure I understand. any examples?
>> 
>> Hi Michael!
>> 
>> multiprocess QEMU and vfio-user implement a client-server model to allow
>> out-of-process emulation of devices. The client QEMU, which makes ioctls
>> to the kernel and runs VCPUs, could attach devices running in a server
>> QEMU. The server QEMU needs access to parts of the client’s RAM to
>> perform DMA.
> 
> Do you ever have the opposite problem? i.e. when an emulated PCI device

That’s an interesting question.

> exposes a chunk of RAM-like space (frame buffer, or maybe a mapped file)
> that the client can see.  What happens if two emulated devices need to
> access each others emulated address space?

In this case, the kernel driver would map the destination’s chunk of internal RAM into
the DMA space of the source device. Then the source device could write to that
mapped address range, and the IOMMU should direct those writes to the
destination device.

I would like to take a closer look at the IOMMU implementation on how to achieve
this, and get back to you. I think the IOMMU would handle this. Could you please
point me to the IOMMU implementation you have in mind?

Thank you!
--
Jag

> 
> Dave
> 
>> In the case where multiple clients attach devices that are running on the
>> same server, we need to ensure that each devices has isolated memory
>> ranges. This ensures that the memory space of one device is not visible
>> to other devices in the same server.
>> 
>>> 
>>> I also wonder whether this special type could be modelled like a special
>>> kind of iommu internally.
>> 
>> Could you please provide some more details on the design?
>> 
>>> 
>>>> Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
>>>> Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
>>>> Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
>>>> ---
>>>> include/hw/pci/pci.h     |  2 ++
>>>> include/hw/pci/pci_bus.h | 17 +++++++++++++++++
>>>> hw/pci/pci.c             | 17 +++++++++++++++++
>>>> hw/pci/pci_bridge.c      |  5 +++++
>>>> 4 files changed, 41 insertions(+)
>>>> 
>>>> diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
>>>> index 023abc0f79..9bb4472abc 100644
>>>> --- a/include/hw/pci/pci.h
>>>> +++ b/include/hw/pci/pci.h
>>>> @@ -387,6 +387,8 @@ void pci_device_save(PCIDevice *s, QEMUFile *f);
>>>> int pci_device_load(PCIDevice *s, QEMUFile *f);
>>>> MemoryRegion *pci_address_space(PCIDevice *dev);
>>>> MemoryRegion *pci_address_space_io(PCIDevice *dev);
>>>> +AddressSpace *pci_isol_as_mem(PCIDevice *dev);
>>>> +AddressSpace *pci_isol_as_io(PCIDevice *dev);
>>>> 
>>>> /*
>>>> * Should not normally be used by devices. For use by sPAPR target
>>>> diff --git a/include/hw/pci/pci_bus.h b/include/hw/pci/pci_bus.h
>>>> index 347440d42c..d78258e79e 100644
>>>> --- a/include/hw/pci/pci_bus.h
>>>> +++ b/include/hw/pci/pci_bus.h
>>>> @@ -39,9 +39,26 @@ struct PCIBus {
>>>>    void *irq_opaque;
>>>>    PCIDevice *devices[PCI_SLOT_MAX * PCI_FUNC_MAX];
>>>>    PCIDevice *parent_dev;
>>>> +
>>>>    MemoryRegion *address_space_mem;
>>>>    MemoryRegion *address_space_io;
>>>> 
>>>> +    /**
>>>> +     * Isolated address spaces - these allow the PCI bus to be part
>>>> +     * of an isolated address space as opposed to the global
>>>> +     * address_space_memory & address_space_io.
>>> 
>>> Are you sure address_space_memory & address_space_io are
>>> always global? even in the case of an iommu?
>> 
>> On the CPU side of the Root Complex, I believe address_space_memory
>> & address_space_io are global.
>> 
>> In the vfio-user case, devices on the same machine (TYPE_REMOTE_MACHINE)
>> could be attached to different clients VMs. Each client would have their own address
>> space for their CPUs. With isolated address spaces, we ensure that the devices
>> see the address space of the CPUs they’re attached to.
>> 
>> Not sure if it’s OK to share weblinks in this mailing list, please let me know if that’s
>> not preferred. But I’m referring to the terminology used in the following block diagram:
>> https://en.wikipedia.org/wiki/Root_complex#/media/File:Example_PCI_Express_Topology.svg
>> 
>>> 
>>>> This allows the
>>>> +     * bus to be attached to CPUs from different machines. The
>>>> +     * following is not used used commonly.
>>>> +     *
>>>> +     * TYPE_REMOTE_MACHINE allows emulating devices from multiple
>>>> +     * VM clients,
>>> 
>>> what are VM clients?
>> 
>> It’s the client in the client - server model explained above.
>> 
>> Thank you!
>> --
>> Jag
>> 
>>> 
>>>> as such it needs the PCI buses in the same machine
>>>> +     * to be part of different CPU address spaces. The following is
>>>> +     * useful in that scenario.
>>>> +     *
>>>> +     */
>>>> +    AddressSpace *isol_as_mem;
>>>> +    AddressSpace *isol_as_io;
>>>> +
>>>>    QLIST_HEAD(, PCIBus) child; /* this will be replaced by qdev later */
>>>>    QLIST_ENTRY(PCIBus) sibling;/* this will be replaced by qdev later */
>>>> 
>>>> diff --git a/hw/pci/pci.c b/hw/pci/pci.c
>>>> index 5d30f9ca60..d5f1c6c421 100644
>>>> --- a/hw/pci/pci.c
>>>> +++ b/hw/pci/pci.c
>>>> @@ -442,6 +442,8 @@ static void pci_root_bus_internal_init(PCIBus *bus, DeviceState *parent,
>>>>    bus->slot_reserved_mask = 0x0;
>>>>    bus->address_space_mem = address_space_mem;
>>>>    bus->address_space_io = address_space_io;
>>>> +    bus->isol_as_mem = NULL;
>>>> +    bus->isol_as_io = NULL;
>>>>    bus->flags |= PCI_BUS_IS_ROOT;
>>>> 
>>>>    /* host bridge */
>>>> @@ -2676,6 +2678,16 @@ MemoryRegion *pci_address_space_io(PCIDevice *dev)
>>>>    return pci_get_bus(dev)->address_space_io;
>>>> }
>>>> 
>>>> +AddressSpace *pci_isol_as_mem(PCIDevice *dev)
>>>> +{
>>>> +    return pci_get_bus(dev)->isol_as_mem;
>>>> +}
>>>> +
>>>> +AddressSpace *pci_isol_as_io(PCIDevice *dev)
>>>> +{
>>>> +    return pci_get_bus(dev)->isol_as_io;
>>>> +}
>>>> +
>>>> static void pci_device_class_init(ObjectClass *klass, void *data)
>>>> {
>>>>    DeviceClass *k = DEVICE_CLASS(klass);
>>>> @@ -2699,6 +2711,7 @@ static void pci_device_class_base_init(ObjectClass *klass, void *data)
>>>> 
>>>> AddressSpace *pci_device_iommu_address_space(PCIDevice *dev)
>>>> {
>>>> +    AddressSpace *iommu_as = NULL;
>>>>    PCIBus *bus = pci_get_bus(dev);
>>>>    PCIBus *iommu_bus = bus;
>>>>    uint8_t devfn = dev->devfn;
>>>> @@ -2745,6 +2758,10 @@ AddressSpace *pci_device_iommu_address_space(PCIDevice *dev)
>>>>    if (!pci_bus_bypass_iommu(bus) && iommu_bus && iommu_bus->iommu_fn) {
>>>>        return iommu_bus->iommu_fn(bus, iommu_bus->iommu_opaque, devfn);
>>>>    }
>>>> +    iommu_as = pci_isol_as_mem(dev);
>>>> +    if (iommu_as) {
>>>> +        return iommu_as;
>>>> +    }
>>>>    return &address_space_memory;
>>>> }
>>>> 
>>>> diff --git a/hw/pci/pci_bridge.c b/hw/pci/pci_bridge.c
>>>> index da34c8ebcd..98366768d2 100644
>>>> --- a/hw/pci/pci_bridge.c
>>>> +++ b/hw/pci/pci_bridge.c
>>>> @@ -383,6 +383,11 @@ void pci_bridge_initfn(PCIDevice *dev, const char *typename)
>>>>    sec_bus->address_space_io = &br->address_space_io;
>>>>    memory_region_init(&br->address_space_io, OBJECT(br), "pci_bridge_io",
>>>>                       4 * GiB);
>>>> +
>>>> +    /* This PCI bridge puts the sec_bus in its parent's address space */
>>>> +    sec_bus->isol_as_mem = pci_isol_as_mem(dev);
>>>> +    sec_bus->isol_as_io = pci_isol_as_io(dev);
>>>> +
>>>>    br->windows = pci_bridge_region_init(br);
>>>>    QLIST_INIT(&sec_bus->child);
>>>>    QLIST_INSERT_HEAD(&parent->child, sec_bus, sibling);
>>>> -- 
>>>> 2.20.1
>> 
> -- 
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> 


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH v5 05/18] qdev: unplug blocker for devices
  2022-01-25 14:43     ` Jag Raman
@ 2022-01-26  9:32       ` Stefan Hajnoczi
  2022-01-26 15:13         ` Jag Raman
  0 siblings, 1 reply; 99+ messages in thread
From: Stefan Hajnoczi @ 2022-01-26  9:32 UTC (permalink / raw)
  To: Jag Raman
  Cc: eduardo, Elena Ufimtseva, John Johnson, berrange, bleal,
	john.levon, mst, armbru, quintela, qemu-devel,
	Philippe Mathieu-Daudé,
	Marc-André Lureau, thanos.makatos, Paolo Bonzini, eblake,
	dgilbert

[-- Attachment #1: Type: text/plain, Size: 3150 bytes --]

On Tue, Jan 25, 2022 at 02:43:33PM +0000, Jag Raman wrote:
> 
> 
> > On Jan 25, 2022, at 5:27 AM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > 
> > On Wed, Jan 19, 2022 at 04:41:54PM -0500, Jagannathan Raman wrote:
> >> Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
> >> Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
> >> Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
> >> ---
> >> include/hw/qdev-core.h |  5 +++++
> >> softmmu/qdev-monitor.c | 35 +++++++++++++++++++++++++++++++++++
> >> 2 files changed, 40 insertions(+)
> >> 
> >> diff --git a/include/hw/qdev-core.h b/include/hw/qdev-core.h
> >> index eed2983072..67df5e0081 100644
> >> --- a/include/hw/qdev-core.h
> >> +++ b/include/hw/qdev-core.h
> >> @@ -193,6 +193,7 @@ struct DeviceState {
> >>     int instance_id_alias;
> >>     int alias_required_for_version;
> >>     ResettableState reset;
> >> +    GSList *unplug_blockers;
> >> };
> >> 
> >> struct DeviceListener {
> >> @@ -433,6 +434,10 @@ typedef bool (QDevPutBusFunc)(BusState *bus, Error **errp);
> >> bool qdev_set_bus_cbs(QDevGetBusFunc *get_bus, QDevPutBusFunc *put_bus,
> >>                       Error **errp);
> >> 
> >> +int qdev_add_unplug_blocker(DeviceState *dev, Error *reason, Error **errp);
> >> +void qdev_del_unplug_blocker(DeviceState *dev, Error *reason);
> >> +bool qdev_unplug_blocked(DeviceState *dev, Error **errp);
> >> +
> >> /**
> >>  * GpioPolarity: Polarity of a GPIO line
> >>  *
> >> diff --git a/softmmu/qdev-monitor.c b/softmmu/qdev-monitor.c
> >> index 7306074019..1a169f89a2 100644
> >> --- a/softmmu/qdev-monitor.c
> >> +++ b/softmmu/qdev-monitor.c
> >> @@ -978,10 +978,45 @@ void qmp_device_del(const char *id, Error **errp)
> >>             return;
> >>         }
> >> 
> >> +        if (qdev_unplug_blocked(dev, errp)) {
> >> +            return;
> >> +        }
> >> +
> >>         qdev_unplug(dev, errp);
> >>     }
> >> }
> >> 
> >> +int qdev_add_unplug_blocker(DeviceState *dev, Error *reason, Error **errp)
> >> +{
> >> +    ERRP_GUARD();
> >> +
> >> +    if (!migration_is_idle()) {
> >> +        error_setg(errp, "migration is in progress");
> >> +        return -EBUSY;
> >> +    }
> > 
> > Why can this function not be called during migration?
> 
> Since ‘unplug_blockers' is a member of the device, I thought it wouldn’t be correct to
> allow changes to the device's state during migration.
> 
> I did weigh the following reasons against adding this check:
>   - unplug_blockers is not migrated to the destination anyway, so it doesn’t matter if
>     it changes after migration starts

Yes.

>   - whichever code/object that needs to add the blocker could add it at the destination
>     if needed

Yes.

> However, unlike qmp_device_add(), qmp_object_add() doesn’t reject during
> migration. As such, an object could add a blocker for the device when migration is
> in progress.
> 
> Would you prefer to throw a warning, or fully remove this test?

Adding an unplug blocker during migration seems safe to me. I would
remove this test.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH v5 06/18] vfio-user: add HotplugHandler for remote machine
  2022-01-25 18:12     ` Jag Raman
@ 2022-01-26  9:35       ` Stefan Hajnoczi
  2022-01-26 15:20         ` Jag Raman
  0 siblings, 1 reply; 99+ messages in thread
From: Stefan Hajnoczi @ 2022-01-26  9:35 UTC (permalink / raw)
  To: Jag Raman
  Cc: eduardo, Elena Ufimtseva, John Johnson, berrange, bleal,
	john.levon, mst, armbru, quintela, qemu-devel,
	Philippe Mathieu-Daudé,
	Marc-André Lureau, thanos.makatos, Paolo Bonzini, eblake,
	dgilbert

[-- Attachment #1: Type: text/plain, Size: 1746 bytes --]

On Tue, Jan 25, 2022 at 06:12:48PM +0000, Jag Raman wrote:
> 
> 
> > On Jan 25, 2022, at 5:32 AM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > 
> > On Wed, Jan 19, 2022 at 04:41:55PM -0500, Jagannathan Raman wrote:
> >> Allow hotplugging of PCI(e) devices to remote machine
> >> 
> >> Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
> >> Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
> >> Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
> >> ---
> >> hw/remote/machine.c | 29 +++++++++++++++++++++++++++++
> >> 1 file changed, 29 insertions(+)
> > 
> > Why is this code necessary? I expected the default hotplug behavior to
> 
> I just discovered that TYPE_REMOTE_MACHINE wasn't setting up a hotplug
> handler for the root PCI bus.
> 
> Looks like, some of the machines don’t support hotplugging PCI devices. I see
> that the ‘pc’ machine does support hotplug, whereas ‘q35’ does not.

Hotplug is definitely possible with q35. I'm not familiar with the
hotplug code though so I don't know how exactly that works for q35.

> We didn’t check hotplug in multiprocess-qemu previously because it was limited
> to one device per process, and the use cases attached the devices via
> command line.
> 
> > pretty much handle this case - hotplugging device types that the bus
> > doesn't support should fail and unplug should already unparent/unrealize
> > the device.
> 
> OK, that makes sense. We don’t need to test the device type during
> plug and unplug.
> 
> Therefore, I don’t think we need a callback for the plug operation. We
> could set HotplugHandlerClass->unplug callback to the default
> qdev_simple_device_unplug_cb() callback.

Great!

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH v5 07/18] vfio-user: set qdev bus callbacks for remote machine
  2022-01-25 21:12     ` Jag Raman
@ 2022-01-26  9:37       ` Stefan Hajnoczi
  2022-01-26 15:51         ` Jag Raman
  0 siblings, 1 reply; 99+ messages in thread
From: Stefan Hajnoczi @ 2022-01-26  9:37 UTC (permalink / raw)
  To: Jag Raman
  Cc: eduardo, Elena Ufimtseva, John Johnson, berrange, Beraldo Leal,
	john.levon, mst, armbru, quintela, qemu-devel,
	Philippe Mathieu-Daudé,
	Marc-André Lureau, thanos.makatos, Paolo Bonzini, eblake,
	dgilbert

[-- Attachment #1: Type: text/plain, Size: 3875 bytes --]

On Tue, Jan 25, 2022 at 09:12:28PM +0000, Jag Raman wrote:
> 
> 
> > On Jan 25, 2022, at 5:44 AM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > 
> > On Wed, Jan 19, 2022 at 04:41:56PM -0500, Jagannathan Raman wrote:
> >> Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
> >> Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
> >> Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
> >> ---
> >> hw/remote/machine.c | 57 +++++++++++++++++++++++++++++++++++++++++++++
> >> 1 file changed, 57 insertions(+)
> >> 
> >> diff --git a/hw/remote/machine.c b/hw/remote/machine.c
> >> index 220ff01aa9..221a8430c1 100644
> >> --- a/hw/remote/machine.c
> >> +++ b/hw/remote/machine.c
> >> @@ -22,6 +22,60 @@
> >> #include "hw/pci/pci_host.h"
> >> #include "hw/remote/iohub.h"
> >> 
> >> +static bool remote_machine_get_bus(const char *type, BusState **bus,
> >> +                                   Error **errp)
> >> +{
> >> +    ERRP_GUARD();
> >> +    RemoteMachineState *s = REMOTE_MACHINE(current_machine);
> >> +    BusState *root_bus = NULL;
> >> +    PCIBus *new_pci_bus = NULL;
> >> +
> >> +    if (!bus) {
> >> +        error_setg(errp, "Invalid argument");
> >> +        return false;
> >> +    }
> >> +
> >> +    if (strcmp(type, TYPE_PCI_BUS) && strcmp(type, TYPE_PCI_BUS)) {
> >> +        return true;
> >> +    }
> >> +
> >> +    root_bus = qbus_find_recursive(sysbus_get_default(), NULL, TYPE_PCIE_BUS);
> >> +    if (!root_bus) {
> >> +        error_setg(errp, "Unable to find root PCI device");
> >> +        return false;
> >> +    }
> >> +
> >> +    new_pci_bus = pci_isol_bus_new(root_bus, type, errp);
> >> +    if (!new_pci_bus) {
> >> +        return false;
> >> +    }
> >> +
> >> +    *bus = BUS(new_pci_bus);
> >> +
> >> +    pci_bus_irqs(new_pci_bus, remote_iohub_set_irq, remote_iohub_map_irq,
> >> +                 &s->iohub, REMOTE_IOHUB_NB_PIRQS);
> >> +
> >> +    return true;
> >> +}
> > 
> > Can the user create the same PCI bus via QMP commands? If so, then this
> 
> I think there is a way we could achieve it.
> 
> When I looked around, both the command line and the QMP didn’t have a direct
> way to create a bus. However, there are some indirect ways. For example, the
> TYPE_LSI53C895A device creates a SCSI bus to attach SCSI devices. Similarly,
> there are some special PCI devices like TYPE_PCI_BRIDGE which create a
> secondary PCI bus.
> 
> Similarly, we could implement a PCI device that creates a PCI bus with
> isolated address spaces.

Exactly. device_add instantiates DeviceStates, not busses, so there
needs to be a parent device like a SCSI controller, a PCI bridge, etc
that owns and creates the bus.

> > is just a convenience that saves the extra step. Or is there some magic
> > that cannot be done via QMP device_add?
> > 
> > I'm asking because there are 3 objects involved and I'd like to
> > understand the lifecycle/dependencies:
> > 1. The PCIDevice we wish to export.
> > 2. The PCIBus with isolated address spaces that contains the PCIDevice.
> > 3. The vfio-user server that exports a given PCIDevice.
> > 
> > Users can already create the PCIDevice via hotplug and the vfio-user
> > server via object-add. So if there's no magic they could also create the
> > PCI bus:
> > 1. device_add ...some PCI bus stuff here...,id=isol-pci-bus0
> > 2. device_add ...the PCIDevice...,bus=isol-pci-bus0,id=mydev
> > 3. object-add x-vfio-user-server,device=mydev
> 
> We are able to do 2 & 3 already. We could introduce a PCI device that
> creates an isolated PCI bus. That would cover step 1 outlined above.

I wonder if a new device is needed or whether it's possible to add an
isol_as=on|off (default: off) option to an existing PCI bridge/expander?
Hopefully most of the code is already there.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH v5 03/18] pci: isolated address space for PCI bus
  2022-01-26  5:27         ` Jag Raman
@ 2022-01-26  9:45           ` Stefan Hajnoczi
  2022-01-26 20:07             ` Dr. David Alan Gilbert
  2022-01-26 18:13           ` Dr. David Alan Gilbert
  1 sibling, 1 reply; 99+ messages in thread
From: Stefan Hajnoczi @ 2022-01-26  9:45 UTC (permalink / raw)
  To: Jag Raman
  Cc: eduardo, Elena Ufimtseva, John Johnson, Daniel P. Berrangé,
	Beraldo Leal, john.levon, Michael S. Tsirkin, armbru, quintela,
	qemu-devel, Philippe Mathieu-Daudé,
	Marc-André Lureau, thanos.makatos, Paolo Bonzini,
	Eric Blake, Dr. David Alan Gilbert

[-- Attachment #1: Type: text/plain, Size: 2869 bytes --]

On Wed, Jan 26, 2022 at 05:27:32AM +0000, Jag Raman wrote:
> 
> 
> > On Jan 25, 2022, at 1:38 PM, Dr. David Alan Gilbert <dgilbert@redhat.com> wrote:
> > 
> > * Jag Raman (jag.raman@oracle.com) wrote:
> >> 
> >> 
> >>> On Jan 19, 2022, at 7:12 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> >>> 
> >>> On Wed, Jan 19, 2022 at 04:41:52PM -0500, Jagannathan Raman wrote:
> >>>> Allow PCI buses to be part of isolated CPU address spaces. This has a
> >>>> niche usage.
> >>>> 
> >>>> TYPE_REMOTE_MACHINE allows multiple VMs to house their PCI devices in
> >>>> the same machine/server. This would cause address space collision as
> >>>> well as be a security vulnerability. Having separate address spaces for
> >>>> each PCI bus would solve this problem.
> >>> 
> >>> Fascinating, but I am not sure I understand. any examples?
> >> 
> >> Hi Michael!
> >> 
> >> multiprocess QEMU and vfio-user implement a client-server model to allow
> >> out-of-process emulation of devices. The client QEMU, which makes ioctls
> >> to the kernel and runs VCPUs, could attach devices running in a server
> >> QEMU. The server QEMU needs access to parts of the client’s RAM to
> >> perform DMA.
> > 
> > Do you ever have the opposite problem? i.e. when an emulated PCI device
> 
> That’s an interesting question.
> 
> > exposes a chunk of RAM-like space (frame buffer, or maybe a mapped file)
> > that the client can see.  What happens if two emulated devices need to
> > access each others emulated address space?
> 
> In this case, the kernel driver would map the destination’s chunk of internal RAM into
> the DMA space of the source device. Then the source device could write to that
> mapped address range, and the IOMMU should direct those writes to the
> destination device.
> 
> I would like to take a closer look at the IOMMU implementation on how to achieve
> this, and get back to you. I think the IOMMU would handle this. Could you please
> point me to the IOMMU implementation you have in mind?

I don't know if the current vfio-user client/server patches already
implement device-to-device DMA, but the functionality is supported by
the vfio-user protocol.

Basically: if the DMA regions lookup inside the vfio-user server fails,
fall back to VFIO_USER_DMA_READ/WRITE messages instead.
https://github.com/nutanix/libvfio-user/blob/master/docs/vfio-user.rst#vfio-user-dma-read

Here is the flow:
1. The vfio-user server with device A sends a DMA read to QEMU.
2. QEMU finds the MemoryRegion associated with the DMA address and sees
   it's a device.
   a. If it's emulated inside the QEMU process then the normal
      device emulation code kicks in.
   b. If it's another vfio-user PCI device then the vfio-user PCI proxy
      device forwards the DMA to the second vfio-user server's device B.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH v5 00/18] vfio-user server in QEMU
  2022-01-26  5:04   ` Jag Raman
@ 2022-01-26  9:56     ` Stefan Hajnoczi
  0 siblings, 0 replies; 99+ messages in thread
From: Stefan Hajnoczi @ 2022-01-26  9:56 UTC (permalink / raw)
  To: Jag Raman
  Cc: eduardo, Elena Ufimtseva, John Johnson, berrange, Beraldo Leal,
	john.levon, mst, peter.maydell, armbru, quintela, qemu-devel,
	Philippe Mathieu-Daudé,
	Marc-André Lureau, thanos.makatos, Paolo Bonzini, kwolf,
	eblake, dgilbert

[-- Attachment #1: Type: text/plain, Size: 4363 bytes --]

On Wed, Jan 26, 2022 at 05:04:58AM +0000, Jag Raman wrote:
> 
> 
> > On Jan 25, 2022, at 11:00 AM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > 
> > Hi Jag,
> > Thanks for this latest revision. The biggest outstanding question I have
> > is about the isolated address spaces design.
> 
> Thank you for taking the time to review the patches, Stefan!
> 
> > 
> > This patch series needs a PCIBus with its own Memory Space, I/O Space,
> > and interrupts. That way a single QEMU process can host vfio-user
> > servers that different VMs connect to. They all need isolated address
> > spaces so that mapping a BAR in Device A does not conflict with mapping
> > a BAR in Device B.
> > 
> > The current approach adds special code to hw/pci/pci.c so that custom
> > AddressSpace can be set up. The isolated PCIBus is an automatically
> > created PCIe root port that's a child of the machine's main PCI bus. On
> > one hand it's neat because QEMU's assumption that there is only one
> > root SysBus isn't violated. On the other hand it seems like a special
> > case hack for PCI and I'm not sure in what sense these PCIBusses are
> > really children of the machine's main PCI bus since they don't share or
> > interact in any way.
> 
> We are discussing the automatic creation part you just mentioned in
> the following email:
> [PATCH v5 07/18] vfio-user: set qdev bus callbacks for remote machine
> 
> I agree that automatic creation of a parent bus is not ideal - we could
> specify the parent bus as a separate option in the command-line or
> QMP. This change would avoid modification to hw/pci/pci.c - the new
> PCI bus could be created inplace during device creation/hotplug.
> 
> The following image gives an idea of the bus/device topology in the remote
> machine, as implemented in the current series. Each secondary bus and
> its children have isolated memory and IO spaces.
> https://gitlab.com/jraman/qemu/-/commit/2e2ebf004894075ad8044739b0b16ce875114c4c

Do isolated PCI busses have any relationship with their parent at all? I
think the parent plays a useful role in DMA/IOMMU, interrupts, or PCI
addressing. That leaves me wondering if a parent/child relationship is
an appropriate way to model this.

That said, this approach solves two problems:
1. There must be some parent for the new PCI bus.
2. qdev_find_recursive() and friends must be able to find the PCIDevice
   on the isolated bus.

> > Another approach that came to mind is to allow multiple root SysBusses.
> > Each vfio-user server would need its own SysBus and put a regular PCI
> > host onto that isolated SysBus without modifying hw/pci/pci.c with a
> > special case. The downside to this is that violating the single SysBus
> > assumption probably breaks monitor commands that rely on
> > qdev_find_recursive() and friends. It seems cleaner than adding isolated
> > address spaces to PCI specifically, but also raises the question if
> > multiple machine instances are needed (which would raise even more
> > questions).
> 
> Based on further discussion with Stefan, I got some clarity. We could consider one
> more option as well - somewhere in-between multiple root SysBuses and the topology
> discussed above (with secondary PCI buses). We could implement a
> TYPE_SYS_BUS_DEVICE that creates a root PCI bus with isolated memory ranges.
> Something along the lines in the following diagram:
> https://gitlab.com/jraman/qemu/-/commit/81f6a998278a2a795be0db7acdeb1caa2d6744fb
> 
> An example set of QMP commands to attach PCI devices would be:
> device_add pci-root-bus,id=rb1
> device_add <driver>,id=mydev,bus=rb1
> object-add x-vfio-user-server,device=mydev
> 
> where ‘pci-root-bus’ is a TYPE_SYS_BUS_DEVICE that creates its own root PCI bus.

If it's less code then that's an advantage but it still places unrelated
DMA/interrupt spaces onto the same SysBus and therefore requires
isolation. I think this alternative doesn't fundamentally fix the
design.

If multiple roots are possible then isolation doesn't need to be
implemented explicitly, it comes for free as part of the regular
qdev/qbus hierarchy. The devices would be isolated by the fact that they
live on different roots :).

I have never tried the multi-root approach though, so I'm not sure how
much work it is.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH v5 18/18] vfio-user: avocado tests for vfio-user
  2022-01-26  4:25   ` Philippe Mathieu-Daudé via
@ 2022-01-26 15:12     ` Jag Raman
  0 siblings, 0 replies; 99+ messages in thread
From: Jag Raman @ 2022-01-26 15:12 UTC (permalink / raw)
  To: Philippe Mathieu-Daudé
  Cc: eduardo, Elena Ufimtseva, John Johnson, berrange, bleal,
	john.levon, mst, armbru, quintela, qemu-devel,
	Marc-André Lureau, Stefan Hajnoczi, thanos.makatos,
	Paolo Bonzini, eblake, dgilbert



> On Jan 25, 2022, at 11:25 PM, Philippe Mathieu-Daudé <f4bug@amsat.org> wrote:
> 
> Hi Jagannathan,
> 
> On 19/1/22 22:42, Jagannathan Raman wrote:
>> Avocado tests for libvfio-user in QEMU - tests startup,
>> hotplug and migration of the server object
>> Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
>> Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
>> Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
>> ---
>>  MAINTAINERS                |   1 +
>>  tests/avocado/vfio-user.py | 225 +++++++++++++++++++++++++++++++++++++
>>  2 files changed, 226 insertions(+)
>>  create mode 100644 tests/avocado/vfio-user.py
> 
>> +class VfioUser(QemuSystemTest):
>> +    """
>> +    :avocado: tags=vfiouser
>> +    """
>> +    KERNEL_COMMON_COMMAND_LINE = 'printk.time=0 '
>> +    timeout = 20
>> +
>> +    @staticmethod
>> +    def migration_finished(vm):
>> +        res = vm.command('query-migrate')
>> +        if 'status' in res:
>> +            return res['status'] in ('completed', 'failed')
> 
> Do we need to check for failed migration in do_test_migrate()?

OK, will do.

> 
>> +        else:
>> +            return False
> 
> [...]
> 
>> +    def launch_server_hotplug(self, socket):
>> +        server_vm = self.get_vm()
>> +        server_vm.add_args('-machine', 'x-remote')
>> +        server_vm.add_args('-nodefaults')
>> +        server_vm.add_args('-device', 'lsi53c895a,id=lsi1')
>> +        server_vm.launch()
>> +        server_vm.command('human-monitor-command',
>> +                          command_line='object_add x-vfio-user-server,'
> 
> Why not use qmp('object-add', ...) directly?

OK, will use qmp directly.

Thank you!
--
Jag

> 
>> +                                       'id=vfioobj,socket.type=unix,'
>> +                                       'socket.path='+socket+',device=lsi1')
>> +        return server_vm
> 
> Otherwise LGTM.


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH v5 05/18] qdev: unplug blocker for devices
  2022-01-26  9:32       ` Stefan Hajnoczi
@ 2022-01-26 15:13         ` Jag Raman
  0 siblings, 0 replies; 99+ messages in thread
From: Jag Raman @ 2022-01-26 15:13 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: eduardo, Elena Ufimtseva, John Johnson, berrange, bleal,
	john.levon, mst, armbru, quintela, qemu-devel,
	Philippe Mathieu-Daudé,
	Marc-André Lureau, thanos.makatos, Paolo Bonzini, eblake,
	dgilbert



> On Jan 26, 2022, at 4:32 AM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
> 
> On Tue, Jan 25, 2022 at 02:43:33PM +0000, Jag Raman wrote:
>> 
>> 
>>> On Jan 25, 2022, at 5:27 AM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
>>> 
>>> On Wed, Jan 19, 2022 at 04:41:54PM -0500, Jagannathan Raman wrote:
>>>> Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
>>>> Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
>>>> Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
>>>> ---
>>>> include/hw/qdev-core.h |  5 +++++
>>>> softmmu/qdev-monitor.c | 35 +++++++++++++++++++++++++++++++++++
>>>> 2 files changed, 40 insertions(+)
>>>> 
>>>> diff --git a/include/hw/qdev-core.h b/include/hw/qdev-core.h
>>>> index eed2983072..67df5e0081 100644
>>>> --- a/include/hw/qdev-core.h
>>>> +++ b/include/hw/qdev-core.h
>>>> @@ -193,6 +193,7 @@ struct DeviceState {
>>>>    int instance_id_alias;
>>>>    int alias_required_for_version;
>>>>    ResettableState reset;
>>>> +    GSList *unplug_blockers;
>>>> };
>>>> 
>>>> struct DeviceListener {
>>>> @@ -433,6 +434,10 @@ typedef bool (QDevPutBusFunc)(BusState *bus, Error **errp);
>>>> bool qdev_set_bus_cbs(QDevGetBusFunc *get_bus, QDevPutBusFunc *put_bus,
>>>>                      Error **errp);
>>>> 
>>>> +int qdev_add_unplug_blocker(DeviceState *dev, Error *reason, Error **errp);
>>>> +void qdev_del_unplug_blocker(DeviceState *dev, Error *reason);
>>>> +bool qdev_unplug_blocked(DeviceState *dev, Error **errp);
>>>> +
>>>> /**
>>>> * GpioPolarity: Polarity of a GPIO line
>>>> *
>>>> diff --git a/softmmu/qdev-monitor.c b/softmmu/qdev-monitor.c
>>>> index 7306074019..1a169f89a2 100644
>>>> --- a/softmmu/qdev-monitor.c
>>>> +++ b/softmmu/qdev-monitor.c
>>>> @@ -978,10 +978,45 @@ void qmp_device_del(const char *id, Error **errp)
>>>>            return;
>>>>        }
>>>> 
>>>> +        if (qdev_unplug_blocked(dev, errp)) {
>>>> +            return;
>>>> +        }
>>>> +
>>>>        qdev_unplug(dev, errp);
>>>>    }
>>>> }
>>>> 
>>>> +int qdev_add_unplug_blocker(DeviceState *dev, Error *reason, Error **errp)
>>>> +{
>>>> +    ERRP_GUARD();
>>>> +
>>>> +    if (!migration_is_idle()) {
>>>> +        error_setg(errp, "migration is in progress");
>>>> +        return -EBUSY;
>>>> +    }
>>> 
>>> Why can this function not be called during migration?
>> 
>> Since ‘unplug_blockers' is a member of the device, I thought it wouldn’t be correct to
>> allow changes to the device's state during migration.
>> 
>> I did weigh the following reasons against adding this check:
>>  - unplug_blockers is not migrated to the destination anyway, so it doesn’t matter if
>>    it changes after migration starts
> 
> Yes.
> 
>>  - whichever code/object that needs to add the blocker could add it at the destination
>>    if needed
> 
> Yes.
> 
>> However, unlike qmp_device_add(), qmp_object_add() doesn’t reject during
>> migration. As such, an object could add a blocker for the device when migration is
>> in progress.
>> 
>> Would you prefer to throw a warning, or fully remove this test?
> 
> Adding an unplug blocker during migration seems safe to me. I would
> remove this test.

OK, will do.

Thank you!
--
Jag

> 
> Stefan


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH v5 06/18] vfio-user: add HotplugHandler for remote machine
  2022-01-26  9:35       ` Stefan Hajnoczi
@ 2022-01-26 15:20         ` Jag Raman
  2022-01-26 15:43           ` Stefan Hajnoczi
  0 siblings, 1 reply; 99+ messages in thread
From: Jag Raman @ 2022-01-26 15:20 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: eduardo, Elena Ufimtseva, John Johnson, berrange, bleal,
	john.levon, mst, armbru, quintela, qemu-devel,
	Philippe Mathieu-Daudé,
	Marc-André Lureau, thanos.makatos, Paolo Bonzini, eblake,
	dgilbert



> On Jan 26, 2022, at 4:35 AM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
> 
> On Tue, Jan 25, 2022 at 06:12:48PM +0000, Jag Raman wrote:
>> 
>> 
>>> On Jan 25, 2022, at 5:32 AM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
>>> 
>>> On Wed, Jan 19, 2022 at 04:41:55PM -0500, Jagannathan Raman wrote:
>>>> Allow hotplugging of PCI(e) devices to remote machine
>>>> 
>>>> Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
>>>> Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
>>>> Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
>>>> ---
>>>> hw/remote/machine.c | 29 +++++++++++++++++++++++++++++
>>>> 1 file changed, 29 insertions(+)
>>> 
>>> Why is this code necessary? I expected the default hotplug behavior to
>> 
>> I just discovered that TYPE_REMOTE_MACHINE wasn't setting up a hotplug
>> handler for the root PCI bus.
>> 
>> Looks like, some of the machines don’t support hotplugging PCI devices. I see
>> that the ‘pc’ machine does support hotplug, whereas ‘q35’ does not.
> 
> Hotplug is definitely possible with q35. I'm not familiar with the
> hotplug code though so I don't know how exactly that works for q35.

I was referring to the root PCI bus, other buses in Q35 probably support
hotplug. Please see error message below:

QEMU 6.2.50 monitor - type 'help' for more information
(qemu) device_add lsi53c895a,id=lsi2
Error: Bus 'pcie.0' does not support hotplugging

--
Jag

> 
>> We didn’t check hotplug in multiprocess-qemu previously because it was limited
>> to one device per process, and the use cases attached the devices via
>> command line.
>> 
>>> pretty much handle this case - hotplugging device types that the bus
>>> doesn't support should fail and unplug should already unparent/unrealize
>>> the device.
>> 
>> OK, that makes sense. We don’t need to test the device type during
>> plug and unplug.
>> 
>> Therefore, I don’t think we need a callback for the plug operation. We
>> could set HotplugHandlerClass->unplug callback to the default
>> qdev_simple_device_unplug_cb() callback.
> 
> Great!
> 
> Stefan


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH v5 06/18] vfio-user: add HotplugHandler for remote machine
  2022-01-26 15:20         ` Jag Raman
@ 2022-01-26 15:43           ` Stefan Hajnoczi
  0 siblings, 0 replies; 99+ messages in thread
From: Stefan Hajnoczi @ 2022-01-26 15:43 UTC (permalink / raw)
  To: Jag Raman
  Cc: eduardo, Elena Ufimtseva, John Johnson, berrange, bleal,
	john.levon, mst, armbru, quintela, qemu-devel,
	Philippe Mathieu-Daudé,
	Marc-André Lureau, thanos.makatos, Paolo Bonzini, eblake,
	dgilbert

[-- Attachment #1: Type: text/plain, Size: 1653 bytes --]

On Wed, Jan 26, 2022 at 03:20:35PM +0000, Jag Raman wrote:
> 
> 
> > On Jan 26, 2022, at 4:35 AM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > 
> > On Tue, Jan 25, 2022 at 06:12:48PM +0000, Jag Raman wrote:
> >> 
> >> 
> >>> On Jan 25, 2022, at 5:32 AM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
> >>> 
> >>> On Wed, Jan 19, 2022 at 04:41:55PM -0500, Jagannathan Raman wrote:
> >>>> Allow hotplugging of PCI(e) devices to remote machine
> >>>> 
> >>>> Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
> >>>> Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
> >>>> Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
> >>>> ---
> >>>> hw/remote/machine.c | 29 +++++++++++++++++++++++++++++
> >>>> 1 file changed, 29 insertions(+)
> >>> 
> >>> Why is this code necessary? I expected the default hotplug behavior to
> >> 
> >> I just discovered that TYPE_REMOTE_MACHINE wasn't setting up a hotplug
> >> handler for the root PCI bus.
> >> 
> >> Looks like, some of the machines don’t support hotplugging PCI devices. I see
> >> that the ‘pc’ machine does support hotplug, whereas ‘q35’ does not.
> > 
> > Hotplug is definitely possible with q35. I'm not familiar with the
> > hotplug code though so I don't know how exactly that works for q35.
> 
> I was referring to the root PCI bus, other buses in Q35 probably support
> hotplug. Please see error message below:
> 
> QEMU 6.2.50 monitor - type 'help' for more information
> (qemu) device_add lsi53c895a,id=lsi2
> Error: Bus 'pcie.0' does not support hotplugging

Yes, I think that's because it's PCIe and not PCI.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH v5 07/18] vfio-user: set qdev bus callbacks for remote machine
  2022-01-26  9:37       ` Stefan Hajnoczi
@ 2022-01-26 15:51         ` Jag Raman
  0 siblings, 0 replies; 99+ messages in thread
From: Jag Raman @ 2022-01-26 15:51 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: eduardo, Elena Ufimtseva, John Johnson, berrange, Beraldo Leal,
	john.levon, mst, armbru, quintela, qemu-devel,
	Philippe Mathieu-Daudé,
	Marc-André Lureau, thanos.makatos, Paolo Bonzini, eblake,
	dgilbert



> On Jan 26, 2022, at 4:37 AM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
> 
> On Tue, Jan 25, 2022 at 09:12:28PM +0000, Jag Raman wrote:
>> 
>> 
>>> On Jan 25, 2022, at 5:44 AM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
>>> 
>>> On Wed, Jan 19, 2022 at 04:41:56PM -0500, Jagannathan Raman wrote:
>>>> Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
>>>> Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
>>>> Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
>>>> ---
>>>> hw/remote/machine.c | 57 +++++++++++++++++++++++++++++++++++++++++++++
>>>> 1 file changed, 57 insertions(+)
>>>> 
>>>> diff --git a/hw/remote/machine.c b/hw/remote/machine.c
>>>> index 220ff01aa9..221a8430c1 100644
>>>> --- a/hw/remote/machine.c
>>>> +++ b/hw/remote/machine.c
>>>> @@ -22,6 +22,60 @@
>>>> #include "hw/pci/pci_host.h"
>>>> #include "hw/remote/iohub.h"
>>>> 
>>>> +static bool remote_machine_get_bus(const char *type, BusState **bus,
>>>> +                                   Error **errp)
>>>> +{
>>>> +    ERRP_GUARD();
>>>> +    RemoteMachineState *s = REMOTE_MACHINE(current_machine);
>>>> +    BusState *root_bus = NULL;
>>>> +    PCIBus *new_pci_bus = NULL;
>>>> +
>>>> +    if (!bus) {
>>>> +        error_setg(errp, "Invalid argument");
>>>> +        return false;
>>>> +    }
>>>> +
>>>> +    if (strcmp(type, TYPE_PCI_BUS) && strcmp(type, TYPE_PCI_BUS)) {
>>>> +        return true;
>>>> +    }
>>>> +
>>>> +    root_bus = qbus_find_recursive(sysbus_get_default(), NULL, TYPE_PCIE_BUS);
>>>> +    if (!root_bus) {
>>>> +        error_setg(errp, "Unable to find root PCI device");
>>>> +        return false;
>>>> +    }
>>>> +
>>>> +    new_pci_bus = pci_isol_bus_new(root_bus, type, errp);
>>>> +    if (!new_pci_bus) {
>>>> +        return false;
>>>> +    }
>>>> +
>>>> +    *bus = BUS(new_pci_bus);
>>>> +
>>>> +    pci_bus_irqs(new_pci_bus, remote_iohub_set_irq, remote_iohub_map_irq,
>>>> +                 &s->iohub, REMOTE_IOHUB_NB_PIRQS);
>>>> +
>>>> +    return true;
>>>> +}
>>> 
>>> Can the user create the same PCI bus via QMP commands? If so, then this
>> 
>> I think there is a way we could achieve it.
>> 
>> When I looked around, both the command line and the QMP didn’t have a direct
>> way to create a bus. However, there are some indirect ways. For example, the
>> TYPE_LSI53C895A device creates a SCSI bus to attach SCSI devices. Similarly,
>> there are some special PCI devices like TYPE_PCI_BRIDGE which create a
>> secondary PCI bus.
>> 
>> Similarly, we could implement a PCI device that creates a PCI bus with
>> isolated address spaces.
> 
> Exactly. device_add instantiates DeviceStates, not busses, so there
> needs to be a parent device like a SCSI controller, a PCI bridge, etc
> that owns and creates the bus.
> 
>>> is just a convenience that saves the extra step. Or is there some magic
>>> that cannot be done via QMP device_add?
>>> 
>>> I'm asking because there are 3 objects involved and I'd like to
>>> understand the lifecycle/dependencies:
>>> 1. The PCIDevice we wish to export.
>>> 2. The PCIBus with isolated address spaces that contains the PCIDevice.
>>> 3. The vfio-user server that exports a given PCIDevice.
>>> 
>>> Users can already create the PCIDevice via hotplug and the vfio-user
>>> server via object-add. So if there's no magic they could also create the
>>> PCI bus:
>>> 1. device_add ...some PCI bus stuff here...,id=isol-pci-bus0
>>> 2. device_add ...the PCIDevice...,bus=isol-pci-bus0,id=mydev
>>> 3. object-add x-vfio-user-server,device=mydev
>> 
>> We are able to do 2 & 3 already. We could introduce a PCI device that
>> creates an isolated PCI bus. That would cover step 1 outlined above.
> 
> I wonder if a new device is needed or whether it's possible to add an
> isol_as=on|off (default: off) option to an existing PCI bridge/expander?
> Hopefully most of the code is already there.

OK, it makes sense to add isol_as=on|off (default: off) option to an existing
PCI bridge/expander. Will shortly confirm with you the device that seems
interesting.

--
Jag

> 
> Stefan


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH v5 03/18] pci: isolated address space for PCI bus
  2022-01-26  5:27         ` Jag Raman
  2022-01-26  9:45           ` Stefan Hajnoczi
@ 2022-01-26 18:13           ` Dr. David Alan Gilbert
  2022-01-27 17:43             ` Jag Raman
  1 sibling, 1 reply; 99+ messages in thread
From: Dr. David Alan Gilbert @ 2022-01-26 18:13 UTC (permalink / raw)
  To: Jag Raman
  Cc: eduardo, Elena Ufimtseva, John Johnson, Daniel P. Berrangé,
	Beraldo Leal, john.levon, Michael S. Tsirkin, armbru, quintela,
	Philippe Mathieu-Daudé,
	qemu-devel, Marc-André Lureau, Stefan Hajnoczi,
	thanos.makatos, Paolo Bonzini, Eric Blake

* Jag Raman (jag.raman@oracle.com) wrote:
> 
> 
> > On Jan 25, 2022, at 1:38 PM, Dr. David Alan Gilbert <dgilbert@redhat.com> wrote:
> > 
> > * Jag Raman (jag.raman@oracle.com) wrote:
> >> 
> >> 
> >>> On Jan 19, 2022, at 7:12 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> >>> 
> >>> On Wed, Jan 19, 2022 at 04:41:52PM -0500, Jagannathan Raman wrote:
> >>>> Allow PCI buses to be part of isolated CPU address spaces. This has a
> >>>> niche usage.
> >>>> 
> >>>> TYPE_REMOTE_MACHINE allows multiple VMs to house their PCI devices in
> >>>> the same machine/server. This would cause address space collision as
> >>>> well as be a security vulnerability. Having separate address spaces for
> >>>> each PCI bus would solve this problem.
> >>> 
> >>> Fascinating, but I am not sure I understand. any examples?
> >> 
> >> Hi Michael!
> >> 
> >> multiprocess QEMU and vfio-user implement a client-server model to allow
> >> out-of-process emulation of devices. The client QEMU, which makes ioctls
> >> to the kernel and runs VCPUs, could attach devices running in a server
> >> QEMU. The server QEMU needs access to parts of the client’s RAM to
> >> perform DMA.
> > 
> > Do you ever have the opposite problem? i.e. when an emulated PCI device
> 
> That’s an interesting question.
> 
> > exposes a chunk of RAM-like space (frame buffer, or maybe a mapped file)
> > that the client can see.  What happens if two emulated devices need to
> > access each others emulated address space?
> 
> In this case, the kernel driver would map the destination’s chunk of internal RAM into
> the DMA space of the source device. Then the source device could write to that
> mapped address range, and the IOMMU should direct those writes to the
> destination device.

Are all devices mappable like that?

> I would like to take a closer look at the IOMMU implementation on how to achieve
> this, and get back to you. I think the IOMMU would handle this. Could you please
> point me to the IOMMU implementation you have in mind?

I didn't have one in mind; I was just hitting a similar problem on
Virtiofs DAX.

Dave

> Thank you!
> --
> Jag
> 
> > 
> > Dave
> > 
> >> In the case where multiple clients attach devices that are running on the
> >> same server, we need to ensure that each devices has isolated memory
> >> ranges. This ensures that the memory space of one device is not visible
> >> to other devices in the same server.
> >> 
> >>> 
> >>> I also wonder whether this special type could be modelled like a special
> >>> kind of iommu internally.
> >> 
> >> Could you please provide some more details on the design?
> >> 
> >>> 
> >>>> Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
> >>>> Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
> >>>> Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
> >>>> ---
> >>>> include/hw/pci/pci.h     |  2 ++
> >>>> include/hw/pci/pci_bus.h | 17 +++++++++++++++++
> >>>> hw/pci/pci.c             | 17 +++++++++++++++++
> >>>> hw/pci/pci_bridge.c      |  5 +++++
> >>>> 4 files changed, 41 insertions(+)
> >>>> 
> >>>> diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
> >>>> index 023abc0f79..9bb4472abc 100644
> >>>> --- a/include/hw/pci/pci.h
> >>>> +++ b/include/hw/pci/pci.h
> >>>> @@ -387,6 +387,8 @@ void pci_device_save(PCIDevice *s, QEMUFile *f);
> >>>> int pci_device_load(PCIDevice *s, QEMUFile *f);
> >>>> MemoryRegion *pci_address_space(PCIDevice *dev);
> >>>> MemoryRegion *pci_address_space_io(PCIDevice *dev);
> >>>> +AddressSpace *pci_isol_as_mem(PCIDevice *dev);
> >>>> +AddressSpace *pci_isol_as_io(PCIDevice *dev);
> >>>> 
> >>>> /*
> >>>> * Should not normally be used by devices. For use by sPAPR target
> >>>> diff --git a/include/hw/pci/pci_bus.h b/include/hw/pci/pci_bus.h
> >>>> index 347440d42c..d78258e79e 100644
> >>>> --- a/include/hw/pci/pci_bus.h
> >>>> +++ b/include/hw/pci/pci_bus.h
> >>>> @@ -39,9 +39,26 @@ struct PCIBus {
> >>>>    void *irq_opaque;
> >>>>    PCIDevice *devices[PCI_SLOT_MAX * PCI_FUNC_MAX];
> >>>>    PCIDevice *parent_dev;
> >>>> +
> >>>>    MemoryRegion *address_space_mem;
> >>>>    MemoryRegion *address_space_io;
> >>>> 
> >>>> +    /**
> >>>> +     * Isolated address spaces - these allow the PCI bus to be part
> >>>> +     * of an isolated address space as opposed to the global
> >>>> +     * address_space_memory & address_space_io.
> >>> 
> >>> Are you sure address_space_memory & address_space_io are
> >>> always global? even in the case of an iommu?
> >> 
> >> On the CPU side of the Root Complex, I believe address_space_memory
> >> & address_space_io are global.
> >> 
> >> In the vfio-user case, devices on the same machine (TYPE_REMOTE_MACHINE)
> >> could be attached to different clients VMs. Each client would have their own address
> >> space for their CPUs. With isolated address spaces, we ensure that the devices
> >> see the address space of the CPUs they’re attached to.
> >> 
> >> Not sure if it’s OK to share weblinks in this mailing list, please let me know if that’s
> >> not preferred. But I’m referring to the terminology used in the following block diagram:
> >> https://en.wikipedia.org/wiki/Root_complex#/media/File:Example_PCI_Express_Topology.svg
> >> 
> >>> 
> >>>> This allows the
> >>>> +     * bus to be attached to CPUs from different machines. The
> >>>> +     * following is not used used commonly.
> >>>> +     *
> >>>> +     * TYPE_REMOTE_MACHINE allows emulating devices from multiple
> >>>> +     * VM clients,
> >>> 
> >>> what are VM clients?
> >> 
> >> It’s the client in the client - server model explained above.
> >> 
> >> Thank you!
> >> --
> >> Jag
> >> 
> >>> 
> >>>> as such it needs the PCI buses in the same machine
> >>>> +     * to be part of different CPU address spaces. The following is
> >>>> +     * useful in that scenario.
> >>>> +     *
> >>>> +     */
> >>>> +    AddressSpace *isol_as_mem;
> >>>> +    AddressSpace *isol_as_io;
> >>>> +
> >>>>    QLIST_HEAD(, PCIBus) child; /* this will be replaced by qdev later */
> >>>>    QLIST_ENTRY(PCIBus) sibling;/* this will be replaced by qdev later */
> >>>> 
> >>>> diff --git a/hw/pci/pci.c b/hw/pci/pci.c
> >>>> index 5d30f9ca60..d5f1c6c421 100644
> >>>> --- a/hw/pci/pci.c
> >>>> +++ b/hw/pci/pci.c
> >>>> @@ -442,6 +442,8 @@ static void pci_root_bus_internal_init(PCIBus *bus, DeviceState *parent,
> >>>>    bus->slot_reserved_mask = 0x0;
> >>>>    bus->address_space_mem = address_space_mem;
> >>>>    bus->address_space_io = address_space_io;
> >>>> +    bus->isol_as_mem = NULL;
> >>>> +    bus->isol_as_io = NULL;
> >>>>    bus->flags |= PCI_BUS_IS_ROOT;
> >>>> 
> >>>>    /* host bridge */
> >>>> @@ -2676,6 +2678,16 @@ MemoryRegion *pci_address_space_io(PCIDevice *dev)
> >>>>    return pci_get_bus(dev)->address_space_io;
> >>>> }
> >>>> 
> >>>> +AddressSpace *pci_isol_as_mem(PCIDevice *dev)
> >>>> +{
> >>>> +    return pci_get_bus(dev)->isol_as_mem;
> >>>> +}
> >>>> +
> >>>> +AddressSpace *pci_isol_as_io(PCIDevice *dev)
> >>>> +{
> >>>> +    return pci_get_bus(dev)->isol_as_io;
> >>>> +}
> >>>> +
> >>>> static void pci_device_class_init(ObjectClass *klass, void *data)
> >>>> {
> >>>>    DeviceClass *k = DEVICE_CLASS(klass);
> >>>> @@ -2699,6 +2711,7 @@ static void pci_device_class_base_init(ObjectClass *klass, void *data)
> >>>> 
> >>>> AddressSpace *pci_device_iommu_address_space(PCIDevice *dev)
> >>>> {
> >>>> +    AddressSpace *iommu_as = NULL;
> >>>>    PCIBus *bus = pci_get_bus(dev);
> >>>>    PCIBus *iommu_bus = bus;
> >>>>    uint8_t devfn = dev->devfn;
> >>>> @@ -2745,6 +2758,10 @@ AddressSpace *pci_device_iommu_address_space(PCIDevice *dev)
> >>>>    if (!pci_bus_bypass_iommu(bus) && iommu_bus && iommu_bus->iommu_fn) {
> >>>>        return iommu_bus->iommu_fn(bus, iommu_bus->iommu_opaque, devfn);
> >>>>    }
> >>>> +    iommu_as = pci_isol_as_mem(dev);
> >>>> +    if (iommu_as) {
> >>>> +        return iommu_as;
> >>>> +    }
> >>>>    return &address_space_memory;
> >>>> }
> >>>> 
> >>>> diff --git a/hw/pci/pci_bridge.c b/hw/pci/pci_bridge.c
> >>>> index da34c8ebcd..98366768d2 100644
> >>>> --- a/hw/pci/pci_bridge.c
> >>>> +++ b/hw/pci/pci_bridge.c
> >>>> @@ -383,6 +383,11 @@ void pci_bridge_initfn(PCIDevice *dev, const char *typename)
> >>>>    sec_bus->address_space_io = &br->address_space_io;
> >>>>    memory_region_init(&br->address_space_io, OBJECT(br), "pci_bridge_io",
> >>>>                       4 * GiB);
> >>>> +
> >>>> +    /* This PCI bridge puts the sec_bus in its parent's address space */
> >>>> +    sec_bus->isol_as_mem = pci_isol_as_mem(dev);
> >>>> +    sec_bus->isol_as_io = pci_isol_as_io(dev);
> >>>> +
> >>>>    br->windows = pci_bridge_region_init(br);
> >>>>    QLIST_INIT(&sec_bus->child);
> >>>>    QLIST_INSERT_HEAD(&parent->child, sec_bus, sibling);
> >>>> -- 
> >>>> 2.20.1
> >> 
> > -- 
> > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> > 
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH v5 03/18] pci: isolated address space for PCI bus
  2022-01-26  9:45           ` Stefan Hajnoczi
@ 2022-01-26 20:07             ` Dr. David Alan Gilbert
  2022-01-26 21:13               ` Michael S. Tsirkin
  0 siblings, 1 reply; 99+ messages in thread
From: Dr. David Alan Gilbert @ 2022-01-26 20:07 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: eduardo, Elena Ufimtseva, John Johnson, Jag Raman, Beraldo Leal,
	Michael S. Tsirkin, armbru, quintela, Philippe Mathieu-Daudé,
	qemu-devel, Marc-André Lureau, Daniel P. Berrangé,
	thanos.makatos, Paolo Bonzini, Eric Blake, john.levon

* Stefan Hajnoczi (stefanha@redhat.com) wrote:
> On Wed, Jan 26, 2022 at 05:27:32AM +0000, Jag Raman wrote:
> > 
> > 
> > > On Jan 25, 2022, at 1:38 PM, Dr. David Alan Gilbert <dgilbert@redhat.com> wrote:
> > > 
> > > * Jag Raman (jag.raman@oracle.com) wrote:
> > >> 
> > >> 
> > >>> On Jan 19, 2022, at 7:12 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> > >>> 
> > >>> On Wed, Jan 19, 2022 at 04:41:52PM -0500, Jagannathan Raman wrote:
> > >>>> Allow PCI buses to be part of isolated CPU address spaces. This has a
> > >>>> niche usage.
> > >>>> 
> > >>>> TYPE_REMOTE_MACHINE allows multiple VMs to house their PCI devices in
> > >>>> the same machine/server. This would cause address space collision as
> > >>>> well as be a security vulnerability. Having separate address spaces for
> > >>>> each PCI bus would solve this problem.
> > >>> 
> > >>> Fascinating, but I am not sure I understand. any examples?
> > >> 
> > >> Hi Michael!
> > >> 
> > >> multiprocess QEMU and vfio-user implement a client-server model to allow
> > >> out-of-process emulation of devices. The client QEMU, which makes ioctls
> > >> to the kernel and runs VCPUs, could attach devices running in a server
> > >> QEMU. The server QEMU needs access to parts of the client’s RAM to
> > >> perform DMA.
> > > 
> > > Do you ever have the opposite problem? i.e. when an emulated PCI device
> > 
> > That’s an interesting question.
> > 
> > > exposes a chunk of RAM-like space (frame buffer, or maybe a mapped file)
> > > that the client can see.  What happens if two emulated devices need to
> > > access each others emulated address space?
> > 
> > In this case, the kernel driver would map the destination’s chunk of internal RAM into
> > the DMA space of the source device. Then the source device could write to that
> > mapped address range, and the IOMMU should direct those writes to the
> > destination device.
> > 
> > I would like to take a closer look at the IOMMU implementation on how to achieve
> > this, and get back to you. I think the IOMMU would handle this. Could you please
> > point me to the IOMMU implementation you have in mind?
> 
> I don't know if the current vfio-user client/server patches already
> implement device-to-device DMA, but the functionality is supported by
> the vfio-user protocol.
> 
> Basically: if the DMA regions lookup inside the vfio-user server fails,
> fall back to VFIO_USER_DMA_READ/WRITE messages instead.
> https://github.com/nutanix/libvfio-user/blob/master/docs/vfio-user.rst#vfio-user-dma-read
> 
> Here is the flow:
> 1. The vfio-user server with device A sends a DMA read to QEMU.
> 2. QEMU finds the MemoryRegion associated with the DMA address and sees
>    it's a device.
>    a. If it's emulated inside the QEMU process then the normal
>       device emulation code kicks in.
>    b. If it's another vfio-user PCI device then the vfio-user PCI proxy
>       device forwards the DMA to the second vfio-user server's device B.

I'm starting to be curious if there's a way to persuade the guest kernel
to do it for us; in general is there a way to say to PCI devices that
they can only DMA to the host and not other PCI devices?  Or that the
address space of a given PCIe bus is non-overlapping with one of the
others?

Dave

> Stefan


-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH v5 03/18] pci: isolated address space for PCI bus
  2022-01-26 20:07             ` Dr. David Alan Gilbert
@ 2022-01-26 21:13               ` Michael S. Tsirkin
  2022-01-27  8:30                 ` Stefan Hajnoczi
  0 siblings, 1 reply; 99+ messages in thread
From: Michael S. Tsirkin @ 2022-01-26 21:13 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: eduardo, Elena Ufimtseva, John Johnson, Jag Raman, Beraldo Leal,
	quintela, armbru, john.levon, qemu-devel,
	Philippe Mathieu-Daudé,
	Marc-André Lureau, Daniel P. Berrangé,
	Stefan Hajnoczi, thanos.makatos, Paolo Bonzini, Eric Blake

On Wed, Jan 26, 2022 at 08:07:36PM +0000, Dr. David Alan Gilbert wrote:
> * Stefan Hajnoczi (stefanha@redhat.com) wrote:
> > On Wed, Jan 26, 2022 at 05:27:32AM +0000, Jag Raman wrote:
> > > 
> > > 
> > > > On Jan 25, 2022, at 1:38 PM, Dr. David Alan Gilbert <dgilbert@redhat.com> wrote:
> > > > 
> > > > * Jag Raman (jag.raman@oracle.com) wrote:
> > > >> 
> > > >> 
> > > >>> On Jan 19, 2022, at 7:12 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> > > >>> 
> > > >>> On Wed, Jan 19, 2022 at 04:41:52PM -0500, Jagannathan Raman wrote:
> > > >>>> Allow PCI buses to be part of isolated CPU address spaces. This has a
> > > >>>> niche usage.
> > > >>>> 
> > > >>>> TYPE_REMOTE_MACHINE allows multiple VMs to house their PCI devices in
> > > >>>> the same machine/server. This would cause address space collision as
> > > >>>> well as be a security vulnerability. Having separate address spaces for
> > > >>>> each PCI bus would solve this problem.
> > > >>> 
> > > >>> Fascinating, but I am not sure I understand. any examples?
> > > >> 
> > > >> Hi Michael!
> > > >> 
> > > >> multiprocess QEMU and vfio-user implement a client-server model to allow
> > > >> out-of-process emulation of devices. The client QEMU, which makes ioctls
> > > >> to the kernel and runs VCPUs, could attach devices running in a server
> > > >> QEMU. The server QEMU needs access to parts of the client’s RAM to
> > > >> perform DMA.
> > > > 
> > > > Do you ever have the opposite problem? i.e. when an emulated PCI device
> > > 
> > > That’s an interesting question.
> > > 
> > > > exposes a chunk of RAM-like space (frame buffer, or maybe a mapped file)
> > > > that the client can see.  What happens if two emulated devices need to
> > > > access each others emulated address space?
> > > 
> > > In this case, the kernel driver would map the destination’s chunk of internal RAM into
> > > the DMA space of the source device. Then the source device could write to that
> > > mapped address range, and the IOMMU should direct those writes to the
> > > destination device.
> > > 
> > > I would like to take a closer look at the IOMMU implementation on how to achieve
> > > this, and get back to you. I think the IOMMU would handle this. Could you please
> > > point me to the IOMMU implementation you have in mind?
> > 
> > I don't know if the current vfio-user client/server patches already
> > implement device-to-device DMA, but the functionality is supported by
> > the vfio-user protocol.
> > 
> > Basically: if the DMA regions lookup inside the vfio-user server fails,
> > fall back to VFIO_USER_DMA_READ/WRITE messages instead.
> > https://github.com/nutanix/libvfio-user/blob/master/docs/vfio-user.rst#vfio-user-dma-read
> > 
> > Here is the flow:
> > 1. The vfio-user server with device A sends a DMA read to QEMU.
> > 2. QEMU finds the MemoryRegion associated with the DMA address and sees
> >    it's a device.
> >    a. If it's emulated inside the QEMU process then the normal
> >       device emulation code kicks in.
> >    b. If it's another vfio-user PCI device then the vfio-user PCI proxy
> >       device forwards the DMA to the second vfio-user server's device B.
> 
> I'm starting to be curious if there's a way to persuade the guest kernel
> to do it for us; in general is there a way to say to PCI devices that
> they can only DMA to the host and not other PCI devices?


But of course - this is how e.g. VFIO protects host PCI devices from
each other when one of them is passed through to a VM.

>  Or that the
> address space of a given PCIe bus is non-overlapping with one of the
> others?



> Dave
> > Stefan
> 
> 
> -- 
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH v5 03/18] pci: isolated address space for PCI bus
  2022-01-26 21:13               ` Michael S. Tsirkin
@ 2022-01-27  8:30                 ` Stefan Hajnoczi
  2022-01-27 12:50                   ` Michael S. Tsirkin
  2022-01-27 21:22                   ` Alex Williamson
  0 siblings, 2 replies; 99+ messages in thread
From: Stefan Hajnoczi @ 2022-01-27  8:30 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: eduardo, Elena Ufimtseva, John Johnson, Jag Raman, Beraldo Leal,
	quintela, qemu-devel, armbru, john.levon,
	Philippe Mathieu-Daudé,
	Dr. David Alan Gilbert, Marc-André Lureau,
	Daniel P. Berrangé,
	thanos.makatos, Paolo Bonzini, Eric Blake

[-- Attachment #1: Type: text/plain, Size: 4043 bytes --]

On Wed, Jan 26, 2022 at 04:13:33PM -0500, Michael S. Tsirkin wrote:
> On Wed, Jan 26, 2022 at 08:07:36PM +0000, Dr. David Alan Gilbert wrote:
> > * Stefan Hajnoczi (stefanha@redhat.com) wrote:
> > > On Wed, Jan 26, 2022 at 05:27:32AM +0000, Jag Raman wrote:
> > > > 
> > > > 
> > > > > On Jan 25, 2022, at 1:38 PM, Dr. David Alan Gilbert <dgilbert@redhat.com> wrote:
> > > > > 
> > > > > * Jag Raman (jag.raman@oracle.com) wrote:
> > > > >> 
> > > > >> 
> > > > >>> On Jan 19, 2022, at 7:12 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> > > > >>> 
> > > > >>> On Wed, Jan 19, 2022 at 04:41:52PM -0500, Jagannathan Raman wrote:
> > > > >>>> Allow PCI buses to be part of isolated CPU address spaces. This has a
> > > > >>>> niche usage.
> > > > >>>> 
> > > > >>>> TYPE_REMOTE_MACHINE allows multiple VMs to house their PCI devices in
> > > > >>>> the same machine/server. This would cause address space collision as
> > > > >>>> well as be a security vulnerability. Having separate address spaces for
> > > > >>>> each PCI bus would solve this problem.
> > > > >>> 
> > > > >>> Fascinating, but I am not sure I understand. any examples?
> > > > >> 
> > > > >> Hi Michael!
> > > > >> 
> > > > >> multiprocess QEMU and vfio-user implement a client-server model to allow
> > > > >> out-of-process emulation of devices. The client QEMU, which makes ioctls
> > > > >> to the kernel and runs VCPUs, could attach devices running in a server
> > > > >> QEMU. The server QEMU needs access to parts of the client’s RAM to
> > > > >> perform DMA.
> > > > > 
> > > > > Do you ever have the opposite problem? i.e. when an emulated PCI device
> > > > 
> > > > That’s an interesting question.
> > > > 
> > > > > exposes a chunk of RAM-like space (frame buffer, or maybe a mapped file)
> > > > > that the client can see.  What happens if two emulated devices need to
> > > > > access each others emulated address space?
> > > > 
> > > > In this case, the kernel driver would map the destination’s chunk of internal RAM into
> > > > the DMA space of the source device. Then the source device could write to that
> > > > mapped address range, and the IOMMU should direct those writes to the
> > > > destination device.
> > > > 
> > > > I would like to take a closer look at the IOMMU implementation on how to achieve
> > > > this, and get back to you. I think the IOMMU would handle this. Could you please
> > > > point me to the IOMMU implementation you have in mind?
> > > 
> > > I don't know if the current vfio-user client/server patches already
> > > implement device-to-device DMA, but the functionality is supported by
> > > the vfio-user protocol.
> > > 
> > > Basically: if the DMA regions lookup inside the vfio-user server fails,
> > > fall back to VFIO_USER_DMA_READ/WRITE messages instead.
> > > https://github.com/nutanix/libvfio-user/blob/master/docs/vfio-user.rst#vfio-user-dma-read
> > > 
> > > Here is the flow:
> > > 1. The vfio-user server with device A sends a DMA read to QEMU.
> > > 2. QEMU finds the MemoryRegion associated with the DMA address and sees
> > >    it's a device.
> > >    a. If it's emulated inside the QEMU process then the normal
> > >       device emulation code kicks in.
> > >    b. If it's another vfio-user PCI device then the vfio-user PCI proxy
> > >       device forwards the DMA to the second vfio-user server's device B.
> > 
> > I'm starting to be curious if there's a way to persuade the guest kernel
> > to do it for us; in general is there a way to say to PCI devices that
> > they can only DMA to the host and not other PCI devices?
> 
> 
> But of course - this is how e.g. VFIO protects host PCI devices from
> each other when one of them is passed through to a VM.

Michael: Are you saying just turn on vIOMMU? :)

Devices in different VFIO groups have their own IOMMU context, so their
IOVA space is isolated. Just don't map other devices into the IOVA space
and those other devices will be inaccessible.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH v5 03/18] pci: isolated address space for PCI bus
  2022-01-27  8:30                 ` Stefan Hajnoczi
@ 2022-01-27 12:50                   ` Michael S. Tsirkin
  2022-01-27 21:22                   ` Alex Williamson
  1 sibling, 0 replies; 99+ messages in thread
From: Michael S. Tsirkin @ 2022-01-27 12:50 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: eduardo, Elena Ufimtseva, John Johnson, Jag Raman, Beraldo Leal,
	quintela, qemu-devel, armbru, john.levon,
	Philippe Mathieu-Daudé,
	Dr. David Alan Gilbert, Marc-André Lureau,
	Daniel P. Berrangé,
	thanos.makatos, Paolo Bonzini, Eric Blake

On Thu, Jan 27, 2022 at 08:30:13AM +0000, Stefan Hajnoczi wrote:
> On Wed, Jan 26, 2022 at 04:13:33PM -0500, Michael S. Tsirkin wrote:
> > On Wed, Jan 26, 2022 at 08:07:36PM +0000, Dr. David Alan Gilbert wrote:
> > > * Stefan Hajnoczi (stefanha@redhat.com) wrote:
> > > > On Wed, Jan 26, 2022 at 05:27:32AM +0000, Jag Raman wrote:
> > > > > 
> > > > > 
> > > > > > On Jan 25, 2022, at 1:38 PM, Dr. David Alan Gilbert <dgilbert@redhat.com> wrote:
> > > > > > 
> > > > > > * Jag Raman (jag.raman@oracle.com) wrote:
> > > > > >> 
> > > > > >> 
> > > > > >>> On Jan 19, 2022, at 7:12 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> > > > > >>> 
> > > > > >>> On Wed, Jan 19, 2022 at 04:41:52PM -0500, Jagannathan Raman wrote:
> > > > > >>>> Allow PCI buses to be part of isolated CPU address spaces. This has a
> > > > > >>>> niche usage.
> > > > > >>>> 
> > > > > >>>> TYPE_REMOTE_MACHINE allows multiple VMs to house their PCI devices in
> > > > > >>>> the same machine/server. This would cause address space collision as
> > > > > >>>> well as be a security vulnerability. Having separate address spaces for
> > > > > >>>> each PCI bus would solve this problem.
> > > > > >>> 
> > > > > >>> Fascinating, but I am not sure I understand. any examples?
> > > > > >> 
> > > > > >> Hi Michael!
> > > > > >> 
> > > > > >> multiprocess QEMU and vfio-user implement a client-server model to allow
> > > > > >> out-of-process emulation of devices. The client QEMU, which makes ioctls
> > > > > >> to the kernel and runs VCPUs, could attach devices running in a server
> > > > > >> QEMU. The server QEMU needs access to parts of the client’s RAM to
> > > > > >> perform DMA.
> > > > > > 
> > > > > > Do you ever have the opposite problem? i.e. when an emulated PCI device
> > > > > 
> > > > > That’s an interesting question.
> > > > > 
> > > > > > exposes a chunk of RAM-like space (frame buffer, or maybe a mapped file)
> > > > > > that the client can see.  What happens if two emulated devices need to
> > > > > > access each others emulated address space?
> > > > > 
> > > > > In this case, the kernel driver would map the destination’s chunk of internal RAM into
> > > > > the DMA space of the source device. Then the source device could write to that
> > > > > mapped address range, and the IOMMU should direct those writes to the
> > > > > destination device.
> > > > > 
> > > > > I would like to take a closer look at the IOMMU implementation on how to achieve
> > > > > this, and get back to you. I think the IOMMU would handle this. Could you please
> > > > > point me to the IOMMU implementation you have in mind?
> > > > 
> > > > I don't know if the current vfio-user client/server patches already
> > > > implement device-to-device DMA, but the functionality is supported by
> > > > the vfio-user protocol.
> > > > 
> > > > Basically: if the DMA regions lookup inside the vfio-user server fails,
> > > > fall back to VFIO_USER_DMA_READ/WRITE messages instead.
> > > > https://github.com/nutanix/libvfio-user/blob/master/docs/vfio-user.rst#vfio-user-dma-read
> > > > 
> > > > Here is the flow:
> > > > 1. The vfio-user server with device A sends a DMA read to QEMU.
> > > > 2. QEMU finds the MemoryRegion associated with the DMA address and sees
> > > >    it's a device.
> > > >    a. If it's emulated inside the QEMU process then the normal
> > > >       device emulation code kicks in.
> > > >    b. If it's another vfio-user PCI device then the vfio-user PCI proxy
> > > >       device forwards the DMA to the second vfio-user server's device B.
> > > 
> > > I'm starting to be curious if there's a way to persuade the guest kernel
> > > to do it for us; in general is there a way to say to PCI devices that
> > > they can only DMA to the host and not other PCI devices?
> > 
> > 
> > But of course - this is how e.g. VFIO protects host PCI devices from
> > each other when one of them is passed through to a VM.
> 
> Michael: Are you saying just turn on vIOMMU? :)
> 
> Devices in different VFIO groups have their own IOMMU context, so their
> IOVA space is isolated. Just don't map other devices into the IOVA space
> and those other devices will be inaccessible.
> 
> Stefan

I was wondering about it here:
https://lore.kernel.org/r/20220119190742-mutt-send-email-mst%40kernel.org

-- 
MST



^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH v5 17/18] vfio-user: register handlers to facilitate migration
  2022-01-25 15:48   ` Stefan Hajnoczi
@ 2022-01-27 17:04     ` Jag Raman
  2022-01-28  8:29       ` Stefan Hajnoczi
  0 siblings, 1 reply; 99+ messages in thread
From: Jag Raman @ 2022-01-27 17:04 UTC (permalink / raw)
  To: Stefan Hajnoczi, John Levon, Thanos Makatos
  Cc: eduardo, Elena Ufimtseva, John Johnson, Daniel P. Berrangé,
	Beraldo Leal, Michael S. Tsirkin, Markus Armbruster,
	Juan Quintela, qemu-devel, Philippe Mathieu-Daudé,
	Marc-André Lureau, Paolo Bonzini, Eric Blake,
	Dr . David Alan Gilbert



> On Jan 25, 2022, at 10:48 AM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
> 
> On Wed, Jan 19, 2022 at 04:42:06PM -0500, Jagannathan Raman wrote:
>> +     * The client subsequetly asks the remote server for any data that
> 
> subsequently
> 
>> +static void vfu_mig_state_running(vfu_ctx_t *vfu_ctx)
>> +{
>> +    VfuObject *o = vfu_get_private(vfu_ctx);
>> +    VfuObjectClass *k = VFU_OBJECT_GET_CLASS(OBJECT(o));
>> +    static int migrated_devs;
>> +    Error *local_err = NULL;
>> +    int ret;
>> +
>> +    /**
>> +     * TODO: move to VFU_MIGR_STATE_RESUME handler. Presently, the
>> +     * VMSD data from source is not available at RESUME state.
>> +     * Working on a fix for this.
>> +     */
>> +    if (!o->vfu_mig_file) {
>> +        o->vfu_mig_file = qemu_fopen_ops(o, &vfu_mig_fops_load, false);
>> +    }
>> +
>> +    ret = qemu_remote_loadvm(o->vfu_mig_file);
>> +    if (ret) {
>> +        VFU_OBJECT_ERROR(o, "vfu: failed to restore device state");
>> +        return;
>> +    }
>> +
>> +    qemu_file_shutdown(o->vfu_mig_file);
>> +    o->vfu_mig_file = NULL;
>> +
>> +    /* VFU_MIGR_STATE_RUNNING begins here */
>> +    if (++migrated_devs == k->nr_devs) {
> 
> When is this counter reset so migration can be tried again if it
> fails/cancels?

Detecting cancellation is a pending item. We will address it in the
next rev. Will check with you if  we get stuck during the process
of implementing it.

> 
>> +static ssize_t vfu_mig_read_data(vfu_ctx_t *vfu_ctx, void *buf,
>> +                                 uint64_t size, uint64_t offset)
>> +{
>> +    VfuObject *o = vfu_get_private(vfu_ctx);
>> +
>> +    if (offset > o->vfu_mig_buf_size) {
>> +        return -1;
>> +    }
>> +
>> +    if ((offset + size) > o->vfu_mig_buf_size) {
>> +        warn_report("vfu: buffer overflow - check pending_bytes");
>> +        size = o->vfu_mig_buf_size - offset;
>> +    }
>> +
>> +    memcpy(buf, (o->vfu_mig_buf + offset), size);
>> +
>> +    o->vfu_mig_buf_pending -= size;
> 
> This assumes that the caller increments offset by size each time. If
> that assumption is okay, then we can just trust offset and don't need to
> do arithmetic on vfu_mig_buf_pending. If that assumption is not correct,
> then the code needs to be extended to safely update vfu_mig_buf_pending
> when offset jumps around arbitrarily between calls.

Going by the definition of vfu_migration_callbacks_t in the library, I assumed
that read_data advances the offset by size bytes.

Will add a comment a comment to explain that.

> 
>> +uint64_t vmstate_vmsd_size(PCIDevice *pci_dev)
>> +{
>> +    DeviceClass *dc = DEVICE_GET_CLASS(DEVICE(pci_dev));
>> +    const VMStateField *field = NULL;
>> +    uint64_t size = 0;
>> +
>> +    if (!dc->vmsd) {
>> +        return 0;
>> +    }
>> +
>> +    field = dc->vmsd->fields;
>> +    while (field && field->name) {
>> +        size += vmstate_size(pci_dev, field);
>> +        field++;
>> +    }
>> +
>> +    return size;
>> +}
> 
> This function looks incorrect because it ignores subsections as well as
> runtime behavior during save(). Although VMStateDescription is partially
> declarative, there is still a bunch of imperative code that can write to
> the QEMUFile at save() time so there's no way of knowing the size ahead
> of time.

I see your point, it would be a problem for any field which has the
(VMS_BUFFER | VMS_ALLOC) flags set.

> 
> I asked this in a previous revision of this series but I'm not sure if
> it was answered: is it really necessary to know the size of the vmstate?
> I thought the VFIO migration interface is designed to support
> streaming reads/writes. We could choose a fixed size like 64KB and
> stream the vmstate in 64KB chunks.

The library exposes the migration data to the client as a device BAR with
fixed size - the size of which is fixed at boot time, even when using
vfu_migration_callbacks_t callbacks.

I don’t believe the library supports streaming vmstate/migration-data - see
the following comment in migration_region_access() defined in the library:

* Does this mean that partial reads are not allowed?

Thanos or John,

    Could you please clarify this?

Stefan,
    We attempted to answer the migration cancellation and vmstate size
    questions previously also, in the following email:

https://lore.kernel.org/all/F48606B1-15A4-4DD2-9D71-2FCAFC0E671F@oracle.com/

Thank you very much!
--
Jag


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH v5 03/18] pci: isolated address space for PCI bus
  2022-01-26 18:13           ` Dr. David Alan Gilbert
@ 2022-01-27 17:43             ` Jag Raman
  0 siblings, 0 replies; 99+ messages in thread
From: Jag Raman @ 2022-01-27 17:43 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: eduardo, Elena Ufimtseva, John Johnson, Daniel P. Berrangé,
	Beraldo Leal, john.levon, Michael S. Tsirkin, armbru, quintela,
	Philippe Mathieu-Daudé,
	qemu-devel, Marc-André Lureau, Stefan Hajnoczi,
	thanos.makatos, Paolo Bonzini, Eric Blake



> On Jan 26, 2022, at 1:13 PM, Dr. David Alan Gilbert <dgilbert@redhat.com> wrote:
> 
> * Jag Raman (jag.raman@oracle.com) wrote:
>> 
>> 
>>> On Jan 25, 2022, at 1:38 PM, Dr. David Alan Gilbert <dgilbert@redhat.com> wrote:
>>> 
>>> * Jag Raman (jag.raman@oracle.com) wrote:
>>>> 
>>>> 
>>>>> On Jan 19, 2022, at 7:12 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
>>>>> 
>>>>> On Wed, Jan 19, 2022 at 04:41:52PM -0500, Jagannathan Raman wrote:
>>>>>> Allow PCI buses to be part of isolated CPU address spaces. This has a
>>>>>> niche usage.
>>>>>> 
>>>>>> TYPE_REMOTE_MACHINE allows multiple VMs to house their PCI devices in
>>>>>> the same machine/server. This would cause address space collision as
>>>>>> well as be a security vulnerability. Having separate address spaces for
>>>>>> each PCI bus would solve this problem.
>>>>> 
>>>>> Fascinating, but I am not sure I understand. any examples?
>>>> 
>>>> Hi Michael!
>>>> 
>>>> multiprocess QEMU and vfio-user implement a client-server model to allow
>>>> out-of-process emulation of devices. The client QEMU, which makes ioctls
>>>> to the kernel and runs VCPUs, could attach devices running in a server
>>>> QEMU. The server QEMU needs access to parts of the client’s RAM to
>>>> perform DMA.
>>> 
>>> Do you ever have the opposite problem? i.e. when an emulated PCI device
>> 
>> That’s an interesting question.
>> 
>>> exposes a chunk of RAM-like space (frame buffer, or maybe a mapped file)
>>> that the client can see.  What happens if two emulated devices need to
>>> access each others emulated address space?
>> 
>> In this case, the kernel driver would map the destination’s chunk of internal RAM into
>> the DMA space of the source device. Then the source device could write to that
>> mapped address range, and the IOMMU should direct those writes to the
>> destination device.
> 
> Are all devices mappable like that?

If there is an IOMMU that supports DMA-Remapping (DMAR), that would be
possible - the kernel could configure the DMAR to facilitate such mapping.

If there is no DMAR, the kernel/cpu could buffer the write between devices.

--
Jag

> 
>> I would like to take a closer look at the IOMMU implementation on how to achieve
>> this, and get back to you. I think the IOMMU would handle this. Could you please
>> point me to the IOMMU implementation you have in mind?
> 
> I didn't have one in mind; I was just hitting a similar problem on
> Virtiofs DAX.
> 
> Dave
> 
>> Thank you!
>> --
>> Jag
>> 
>>> 
>>> Dave
>>> 
>>>> In the case where multiple clients attach devices that are running on the
>>>> same server, we need to ensure that each devices has isolated memory
>>>> ranges. This ensures that the memory space of one device is not visible
>>>> to other devices in the same server.
>>>> 
>>>>> 
>>>>> I also wonder whether this special type could be modelled like a special
>>>>> kind of iommu internally.
>>>> 
>>>> Could you please provide some more details on the design?
>>>> 
>>>>> 
>>>>>> Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
>>>>>> Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
>>>>>> Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
>>>>>> ---
>>>>>> include/hw/pci/pci.h     |  2 ++
>>>>>> include/hw/pci/pci_bus.h | 17 +++++++++++++++++
>>>>>> hw/pci/pci.c             | 17 +++++++++++++++++
>>>>>> hw/pci/pci_bridge.c      |  5 +++++
>>>>>> 4 files changed, 41 insertions(+)
>>>>>> 
>>>>>> diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
>>>>>> index 023abc0f79..9bb4472abc 100644
>>>>>> --- a/include/hw/pci/pci.h
>>>>>> +++ b/include/hw/pci/pci.h
>>>>>> @@ -387,6 +387,8 @@ void pci_device_save(PCIDevice *s, QEMUFile *f);
>>>>>> int pci_device_load(PCIDevice *s, QEMUFile *f);
>>>>>> MemoryRegion *pci_address_space(PCIDevice *dev);
>>>>>> MemoryRegion *pci_address_space_io(PCIDevice *dev);
>>>>>> +AddressSpace *pci_isol_as_mem(PCIDevice *dev);
>>>>>> +AddressSpace *pci_isol_as_io(PCIDevice *dev);
>>>>>> 
>>>>>> /*
>>>>>> * Should not normally be used by devices. For use by sPAPR target
>>>>>> diff --git a/include/hw/pci/pci_bus.h b/include/hw/pci/pci_bus.h
>>>>>> index 347440d42c..d78258e79e 100644
>>>>>> --- a/include/hw/pci/pci_bus.h
>>>>>> +++ b/include/hw/pci/pci_bus.h
>>>>>> @@ -39,9 +39,26 @@ struct PCIBus {
>>>>>>   void *irq_opaque;
>>>>>>   PCIDevice *devices[PCI_SLOT_MAX * PCI_FUNC_MAX];
>>>>>>   PCIDevice *parent_dev;
>>>>>> +
>>>>>>   MemoryRegion *address_space_mem;
>>>>>>   MemoryRegion *address_space_io;
>>>>>> 
>>>>>> +    /**
>>>>>> +     * Isolated address spaces - these allow the PCI bus to be part
>>>>>> +     * of an isolated address space as opposed to the global
>>>>>> +     * address_space_memory & address_space_io.
>>>>> 
>>>>> Are you sure address_space_memory & address_space_io are
>>>>> always global? even in the case of an iommu?
>>>> 
>>>> On the CPU side of the Root Complex, I believe address_space_memory
>>>> & address_space_io are global.
>>>> 
>>>> In the vfio-user case, devices on the same machine (TYPE_REMOTE_MACHINE)
>>>> could be attached to different clients VMs. Each client would have their own address
>>>> space for their CPUs. With isolated address spaces, we ensure that the devices
>>>> see the address space of the CPUs they’re attached to.
>>>> 
>>>> Not sure if it’s OK to share weblinks in this mailing list, please let me know if that’s
>>>> not preferred. But I’m referring to the terminology used in the following block diagram:
>>>> https://en.wikipedia.org/wiki/Root_complex#/media/File:Example_PCI_Express_Topology.svg
>>>> 
>>>>> 
>>>>>> This allows the
>>>>>> +     * bus to be attached to CPUs from different machines. The
>>>>>> +     * following is not used used commonly.
>>>>>> +     *
>>>>>> +     * TYPE_REMOTE_MACHINE allows emulating devices from multiple
>>>>>> +     * VM clients,
>>>>> 
>>>>> what are VM clients?
>>>> 
>>>> It’s the client in the client - server model explained above.
>>>> 
>>>> Thank you!
>>>> --
>>>> Jag
>>>> 
>>>>> 
>>>>>> as such it needs the PCI buses in the same machine
>>>>>> +     * to be part of different CPU address spaces. The following is
>>>>>> +     * useful in that scenario.
>>>>>> +     *
>>>>>> +     */
>>>>>> +    AddressSpace *isol_as_mem;
>>>>>> +    AddressSpace *isol_as_io;
>>>>>> +
>>>>>>   QLIST_HEAD(, PCIBus) child; /* this will be replaced by qdev later */
>>>>>>   QLIST_ENTRY(PCIBus) sibling;/* this will be replaced by qdev later */
>>>>>> 
>>>>>> diff --git a/hw/pci/pci.c b/hw/pci/pci.c
>>>>>> index 5d30f9ca60..d5f1c6c421 100644
>>>>>> --- a/hw/pci/pci.c
>>>>>> +++ b/hw/pci/pci.c
>>>>>> @@ -442,6 +442,8 @@ static void pci_root_bus_internal_init(PCIBus *bus, DeviceState *parent,
>>>>>>   bus->slot_reserved_mask = 0x0;
>>>>>>   bus->address_space_mem = address_space_mem;
>>>>>>   bus->address_space_io = address_space_io;
>>>>>> +    bus->isol_as_mem = NULL;
>>>>>> +    bus->isol_as_io = NULL;
>>>>>>   bus->flags |= PCI_BUS_IS_ROOT;
>>>>>> 
>>>>>>   /* host bridge */
>>>>>> @@ -2676,6 +2678,16 @@ MemoryRegion *pci_address_space_io(PCIDevice *dev)
>>>>>>   return pci_get_bus(dev)->address_space_io;
>>>>>> }
>>>>>> 
>>>>>> +AddressSpace *pci_isol_as_mem(PCIDevice *dev)
>>>>>> +{
>>>>>> +    return pci_get_bus(dev)->isol_as_mem;
>>>>>> +}
>>>>>> +
>>>>>> +AddressSpace *pci_isol_as_io(PCIDevice *dev)
>>>>>> +{
>>>>>> +    return pci_get_bus(dev)->isol_as_io;
>>>>>> +}
>>>>>> +
>>>>>> static void pci_device_class_init(ObjectClass *klass, void *data)
>>>>>> {
>>>>>>   DeviceClass *k = DEVICE_CLASS(klass);
>>>>>> @@ -2699,6 +2711,7 @@ static void pci_device_class_base_init(ObjectClass *klass, void *data)
>>>>>> 
>>>>>> AddressSpace *pci_device_iommu_address_space(PCIDevice *dev)
>>>>>> {
>>>>>> +    AddressSpace *iommu_as = NULL;
>>>>>>   PCIBus *bus = pci_get_bus(dev);
>>>>>>   PCIBus *iommu_bus = bus;
>>>>>>   uint8_t devfn = dev->devfn;
>>>>>> @@ -2745,6 +2758,10 @@ AddressSpace *pci_device_iommu_address_space(PCIDevice *dev)
>>>>>>   if (!pci_bus_bypass_iommu(bus) && iommu_bus && iommu_bus->iommu_fn) {
>>>>>>       return iommu_bus->iommu_fn(bus, iommu_bus->iommu_opaque, devfn);
>>>>>>   }
>>>>>> +    iommu_as = pci_isol_as_mem(dev);
>>>>>> +    if (iommu_as) {
>>>>>> +        return iommu_as;
>>>>>> +    }
>>>>>>   return &address_space_memory;
>>>>>> }
>>>>>> 
>>>>>> diff --git a/hw/pci/pci_bridge.c b/hw/pci/pci_bridge.c
>>>>>> index da34c8ebcd..98366768d2 100644
>>>>>> --- a/hw/pci/pci_bridge.c
>>>>>> +++ b/hw/pci/pci_bridge.c
>>>>>> @@ -383,6 +383,11 @@ void pci_bridge_initfn(PCIDevice *dev, const char *typename)
>>>>>>   sec_bus->address_space_io = &br->address_space_io;
>>>>>>   memory_region_init(&br->address_space_io, OBJECT(br), "pci_bridge_io",
>>>>>>                      4 * GiB);
>>>>>> +
>>>>>> +    /* This PCI bridge puts the sec_bus in its parent's address space */
>>>>>> +    sec_bus->isol_as_mem = pci_isol_as_mem(dev);
>>>>>> +    sec_bus->isol_as_io = pci_isol_as_io(dev);
>>>>>> +
>>>>>>   br->windows = pci_bridge_region_init(br);
>>>>>>   QLIST_INIT(&sec_bus->child);
>>>>>>   QLIST_INSERT_HEAD(&parent->child, sec_bus, sibling);
>>>>>> -- 
>>>>>> 2.20.1
>>>> 
>>> -- 
>>> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
>>> 
>> 
> -- 
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> 


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH v5 03/18] pci: isolated address space for PCI bus
  2022-01-27  8:30                 ` Stefan Hajnoczi
  2022-01-27 12:50                   ` Michael S. Tsirkin
@ 2022-01-27 21:22                   ` Alex Williamson
  2022-01-28  8:19                     ` Stefan Hajnoczi
                                       ` (2 more replies)
  1 sibling, 3 replies; 99+ messages in thread
From: Alex Williamson @ 2022-01-27 21:22 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: eduardo, Elena Ufimtseva, Jag Raman, Beraldo Leal, John Johnson,
	Michael S. Tsirkin, qemu-devel, armbru, quintela, Paolo Bonzini,
	Marc-André Lureau, Dr. David Alan Gilbert, thanos.makatos,
	Daniel P. Berrangé,
	Eric Blake, john.levon, Philippe Mathieu-Daudé

On Thu, 27 Jan 2022 08:30:13 +0000
Stefan Hajnoczi <stefanha@redhat.com> wrote:

> On Wed, Jan 26, 2022 at 04:13:33PM -0500, Michael S. Tsirkin wrote:
> > On Wed, Jan 26, 2022 at 08:07:36PM +0000, Dr. David Alan Gilbert wrote:  
> > > * Stefan Hajnoczi (stefanha@redhat.com) wrote:  
> > > > On Wed, Jan 26, 2022 at 05:27:32AM +0000, Jag Raman wrote:  
> > > > > 
> > > > >   
> > > > > > On Jan 25, 2022, at 1:38 PM, Dr. David Alan Gilbert <dgilbert@redhat.com> wrote:
> > > > > > 
> > > > > > * Jag Raman (jag.raman@oracle.com) wrote:  
> > > > > >> 
> > > > > >>   
> > > > > >>> On Jan 19, 2022, at 7:12 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> > > > > >>> 
> > > > > >>> On Wed, Jan 19, 2022 at 04:41:52PM -0500, Jagannathan Raman wrote:  
> > > > > >>>> Allow PCI buses to be part of isolated CPU address spaces. This has a
> > > > > >>>> niche usage.
> > > > > >>>> 
> > > > > >>>> TYPE_REMOTE_MACHINE allows multiple VMs to house their PCI devices in
> > > > > >>>> the same machine/server. This would cause address space collision as
> > > > > >>>> well as be a security vulnerability. Having separate address spaces for
> > > > > >>>> each PCI bus would solve this problem.  
> > > > > >>> 
> > > > > >>> Fascinating, but I am not sure I understand. any examples?  
> > > > > >> 
> > > > > >> Hi Michael!
> > > > > >> 
> > > > > >> multiprocess QEMU and vfio-user implement a client-server model to allow
> > > > > >> out-of-process emulation of devices. The client QEMU, which makes ioctls
> > > > > >> to the kernel and runs VCPUs, could attach devices running in a server
> > > > > >> QEMU. The server QEMU needs access to parts of the client’s RAM to
> > > > > >> perform DMA.  
> > > > > > 
> > > > > > Do you ever have the opposite problem? i.e. when an emulated PCI device  
> > > > > 
> > > > > That’s an interesting question.
> > > > >   
> > > > > > exposes a chunk of RAM-like space (frame buffer, or maybe a mapped file)
> > > > > > that the client can see.  What happens if two emulated devices need to
> > > > > > access each others emulated address space?  
> > > > > 
> > > > > In this case, the kernel driver would map the destination’s chunk of internal RAM into
> > > > > the DMA space of the source device. Then the source device could write to that
> > > > > mapped address range, and the IOMMU should direct those writes to the
> > > > > destination device.
> > > > > 
> > > > > I would like to take a closer look at the IOMMU implementation on how to achieve
> > > > > this, and get back to you. I think the IOMMU would handle this. Could you please
> > > > > point me to the IOMMU implementation you have in mind?  
> > > > 
> > > > I don't know if the current vfio-user client/server patches already
> > > > implement device-to-device DMA, but the functionality is supported by
> > > > the vfio-user protocol.
> > > > 
> > > > Basically: if the DMA regions lookup inside the vfio-user server fails,
> > > > fall back to VFIO_USER_DMA_READ/WRITE messages instead.
> > > > https://github.com/nutanix/libvfio-user/blob/master/docs/vfio-user.rst#vfio-user-dma-read
> > > > 
> > > > Here is the flow:
> > > > 1. The vfio-user server with device A sends a DMA read to QEMU.
> > > > 2. QEMU finds the MemoryRegion associated with the DMA address and sees
> > > >    it's a device.
> > > >    a. If it's emulated inside the QEMU process then the normal
> > > >       device emulation code kicks in.
> > > >    b. If it's another vfio-user PCI device then the vfio-user PCI proxy
> > > >       device forwards the DMA to the second vfio-user server's device B.  
> > > 
> > > I'm starting to be curious if there's a way to persuade the guest kernel
> > > to do it for us; in general is there a way to say to PCI devices that
> > > they can only DMA to the host and not other PCI devices?  
> > 
> > 
> > But of course - this is how e.g. VFIO protects host PCI devices from
> > each other when one of them is passed through to a VM.  
> 
> Michael: Are you saying just turn on vIOMMU? :)
> 
> Devices in different VFIO groups have their own IOMMU context, so their
> IOVA space is isolated. Just don't map other devices into the IOVA space
> and those other devices will be inaccessible.

Devices in different VFIO *containers* have their own IOMMU context.
Based on the group attachment to a container, groups can either have
shared or isolated IOVA space.  That determination is made by looking
at the address space of the bus, which is governed by the presence of a
vIOMMU.

If the goal here is to restrict DMA between devices, ie. peer-to-peer
(p2p), why are we trying to re-invent what an IOMMU already does?  In
fact, it seems like an IOMMU does this better in providing an IOVA
address space per BDF.  Is the dynamic mapping overhead too much?  What
physical hardware properties or specifications could we leverage to
restrict p2p mappings to a device?  Should it be governed by machine
type to provide consistency between devices?  Should each "isolated"
bus be in a separate root complex?  Thanks,

Alex



^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH v5 03/18] pci: isolated address space for PCI bus
  2022-01-27 21:22                   ` Alex Williamson
@ 2022-01-28  8:19                     ` Stefan Hajnoczi
  2022-01-28  9:18                     ` Stefan Hajnoczi
  2022-02-01 10:42                     ` Dr. David Alan Gilbert
  2 siblings, 0 replies; 99+ messages in thread
From: Stefan Hajnoczi @ 2022-01-28  8:19 UTC (permalink / raw)
  To: Alex Williamson
  Cc: eduardo, Elena Ufimtseva, Jag Raman, Beraldo Leal, John Johnson,
	Michael S. Tsirkin, qemu-devel, armbru, quintela, Paolo Bonzini,
	Marc-André Lureau, Dr. David Alan Gilbert, thanos.makatos,
	Daniel P. Berrangé,
	Eric Blake, john.levon, Philippe Mathieu-Daudé

[-- Attachment #1: Type: text/plain, Size: 4898 bytes --]

On Thu, Jan 27, 2022 at 02:22:53PM -0700, Alex Williamson wrote:
> On Thu, 27 Jan 2022 08:30:13 +0000
> Stefan Hajnoczi <stefanha@redhat.com> wrote:
> 
> > On Wed, Jan 26, 2022 at 04:13:33PM -0500, Michael S. Tsirkin wrote:
> > > On Wed, Jan 26, 2022 at 08:07:36PM +0000, Dr. David Alan Gilbert wrote:  
> > > > * Stefan Hajnoczi (stefanha@redhat.com) wrote:  
> > > > > On Wed, Jan 26, 2022 at 05:27:32AM +0000, Jag Raman wrote:  
> > > > > > 
> > > > > >   
> > > > > > > On Jan 25, 2022, at 1:38 PM, Dr. David Alan Gilbert <dgilbert@redhat.com> wrote:
> > > > > > > 
> > > > > > > * Jag Raman (jag.raman@oracle.com) wrote:  
> > > > > > >> 
> > > > > > >>   
> > > > > > >>> On Jan 19, 2022, at 7:12 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> > > > > > >>> 
> > > > > > >>> On Wed, Jan 19, 2022 at 04:41:52PM -0500, Jagannathan Raman wrote:  
> > > > > > >>>> Allow PCI buses to be part of isolated CPU address spaces. This has a
> > > > > > >>>> niche usage.
> > > > > > >>>> 
> > > > > > >>>> TYPE_REMOTE_MACHINE allows multiple VMs to house their PCI devices in
> > > > > > >>>> the same machine/server. This would cause address space collision as
> > > > > > >>>> well as be a security vulnerability. Having separate address spaces for
> > > > > > >>>> each PCI bus would solve this problem.  
> > > > > > >>> 
> > > > > > >>> Fascinating, but I am not sure I understand. any examples?  
> > > > > > >> 
> > > > > > >> Hi Michael!
> > > > > > >> 
> > > > > > >> multiprocess QEMU and vfio-user implement a client-server model to allow
> > > > > > >> out-of-process emulation of devices. The client QEMU, which makes ioctls
> > > > > > >> to the kernel and runs VCPUs, could attach devices running in a server
> > > > > > >> QEMU. The server QEMU needs access to parts of the client’s RAM to
> > > > > > >> perform DMA.  
> > > > > > > 
> > > > > > > Do you ever have the opposite problem? i.e. when an emulated PCI device  
> > > > > > 
> > > > > > That’s an interesting question.
> > > > > >   
> > > > > > > exposes a chunk of RAM-like space (frame buffer, or maybe a mapped file)
> > > > > > > that the client can see.  What happens if two emulated devices need to
> > > > > > > access each others emulated address space?  
> > > > > > 
> > > > > > In this case, the kernel driver would map the destination’s chunk of internal RAM into
> > > > > > the DMA space of the source device. Then the source device could write to that
> > > > > > mapped address range, and the IOMMU should direct those writes to the
> > > > > > destination device.
> > > > > > 
> > > > > > I would like to take a closer look at the IOMMU implementation on how to achieve
> > > > > > this, and get back to you. I think the IOMMU would handle this. Could you please
> > > > > > point me to the IOMMU implementation you have in mind?  
> > > > > 
> > > > > I don't know if the current vfio-user client/server patches already
> > > > > implement device-to-device DMA, but the functionality is supported by
> > > > > the vfio-user protocol.
> > > > > 
> > > > > Basically: if the DMA regions lookup inside the vfio-user server fails,
> > > > > fall back to VFIO_USER_DMA_READ/WRITE messages instead.
> > > > > https://github.com/nutanix/libvfio-user/blob/master/docs/vfio-user.rst#vfio-user-dma-read
> > > > > 
> > > > > Here is the flow:
> > > > > 1. The vfio-user server with device A sends a DMA read to QEMU.
> > > > > 2. QEMU finds the MemoryRegion associated with the DMA address and sees
> > > > >    it's a device.
> > > > >    a. If it's emulated inside the QEMU process then the normal
> > > > >       device emulation code kicks in.
> > > > >    b. If it's another vfio-user PCI device then the vfio-user PCI proxy
> > > > >       device forwards the DMA to the second vfio-user server's device B.  
> > > > 
> > > > I'm starting to be curious if there's a way to persuade the guest kernel
> > > > to do it for us; in general is there a way to say to PCI devices that
> > > > they can only DMA to the host and not other PCI devices?  
> > > 
> > > 
> > > But of course - this is how e.g. VFIO protects host PCI devices from
> > > each other when one of them is passed through to a VM.  
> > 
> > Michael: Are you saying just turn on vIOMMU? :)
> > 
> > Devices in different VFIO groups have their own IOMMU context, so their
> > IOVA space is isolated. Just don't map other devices into the IOVA space
> > and those other devices will be inaccessible.
> 
> Devices in different VFIO *containers* have their own IOMMU context.
> Based on the group attachment to a container, groups can either have
> shared or isolated IOVA space.  That determination is made by looking
> at the address space of the bus, which is governed by the presence of a
> vIOMMU.

Oops, thank you for pointing that out!

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH v5 17/18] vfio-user: register handlers to facilitate migration
  2022-01-27 17:04     ` Jag Raman
@ 2022-01-28  8:29       ` Stefan Hajnoczi
  2022-01-28 14:49         ` Thanos Makatos
  2022-02-01  3:49         ` Jag Raman
  0 siblings, 2 replies; 99+ messages in thread
From: Stefan Hajnoczi @ 2022-01-28  8:29 UTC (permalink / raw)
  To: Jag Raman
  Cc: eduardo, Elena Ufimtseva, John Johnson, Daniel P. Berrangé,
	Beraldo Leal, John Levon, Markus Armbruster, Michael S. Tsirkin,
	Philippe Mathieu-Daudé,
	qemu-devel, Juan Quintela, Marc-André Lureau, Paolo Bonzini,
	Thanos Makatos, Eric Blake, Dr . David Alan Gilbert

[-- Attachment #1: Type: text/plain, Size: 7537 bytes --]

On Thu, Jan 27, 2022 at 05:04:26PM +0000, Jag Raman wrote:
> 
> 
> > On Jan 25, 2022, at 10:48 AM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > 
> > On Wed, Jan 19, 2022 at 04:42:06PM -0500, Jagannathan Raman wrote:
> >> +     * The client subsequetly asks the remote server for any data that
> > 
> > subsequently
> > 
> >> +static void vfu_mig_state_running(vfu_ctx_t *vfu_ctx)
> >> +{
> >> +    VfuObject *o = vfu_get_private(vfu_ctx);
> >> +    VfuObjectClass *k = VFU_OBJECT_GET_CLASS(OBJECT(o));
> >> +    static int migrated_devs;
> >> +    Error *local_err = NULL;
> >> +    int ret;
> >> +
> >> +    /**
> >> +     * TODO: move to VFU_MIGR_STATE_RESUME handler. Presently, the
> >> +     * VMSD data from source is not available at RESUME state.
> >> +     * Working on a fix for this.
> >> +     */
> >> +    if (!o->vfu_mig_file) {
> >> +        o->vfu_mig_file = qemu_fopen_ops(o, &vfu_mig_fops_load, false);
> >> +    }
> >> +
> >> +    ret = qemu_remote_loadvm(o->vfu_mig_file);
> >> +    if (ret) {
> >> +        VFU_OBJECT_ERROR(o, "vfu: failed to restore device state");
> >> +        return;
> >> +    }
> >> +
> >> +    qemu_file_shutdown(o->vfu_mig_file);
> >> +    o->vfu_mig_file = NULL;
> >> +
> >> +    /* VFU_MIGR_STATE_RUNNING begins here */
> >> +    if (++migrated_devs == k->nr_devs) {
> > 
> > When is this counter reset so migration can be tried again if it
> > fails/cancels?
> 
> Detecting cancellation is a pending item. We will address it in the
> next rev. Will check with you if  we get stuck during the process
> of implementing it.
> 
> > 
> >> +static ssize_t vfu_mig_read_data(vfu_ctx_t *vfu_ctx, void *buf,
> >> +                                 uint64_t size, uint64_t offset)
> >> +{
> >> +    VfuObject *o = vfu_get_private(vfu_ctx);
> >> +
> >> +    if (offset > o->vfu_mig_buf_size) {
> >> +        return -1;
> >> +    }
> >> +
> >> +    if ((offset + size) > o->vfu_mig_buf_size) {
> >> +        warn_report("vfu: buffer overflow - check pending_bytes");
> >> +        size = o->vfu_mig_buf_size - offset;
> >> +    }
> >> +
> >> +    memcpy(buf, (o->vfu_mig_buf + offset), size);
> >> +
> >> +    o->vfu_mig_buf_pending -= size;
> > 
> > This assumes that the caller increments offset by size each time. If
> > that assumption is okay, then we can just trust offset and don't need to
> > do arithmetic on vfu_mig_buf_pending. If that assumption is not correct,
> > then the code needs to be extended to safely update vfu_mig_buf_pending
> > when offset jumps around arbitrarily between calls.
> 
> Going by the definition of vfu_migration_callbacks_t in the library, I assumed
> that read_data advances the offset by size bytes.
> 
> Will add a comment a comment to explain that.
> 
> > 
> >> +uint64_t vmstate_vmsd_size(PCIDevice *pci_dev)
> >> +{
> >> +    DeviceClass *dc = DEVICE_GET_CLASS(DEVICE(pci_dev));
> >> +    const VMStateField *field = NULL;
> >> +    uint64_t size = 0;
> >> +
> >> +    if (!dc->vmsd) {
> >> +        return 0;
> >> +    }
> >> +
> >> +    field = dc->vmsd->fields;
> >> +    while (field && field->name) {
> >> +        size += vmstate_size(pci_dev, field);
> >> +        field++;
> >> +    }
> >> +
> >> +    return size;
> >> +}
> > 
> > This function looks incorrect because it ignores subsections as well as
> > runtime behavior during save(). Although VMStateDescription is partially
> > declarative, there is still a bunch of imperative code that can write to
> > the QEMUFile at save() time so there's no way of knowing the size ahead
> > of time.
> 
> I see your point, it would be a problem for any field which has the
> (VMS_BUFFER | VMS_ALLOC) flags set.
> 
> > 
> > I asked this in a previous revision of this series but I'm not sure if
> > it was answered: is it really necessary to know the size of the vmstate?
> > I thought the VFIO migration interface is designed to support
> > streaming reads/writes. We could choose a fixed size like 64KB and
> > stream the vmstate in 64KB chunks.
> 
> The library exposes the migration data to the client as a device BAR with
> fixed size - the size of which is fixed at boot time, even when using
> vfu_migration_callbacks_t callbacks.
> 
> I don’t believe the library supports streaming vmstate/migration-data - see
> the following comment in migration_region_access() defined in the library:
> 
> * Does this mean that partial reads are not allowed?
> 
> Thanos or John,
> 
>     Could you please clarify this?
> 
> Stefan,
>     We attempted to answer the migration cancellation and vmstate size
>     questions previously also, in the following email:
> 
> https://lore.kernel.org/all/F48606B1-15A4-4DD2-9D71-2FCAFC0E671F@oracle.com/

>  libvfio-user has the vfu_migration_callbacks_t interface that allows the
>  device to save/load more data regardless of the size of the migration
>  region. I don't see the issue here since the region doesn't need to be
>  sized to fit the savevm data?

The answer didn't make sense to me:

"In both scenarios at the server end - whether using the migration BAR or
using callbacks, the migration data is transported to the other end using
the BAR. As such we need to specify the BAR’s size during initialization.

In the case of the callbacks, the library translates the BAR access to callbacks."

The BAR and the migration region within it need a size but my
understanding is that VFIO migration is designed to stream the device
state, allowing it to be broken up into multiple reads/writes with
knowing the device state's size upfront. Here is the description from
<linux/vfio.h>:

  * The sequence to be followed while in pre-copy state and stop-and-copy state
  * is as follows:
  * a. Read pending_bytes, indicating the start of a new iteration to get device
  *    data. Repeated read on pending_bytes at this stage should have no side
  *    effects.
  *    If pending_bytes == 0, the user application should not iterate to get data
  *    for that device.
  *    If pending_bytes > 0, perform the following steps.
  * b. Read data_offset, indicating that the vendor driver should make data
  *    available through the data section. The vendor driver should return this
  *    read operation only after data is available from (region + data_offset)
  *    to (region + data_offset + data_size).
  * c. Read data_size, which is the amount of data in bytes available through
  *    the migration region.
  *    Read on data_offset and data_size should return the offset and size of
  *    the current buffer if the user application reads data_offset and
  *    data_size more than once here.
  * d. Read data_size bytes of data from (region + data_offset) from the
  *    migration region.
  * e. Process the data.
  * f. Read pending_bytes, which indicates that the data from the previous
  *    iteration has been read. If pending_bytes > 0, go to step b.
  *
  * The user application can transition from the _SAVING|_RUNNING
  * (pre-copy state) to the _SAVING (stop-and-copy) state regardless of the
  * number of pending bytes. The user application should iterate in _SAVING
  * (stop-and-copy) until pending_bytes is 0.

This means you can report pending_bytes > 0 until the entire vmstate has
been read and can pick a fixed chunk size like 64KB for the migration
region. There's no need to size the migration region to fit the entire
vmstate.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH v5 03/18] pci: isolated address space for PCI bus
  2022-01-27 21:22                   ` Alex Williamson
  2022-01-28  8:19                     ` Stefan Hajnoczi
@ 2022-01-28  9:18                     ` Stefan Hajnoczi
  2022-01-31 16:16                       ` Alex Williamson
  2022-02-01 10:42                     ` Dr. David Alan Gilbert
  2 siblings, 1 reply; 99+ messages in thread
From: Stefan Hajnoczi @ 2022-01-28  9:18 UTC (permalink / raw)
  To: Alex Williamson
  Cc: eduardo, Elena Ufimtseva, Jag Raman, Beraldo Leal, John Johnson,
	Michael S. Tsirkin, qemu-devel, armbru, quintela, Paolo Bonzini,
	Marc-André Lureau, Dr. David Alan Gilbert, thanos.makatos,
	Daniel P. Berrangé,
	Eric Blake, john.levon, Philippe Mathieu-Daudé

[-- Attachment #1: Type: text/plain, Size: 1725 bytes --]

On Thu, Jan 27, 2022 at 02:22:53PM -0700, Alex Williamson wrote:
> If the goal here is to restrict DMA between devices, ie. peer-to-peer
> (p2p), why are we trying to re-invent what an IOMMU already does?

The issue Dave raised is that vfio-user servers run in separate
processses from QEMU with shared memory access to RAM but no direct
access to non-RAM MemoryRegions. The virtiofs DAX Window BAR is one
example of a non-RAM MemoryRegion that can be the source/target of DMA
requests.

I don't think IOMMUs solve this problem but luckily the vfio-user
protocol already has messages that vfio-user servers can use as a
fallback when DMA cannot be completed through the shared memory RAM
accesses.

> In
> fact, it seems like an IOMMU does this better in providing an IOVA
> address space per BDF.  Is the dynamic mapping overhead too much?  What
> physical hardware properties or specifications could we leverage to
> restrict p2p mappings to a device?  Should it be governed by machine
> type to provide consistency between devices?  Should each "isolated"
> bus be in a separate root complex?  Thanks,

There is a separate issue in this patch series regarding isolating the
address space where BAR accesses are made (i.e. the global
address_space_memory/io). When one process hosts multiple vfio-user
server instances (e.g. a software-defined network switch with multiple
ethernet devices) then each instance needs isolated memory and io address
spaces so that vfio-user clients don't cause collisions when they map
BARs to the same address.

I think the the separate root complex idea is a good solution. This
patch series takes a different approach by adding the concept of
isolated address spaces into hw/pci/.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 99+ messages in thread

* RE: [PATCH v5 17/18] vfio-user: register handlers to facilitate migration
  2022-01-28  8:29       ` Stefan Hajnoczi
@ 2022-01-28 14:49         ` Thanos Makatos
  2022-02-01  3:49         ` Jag Raman
  1 sibling, 0 replies; 99+ messages in thread
From: Thanos Makatos @ 2022-01-28 14:49 UTC (permalink / raw)
  To: Stefan Hajnoczi, Jag Raman
  Cc: eduardo, Elena Ufimtseva, John Johnson, Daniel P. Berrangé,
	Beraldo Leal, John Levon, Markus Armbruster, Michael S. Tsirkin,
	Philippe Mathieu-Daudé,
	qemu-devel, Juan Quintela, Marc-André Lureau, Paolo Bonzini,
	Eric Blake, Dr . David Alan Gilbert



> -----Original Message-----
> From: Stefan Hajnoczi <stefanha@redhat.com>
> Sent: 28 January 2022 08:29
> To: Jag Raman <jag.raman@oracle.com>
> Cc: John Levon <john.levon@nutanix.com>; Thanos Makatos
> <thanos.makatos@nutanix.com>; qemu-devel <qemu-devel@nongnu.org>;
> Marc-André Lureau <marcandre.lureau@gmail.com>; Philippe Mathieu-Daudé
> <f4bug@amsat.org>; Paolo Bonzini <pbonzini@redhat.com>; Beraldo Leal
> <bleal@redhat.com>; Daniel P. Berrangé <berrange@redhat.com>;
> eduardo@habkost.net; Michael S. Tsirkin <mst@redhat.com>; Marcel
> Apfelbaum <marcel.apfelbaum@gmail.com>; Eric Blake <eblake@redhat.com>;
> Markus Armbruster <armbru@redhat.com>; Juan Quintela
> <quintela@redhat.com>; Dr . David Alan Gilbert <dgilbert@redhat.com>; Elena
> Ufimtseva <elena.ufimtseva@oracle.com>; John Johnson
> <john.g.johnson@oracle.com>
> Subject: Re: [PATCH v5 17/18] vfio-user: register handlers to facilitate migration
> 
> On Thu, Jan 27, 2022 at 05:04:26PM +0000, Jag Raman wrote:
> >
> >
> > > On Jan 25, 2022, at 10:48 AM, Stefan Hajnoczi <stefanha@redhat.com>
> wrote:
> > >
> > > On Wed, Jan 19, 2022 at 04:42:06PM -0500, Jagannathan Raman wrote:
> > >> +     * The client subsequetly asks the remote server for any data that
> > >
> > > subsequently
> > >
> > >> +static void vfu_mig_state_running(vfu_ctx_t *vfu_ctx)
> > >> +{
> > >> +    VfuObject *o = vfu_get_private(vfu_ctx);
> > >> +    VfuObjectClass *k = VFU_OBJECT_GET_CLASS(OBJECT(o));
> > >> +    static int migrated_devs;
> > >> +    Error *local_err = NULL;
> > >> +    int ret;
> > >> +
> > >> +    /**
> > >> +     * TODO: move to VFU_MIGR_STATE_RESUME handler. Presently, the
> > >> +     * VMSD data from source is not available at RESUME state.
> > >> +     * Working on a fix for this.
> > >> +     */
> > >> +    if (!o->vfu_mig_file) {
> > >> +        o->vfu_mig_file = qemu_fopen_ops(o, &vfu_mig_fops_load, false);
> > >> +    }
> > >> +
> > >> +    ret = qemu_remote_loadvm(o->vfu_mig_file);
> > >> +    if (ret) {
> > >> +        VFU_OBJECT_ERROR(o, "vfu: failed to restore device state");
> > >> +        return;
> > >> +    }
> > >> +
> > >> +    qemu_file_shutdown(o->vfu_mig_file);
> > >> +    o->vfu_mig_file = NULL;
> > >> +
> > >> +    /* VFU_MIGR_STATE_RUNNING begins here */
> > >> +    if (++migrated_devs == k->nr_devs) {
> > >
> > > When is this counter reset so migration can be tried again if it
> > > fails/cancels?
> >
> > Detecting cancellation is a pending item. We will address it in the
> > next rev. Will check with you if  we get stuck during the process
> > of implementing it.
> >
> > >
> > >> +static ssize_t vfu_mig_read_data(vfu_ctx_t *vfu_ctx, void *buf,
> > >> +                                 uint64_t size, uint64_t offset)
> > >> +{
> > >> +    VfuObject *o = vfu_get_private(vfu_ctx);
> > >> +
> > >> +    if (offset > o->vfu_mig_buf_size) {
> > >> +        return -1;
> > >> +    }
> > >> +
> > >> +    if ((offset + size) > o->vfu_mig_buf_size) {
> > >> +        warn_report("vfu: buffer overflow - check pending_bytes");
> > >> +        size = o->vfu_mig_buf_size - offset;
> > >> +    }
> > >> +
> > >> +    memcpy(buf, (o->vfu_mig_buf + offset), size);
> > >> +
> > >> +    o->vfu_mig_buf_pending -= size;
> > >
> > > This assumes that the caller increments offset by size each time. If
> > > that assumption is okay, then we can just trust offset and don't need to
> > > do arithmetic on vfu_mig_buf_pending. If that assumption is not correct,
> > > then the code needs to be extended to safely update vfu_mig_buf_pending
> > > when offset jumps around arbitrarily between calls.
> >
> > Going by the definition of vfu_migration_callbacks_t in the library, I assumed
> > that read_data advances the offset by size bytes.
> >
> > Will add a comment a comment to explain that.

libvfio-user does not automatically increment offset by size each time, since
the vfio-user client can re-read the migration data multiple times. In
libvfio-user API we state:

    Function that is called to read migration data. offset and size can be
    any subrange on the offset and size previously returned by prepare_data.

Reading the pending_bytes register is what marks the end of the iteration, and
this is where you need to decrement vfu_mig_buf_pending.

I'll add more unit tests to libvfio-user to validate this behavior.

> >
> > >
> > >> +uint64_t vmstate_vmsd_size(PCIDevice *pci_dev)
> > >> +{
> > >> +    DeviceClass *dc = DEVICE_GET_CLASS(DEVICE(pci_dev));
> > >> +    const VMStateField *field = NULL;
> > >> +    uint64_t size = 0;
> > >> +
> > >> +    if (!dc->vmsd) {
> > >> +        return 0;
> > >> +    }
> > >> +
> > >> +    field = dc->vmsd->fields;
> > >> +    while (field && field->name) {
> > >> +        size += vmstate_size(pci_dev, field);
> > >> +        field++;
> > >> +    }
> > >> +
> > >> +    return size;
> > >> +}
> > >
> > > This function looks incorrect because it ignores subsections as well as
> > > runtime behavior during save(). Although VMStateDescription is partially
> > > declarative, there is still a bunch of imperative code that can write to
> > > the QEMUFile at save() time so there's no way of knowing the size ahead
> > > of time.
> >
> > I see your point, it would be a problem for any field which has the
> > (VMS_BUFFER | VMS_ALLOC) flags set.
> >
> > >
> > > I asked this in a previous revision of this series but I'm not sure if
> > > it was answered: is it really necessary to know the size of the vmstate?
> > > I thought the VFIO migration interface is designed to support
> > > streaming reads/writes. We could choose a fixed size like 64KB and
> > > stream the vmstate in 64KB chunks.
> >
> > The library exposes the migration data to the client as a device BAR with
> > fixed size - the size of which is fixed at boot time, even when using
> > vfu_migration_callbacks_t callbacks.
> >
> > I don’t believe the library supports streaming vmstate/migration-data - see
> > the following comment in migration_region_access() defined in the library:
> >
> > * Does this mean that partial reads are not allowed?
> >
> > Thanos or John,
> >
> >     Could you please clarify this?

libvfio-user does support streaming of migration data, this comment is based on
the VFIO documentation:

    d. Read data_size bytes of data from (region + data_offset) from the
        migration region.

It's not clear to me whether streaming should be allowed, I'd be surprised if
it didn't.

> >
> > Stefan,
> >     We attempted to answer the migration cancellation and vmstate size
> >     questions previously also, in the following email:
> >
> > https://lore.kernel.org/all/F48606B1-15A4-4DD2-9D71-
> 2FCAFC0E671F@oracle.com/
> 
> >  libvfio-user has the vfu_migration_callbacks_t interface that allows the
> >  device to save/load more data regardless of the size of the migration
> >  region. I don't see the issue here since the region doesn't need to be
> >  sized to fit the savevm data?
> 
> The answer didn't make sense to me:
> 
> "In both scenarios at the server end - whether using the migration BAR or
> using callbacks, the migration data is transported to the other end using
> the BAR. As such we need to specify the BAR’s size during initialization.
> 
> In the case of the callbacks, the library translates the BAR access to callbacks."
> 
> The BAR and the migration region within it need a size but my
> understanding is that VFIO migration is designed to stream the device
> state, allowing it to be broken up into multiple reads/writes with
> knowing the device state's size upfront. Here is the description from
> <linux/vfio.h>:
> 
>   * The sequence to be followed while in pre-copy state and stop-and-copy state
>   * is as follows:
>   * a. Read pending_bytes, indicating the start of a new iteration to get device
>   *    data. Repeated read on pending_bytes at this stage should have no side
>   *    effects.
>   *    If pending_bytes == 0, the user application should not iterate to get data
>   *    for that device.
>   *    If pending_bytes > 0, perform the following steps.
>   * b. Read data_offset, indicating that the vendor driver should make data
>   *    available through the data section. The vendor driver should return this
>   *    read operation only after data is available from (region + data_offset)
>   *    to (region + data_offset + data_size).
>   * c. Read data_size, which is the amount of data in bytes available through
>   *    the migration region.
>   *    Read on data_offset and data_size should return the offset and size of
>   *    the current buffer if the user application reads data_offset and
>   *    data_size more than once here.
>   * d. Read data_size bytes of data from (region + data_offset) from the
>   *    migration region.
>   * e. Process the data.
>   * f. Read pending_bytes, which indicates that the data from the previous
>   *    iteration has been read. If pending_bytes > 0, go to step b.
>   *
>   * The user application can transition from the _SAVING|_RUNNING
>   * (pre-copy state) to the _SAVING (stop-and-copy) state regardless of the
>   * number of pending bytes. The user application should iterate in _SAVING
>   * (stop-and-copy) until pending_bytes is 0.
> 
> This means you can report pending_bytes > 0 until the entire vmstate has
> been read and can pick a fixed chunk size like 64KB for the migration
> region. There's no need to size the migration region to fit the entire
> vmstate.
> 
> Stefan

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH v5 03/18] pci: isolated address space for PCI bus
  2022-01-28  9:18                     ` Stefan Hajnoczi
@ 2022-01-31 16:16                       ` Alex Williamson
  2022-02-01  9:30                         ` Stefan Hajnoczi
  0 siblings, 1 reply; 99+ messages in thread
From: Alex Williamson @ 2022-01-31 16:16 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: eduardo, Elena Ufimtseva, Jag Raman, Beraldo Leal, John Johnson,
	Michael S. Tsirkin, qemu-devel, armbru, quintela, Paolo Bonzini,
	Marc-André Lureau, Dr. David Alan Gilbert, thanos.makatos,
	Daniel P. Berrangé,
	Eric Blake, john.levon, Philippe Mathieu-Daudé

On Fri, 28 Jan 2022 09:18:08 +0000
Stefan Hajnoczi <stefanha@redhat.com> wrote:

> On Thu, Jan 27, 2022 at 02:22:53PM -0700, Alex Williamson wrote:
> > If the goal here is to restrict DMA between devices, ie. peer-to-peer
> > (p2p), why are we trying to re-invent what an IOMMU already does?  
> 
> The issue Dave raised is that vfio-user servers run in separate
> processses from QEMU with shared memory access to RAM but no direct
> access to non-RAM MemoryRegions. The virtiofs DAX Window BAR is one
> example of a non-RAM MemoryRegion that can be the source/target of DMA
> requests.
> 
> I don't think IOMMUs solve this problem but luckily the vfio-user
> protocol already has messages that vfio-user servers can use as a
> fallback when DMA cannot be completed through the shared memory RAM
> accesses.
> 
> > In
> > fact, it seems like an IOMMU does this better in providing an IOVA
> > address space per BDF.  Is the dynamic mapping overhead too much?  What
> > physical hardware properties or specifications could we leverage to
> > restrict p2p mappings to a device?  Should it be governed by machine
> > type to provide consistency between devices?  Should each "isolated"
> > bus be in a separate root complex?  Thanks,  
> 
> There is a separate issue in this patch series regarding isolating the
> address space where BAR accesses are made (i.e. the global
> address_space_memory/io). When one process hosts multiple vfio-user
> server instances (e.g. a software-defined network switch with multiple
> ethernet devices) then each instance needs isolated memory and io address
> spaces so that vfio-user clients don't cause collisions when they map
> BARs to the same address.
> 
> I think the the separate root complex idea is a good solution. This
> patch series takes a different approach by adding the concept of
> isolated address spaces into hw/pci/.

This all still seems pretty sketchy, BARs cannot overlap within the
same vCPU address space, perhaps with the exception of when they're
being sized, but DMA should be disabled during sizing.

Devices within the same VM context with identical BARs would need to
operate in different address spaces.  For example a translation offset
in the vCPU address space would allow unique addressing to the devices,
perhaps using the translation offset bits to address a root complex and
masking those bits for downstream transactions.

In general, the device simply operates in an address space, ie. an
IOVA.  When a mapping is made within that address space, we perform a
translation as necessary to generate a guest physical address.  The
IOVA itself is only meaningful within the context of the address space,
there is no requirement or expectation for it to be globally unique.

If the vfio-user server is making some sort of requirement that IOVAs
are unique across all devices, that seems very, very wrong.  Thanks,

Alex



^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH v5 17/18] vfio-user: register handlers to facilitate migration
  2022-01-28  8:29       ` Stefan Hajnoczi
  2022-01-28 14:49         ` Thanos Makatos
@ 2022-02-01  3:49         ` Jag Raman
  2022-02-01  9:37           ` Stefan Hajnoczi
  1 sibling, 1 reply; 99+ messages in thread
From: Jag Raman @ 2022-02-01  3:49 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: eduardo, Elena Ufimtseva, John Johnson, Daniel P. Berrangé,
	Beraldo Leal, John Levon, Markus Armbruster, Michael S. Tsirkin,
	Philippe Mathieu-Daudé,
	qemu-devel, Juan Quintela, Marc-André Lureau, Paolo Bonzini,
	Thanos Makatos, Eric Blake, Dr . David Alan Gilbert



> On Jan 28, 2022, at 3:29 AM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
> 
> On Thu, Jan 27, 2022 at 05:04:26PM +0000, Jag Raman wrote:
>> 
>> 
>>> On Jan 25, 2022, at 10:48 AM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
>>> 
>>> On Wed, Jan 19, 2022 at 04:42:06PM -0500, Jagannathan Raman wrote:
>>>> +     * The client subsequetly asks the remote server for any data that
>>> 
>>> subsequently
>>> 
>>>> +static void vfu_mig_state_running(vfu_ctx_t *vfu_ctx)
>>>> +{
>>>> +    VfuObject *o = vfu_get_private(vfu_ctx);
>>>> +    VfuObjectClass *k = VFU_OBJECT_GET_CLASS(OBJECT(o));
>>>> +    static int migrated_devs;
>>>> +    Error *local_err = NULL;
>>>> +    int ret;
>>>> +
>>>> +    /**
>>>> +     * TODO: move to VFU_MIGR_STATE_RESUME handler. Presently, the
>>>> +     * VMSD data from source is not available at RESUME state.
>>>> +     * Working on a fix for this.
>>>> +     */
>>>> +    if (!o->vfu_mig_file) {
>>>> +        o->vfu_mig_file = qemu_fopen_ops(o, &vfu_mig_fops_load, false);
>>>> +    }
>>>> +
>>>> +    ret = qemu_remote_loadvm(o->vfu_mig_file);
>>>> +    if (ret) {
>>>> +        VFU_OBJECT_ERROR(o, "vfu: failed to restore device state");
>>>> +        return;
>>>> +    }
>>>> +
>>>> +    qemu_file_shutdown(o->vfu_mig_file);
>>>> +    o->vfu_mig_file = NULL;
>>>> +
>>>> +    /* VFU_MIGR_STATE_RUNNING begins here */
>>>> +    if (++migrated_devs == k->nr_devs) {
>>> 
>>> When is this counter reset so migration can be tried again if it
>>> fails/cancels?
>> 
>> Detecting cancellation is a pending item. We will address it in the
>> next rev. Will check with you if  we get stuck during the process
>> of implementing it.
>> 
>>> 
>>>> +static ssize_t vfu_mig_read_data(vfu_ctx_t *vfu_ctx, void *buf,
>>>> +                                 uint64_t size, uint64_t offset)
>>>> +{
>>>> +    VfuObject *o = vfu_get_private(vfu_ctx);
>>>> +
>>>> +    if (offset > o->vfu_mig_buf_size) {
>>>> +        return -1;
>>>> +    }
>>>> +
>>>> +    if ((offset + size) > o->vfu_mig_buf_size) {
>>>> +        warn_report("vfu: buffer overflow - check pending_bytes");
>>>> +        size = o->vfu_mig_buf_size - offset;
>>>> +    }
>>>> +
>>>> +    memcpy(buf, (o->vfu_mig_buf + offset), size);
>>>> +
>>>> +    o->vfu_mig_buf_pending -= size;
>>> 
>>> This assumes that the caller increments offset by size each time. If
>>> that assumption is okay, then we can just trust offset and don't need to
>>> do arithmetic on vfu_mig_buf_pending. If that assumption is not correct,
>>> then the code needs to be extended to safely update vfu_mig_buf_pending
>>> when offset jumps around arbitrarily between calls.
>> 
>> Going by the definition of vfu_migration_callbacks_t in the library, I assumed
>> that read_data advances the offset by size bytes.
>> 
>> Will add a comment a comment to explain that.
>> 
>>> 
>>>> +uint64_t vmstate_vmsd_size(PCIDevice *pci_dev)
>>>> +{
>>>> +    DeviceClass *dc = DEVICE_GET_CLASS(DEVICE(pci_dev));
>>>> +    const VMStateField *field = NULL;
>>>> +    uint64_t size = 0;
>>>> +
>>>> +    if (!dc->vmsd) {
>>>> +        return 0;
>>>> +    }
>>>> +
>>>> +    field = dc->vmsd->fields;
>>>> +    while (field && field->name) {
>>>> +        size += vmstate_size(pci_dev, field);
>>>> +        field++;
>>>> +    }
>>>> +
>>>> +    return size;
>>>> +}
>>> 
>>> This function looks incorrect because it ignores subsections as well as
>>> runtime behavior during save(). Although VMStateDescription is partially
>>> declarative, there is still a bunch of imperative code that can write to
>>> the QEMUFile at save() time so there's no way of knowing the size ahead
>>> of time.
>> 
>> I see your point, it would be a problem for any field which has the
>> (VMS_BUFFER | VMS_ALLOC) flags set.
>> 
>>> 
>>> I asked this in a previous revision of this series but I'm not sure if
>>> it was answered: is it really necessary to know the size of the vmstate?
>>> I thought the VFIO migration interface is designed to support
>>> streaming reads/writes. We could choose a fixed size like 64KB and
>>> stream the vmstate in 64KB chunks.
>> 
>> The library exposes the migration data to the client as a device BAR with
>> fixed size - the size of which is fixed at boot time, even when using
>> vfu_migration_callbacks_t callbacks.
>> 
>> I don’t believe the library supports streaming vmstate/migration-data - see
>> the following comment in migration_region_access() defined in the library:
>> 
>> * Does this mean that partial reads are not allowed?
>> 
>> Thanos or John,
>> 
>>    Could you please clarify this?
>> 
>> Stefan,
>>    We attempted to answer the migration cancellation and vmstate size
>>    questions previously also, in the following email:
>> 
>> https://lore.kernel.org/all/F48606B1-15A4-4DD2-9D71-2FCAFC0E671F@oracle.com/
> 
>> libvfio-user has the vfu_migration_callbacks_t interface that allows the
>> device to save/load more data regardless of the size of the migration
>> region. I don't see the issue here since the region doesn't need to be
>> sized to fit the savevm data?
> 
> The answer didn't make sense to me:
> 
> "In both scenarios at the server end - whether using the migration BAR or
> using callbacks, the migration data is transported to the other end using
> the BAR. As such we need to specify the BAR’s size during initialization.
> 
> In the case of the callbacks, the library translates the BAR access to callbacks."
> 
> The BAR and the migration region within it need a size but my
> understanding is that VFIO migration is designed to stream the device
> state, allowing it to be broken up into multiple reads/writes with
> knowing the device state's size upfront. Here is the description from
> <linux/vfio.h>:
> 
>  * The sequence to be followed while in pre-copy state and stop-and-copy state
>  * is as follows:
>  * a. Read pending_bytes, indicating the start of a new iteration to get device
>  *    data. Repeated read on pending_bytes at this stage should have no side
>  *    effects.
>  *    If pending_bytes == 0, the user application should not iterate to get data
>  *    for that device.
>  *    If pending_bytes > 0, perform the following steps.
>  * b. Read data_offset, indicating that the vendor driver should make data
>  *    available through the data section. The vendor driver should return this
>  *    read operation only after data is available from (region + data_offset)
>  *    to (region + data_offset + data_size).
>  * c. Read data_size, which is the amount of data in bytes available through
>  *    the migration region.
>  *    Read on data_offset and data_size should return the offset and size of
>  *    the current buffer if the user application reads data_offset and
>  *    data_size more than once here.
>  * d. Read data_size bytes of data from (region + data_offset) from the
>  *    migration region.
>  * e. Process the data.
>  * f. Read pending_bytes, which indicates that the data from the previous
>  *    iteration has been read. If pending_bytes > 0, go to step b.
>  *
>  * The user application can transition from the _SAVING|_RUNNING
>  * (pre-copy state) to the _SAVING (stop-and-copy) state regardless of the
>  * number of pending bytes. The user application should iterate in _SAVING
>  * (stop-and-copy) until pending_bytes is 0.
> 
> This means you can report pending_bytes > 0 until the entire vmstate has
> been read and can pick a fixed chunk size like 64KB for the migration
> region. There's no need to size the migration region to fit the entire
> vmstate.

Thank you for the pointer to generic VFIO migration, Stefan! Makes sense.

So I understand that the VFIO migration region carves out a section to
stream/shuttle device data between the app (QEMU client in this case) and the
driver (QEMU server). This section starts at data_offset within the region and spans
data_size bytes.

We could change the server to stream the data as outlined above. Do you have a
preference for the section size? Does qemu_target_page_size() work? I just tested
and am able to stream with a fixed BAR size such as qemu_target_page_size().

Thank you!
--
Jag

> 
> Stefan


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH v5 03/18] pci: isolated address space for PCI bus
  2022-01-31 16:16                       ` Alex Williamson
@ 2022-02-01  9:30                         ` Stefan Hajnoczi
  2022-02-01 15:24                           ` Alex Williamson
  0 siblings, 1 reply; 99+ messages in thread
From: Stefan Hajnoczi @ 2022-02-01  9:30 UTC (permalink / raw)
  To: Alex Williamson
  Cc: eduardo, Elena Ufimtseva, Jag Raman, Beraldo Leal, John Johnson,
	Michael S. Tsirkin, qemu-devel, armbru, quintela, Paolo Bonzini,
	Marc-André Lureau, Dr. David Alan Gilbert, thanos.makatos,
	Daniel P. Berrangé,
	Eric Blake, john.levon, Philippe Mathieu-Daudé

[-- Attachment #1: Type: text/plain, Size: 3956 bytes --]

On Mon, Jan 31, 2022 at 09:16:23AM -0700, Alex Williamson wrote:
> On Fri, 28 Jan 2022 09:18:08 +0000
> Stefan Hajnoczi <stefanha@redhat.com> wrote:
> 
> > On Thu, Jan 27, 2022 at 02:22:53PM -0700, Alex Williamson wrote:
> > > If the goal here is to restrict DMA between devices, ie. peer-to-peer
> > > (p2p), why are we trying to re-invent what an IOMMU already does?  
> > 
> > The issue Dave raised is that vfio-user servers run in separate
> > processses from QEMU with shared memory access to RAM but no direct
> > access to non-RAM MemoryRegions. The virtiofs DAX Window BAR is one
> > example of a non-RAM MemoryRegion that can be the source/target of DMA
> > requests.
> > 
> > I don't think IOMMUs solve this problem but luckily the vfio-user
> > protocol already has messages that vfio-user servers can use as a
> > fallback when DMA cannot be completed through the shared memory RAM
> > accesses.
> > 
> > > In
> > > fact, it seems like an IOMMU does this better in providing an IOVA
> > > address space per BDF.  Is the dynamic mapping overhead too much?  What
> > > physical hardware properties or specifications could we leverage to
> > > restrict p2p mappings to a device?  Should it be governed by machine
> > > type to provide consistency between devices?  Should each "isolated"
> > > bus be in a separate root complex?  Thanks,  
> > 
> > There is a separate issue in this patch series regarding isolating the
> > address space where BAR accesses are made (i.e. the global
> > address_space_memory/io). When one process hosts multiple vfio-user
> > server instances (e.g. a software-defined network switch with multiple
> > ethernet devices) then each instance needs isolated memory and io address
> > spaces so that vfio-user clients don't cause collisions when they map
> > BARs to the same address.
> > 
> > I think the the separate root complex idea is a good solution. This
> > patch series takes a different approach by adding the concept of
> > isolated address spaces into hw/pci/.
> 
> This all still seems pretty sketchy, BARs cannot overlap within the
> same vCPU address space, perhaps with the exception of when they're
> being sized, but DMA should be disabled during sizing.
> 
> Devices within the same VM context with identical BARs would need to
> operate in different address spaces.  For example a translation offset
> in the vCPU address space would allow unique addressing to the devices,
> perhaps using the translation offset bits to address a root complex and
> masking those bits for downstream transactions.
> 
> In general, the device simply operates in an address space, ie. an
> IOVA.  When a mapping is made within that address space, we perform a
> translation as necessary to generate a guest physical address.  The
> IOVA itself is only meaningful within the context of the address space,
> there is no requirement or expectation for it to be globally unique.
> 
> If the vfio-user server is making some sort of requirement that IOVAs
> are unique across all devices, that seems very, very wrong.  Thanks,

Yes, BARs and IOVAs don't need to be unique across all devices.

The issue is that there can be as many guest physical address spaces as
there are vfio-user clients connected, so per-client isolated address
spaces are required. This patch series has a solution to that problem
with the new pci_isol_as_mem/io() API.

What I find strange about this approach is that exported PCI devices are
on PCI root ports that are connected to the machine's main PCI bus. The
PCI devices don't interact with the main bus's IOVA space, guest
physical memory space, or interrupts. It seems hacky to graft isolated
devices onto a parent bus that provides nothing to its children. I
wonder if it would be cleaner for every vfio-user server to have its own
PCIHost. Then it may be possible to drop the new pci_isol_as_mem/io()
API.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH v5 17/18] vfio-user: register handlers to facilitate migration
  2022-02-01  3:49         ` Jag Raman
@ 2022-02-01  9:37           ` Stefan Hajnoczi
  0 siblings, 0 replies; 99+ messages in thread
From: Stefan Hajnoczi @ 2022-02-01  9:37 UTC (permalink / raw)
  To: Jag Raman
  Cc: eduardo, Elena Ufimtseva, John Johnson, Daniel P. Berrangé,
	Beraldo Leal, John Levon, Markus Armbruster, Michael S. Tsirkin,
	Philippe Mathieu-Daudé,
	qemu-devel, Juan Quintela, Marc-André Lureau, Paolo Bonzini,
	Thanos Makatos, Eric Blake, Dr . David Alan Gilbert

[-- Attachment #1: Type: text/plain, Size: 9318 bytes --]

On Tue, Feb 01, 2022 at 03:49:40AM +0000, Jag Raman wrote:
> 
> 
> > On Jan 28, 2022, at 3:29 AM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > 
> > On Thu, Jan 27, 2022 at 05:04:26PM +0000, Jag Raman wrote:
> >> 
> >> 
> >>> On Jan 25, 2022, at 10:48 AM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
> >>> 
> >>> On Wed, Jan 19, 2022 at 04:42:06PM -0500, Jagannathan Raman wrote:
> >>>> +     * The client subsequetly asks the remote server for any data that
> >>> 
> >>> subsequently
> >>> 
> >>>> +static void vfu_mig_state_running(vfu_ctx_t *vfu_ctx)
> >>>> +{
> >>>> +    VfuObject *o = vfu_get_private(vfu_ctx);
> >>>> +    VfuObjectClass *k = VFU_OBJECT_GET_CLASS(OBJECT(o));
> >>>> +    static int migrated_devs;
> >>>> +    Error *local_err = NULL;
> >>>> +    int ret;
> >>>> +
> >>>> +    /**
> >>>> +     * TODO: move to VFU_MIGR_STATE_RESUME handler. Presently, the
> >>>> +     * VMSD data from source is not available at RESUME state.
> >>>> +     * Working on a fix for this.
> >>>> +     */
> >>>> +    if (!o->vfu_mig_file) {
> >>>> +        o->vfu_mig_file = qemu_fopen_ops(o, &vfu_mig_fops_load, false);
> >>>> +    }
> >>>> +
> >>>> +    ret = qemu_remote_loadvm(o->vfu_mig_file);
> >>>> +    if (ret) {
> >>>> +        VFU_OBJECT_ERROR(o, "vfu: failed to restore device state");
> >>>> +        return;
> >>>> +    }
> >>>> +
> >>>> +    qemu_file_shutdown(o->vfu_mig_file);
> >>>> +    o->vfu_mig_file = NULL;
> >>>> +
> >>>> +    /* VFU_MIGR_STATE_RUNNING begins here */
> >>>> +    if (++migrated_devs == k->nr_devs) {
> >>> 
> >>> When is this counter reset so migration can be tried again if it
> >>> fails/cancels?
> >> 
> >> Detecting cancellation is a pending item. We will address it in the
> >> next rev. Will check with you if  we get stuck during the process
> >> of implementing it.
> >> 
> >>> 
> >>>> +static ssize_t vfu_mig_read_data(vfu_ctx_t *vfu_ctx, void *buf,
> >>>> +                                 uint64_t size, uint64_t offset)
> >>>> +{
> >>>> +    VfuObject *o = vfu_get_private(vfu_ctx);
> >>>> +
> >>>> +    if (offset > o->vfu_mig_buf_size) {
> >>>> +        return -1;
> >>>> +    }
> >>>> +
> >>>> +    if ((offset + size) > o->vfu_mig_buf_size) {
> >>>> +        warn_report("vfu: buffer overflow - check pending_bytes");
> >>>> +        size = o->vfu_mig_buf_size - offset;
> >>>> +    }
> >>>> +
> >>>> +    memcpy(buf, (o->vfu_mig_buf + offset), size);
> >>>> +
> >>>> +    o->vfu_mig_buf_pending -= size;
> >>> 
> >>> This assumes that the caller increments offset by size each time. If
> >>> that assumption is okay, then we can just trust offset and don't need to
> >>> do arithmetic on vfu_mig_buf_pending. If that assumption is not correct,
> >>> then the code needs to be extended to safely update vfu_mig_buf_pending
> >>> when offset jumps around arbitrarily between calls.
> >> 
> >> Going by the definition of vfu_migration_callbacks_t in the library, I assumed
> >> that read_data advances the offset by size bytes.
> >> 
> >> Will add a comment a comment to explain that.
> >> 
> >>> 
> >>>> +uint64_t vmstate_vmsd_size(PCIDevice *pci_dev)
> >>>> +{
> >>>> +    DeviceClass *dc = DEVICE_GET_CLASS(DEVICE(pci_dev));
> >>>> +    const VMStateField *field = NULL;
> >>>> +    uint64_t size = 0;
> >>>> +
> >>>> +    if (!dc->vmsd) {
> >>>> +        return 0;
> >>>> +    }
> >>>> +
> >>>> +    field = dc->vmsd->fields;
> >>>> +    while (field && field->name) {
> >>>> +        size += vmstate_size(pci_dev, field);
> >>>> +        field++;
> >>>> +    }
> >>>> +
> >>>> +    return size;
> >>>> +}
> >>> 
> >>> This function looks incorrect because it ignores subsections as well as
> >>> runtime behavior during save(). Although VMStateDescription is partially
> >>> declarative, there is still a bunch of imperative code that can write to
> >>> the QEMUFile at save() time so there's no way of knowing the size ahead
> >>> of time.
> >> 
> >> I see your point, it would be a problem for any field which has the
> >> (VMS_BUFFER | VMS_ALLOC) flags set.
> >> 
> >>> 
> >>> I asked this in a previous revision of this series but I'm not sure if
> >>> it was answered: is it really necessary to know the size of the vmstate?
> >>> I thought the VFIO migration interface is designed to support
> >>> streaming reads/writes. We could choose a fixed size like 64KB and
> >>> stream the vmstate in 64KB chunks.
> >> 
> >> The library exposes the migration data to the client as a device BAR with
> >> fixed size - the size of which is fixed at boot time, even when using
> >> vfu_migration_callbacks_t callbacks.
> >> 
> >> I don’t believe the library supports streaming vmstate/migration-data - see
> >> the following comment in migration_region_access() defined in the library:
> >> 
> >> * Does this mean that partial reads are not allowed?
> >> 
> >> Thanos or John,
> >> 
> >>    Could you please clarify this?
> >> 
> >> Stefan,
> >>    We attempted to answer the migration cancellation and vmstate size
> >>    questions previously also, in the following email:
> >> 
> >> https://lore.kernel.org/all/F48606B1-15A4-4DD2-9D71-2FCAFC0E671F@oracle.com/
> > 
> >> libvfio-user has the vfu_migration_callbacks_t interface that allows the
> >> device to save/load more data regardless of the size of the migration
> >> region. I don't see the issue here since the region doesn't need to be
> >> sized to fit the savevm data?
> > 
> > The answer didn't make sense to me:
> > 
> > "In both scenarios at the server end - whether using the migration BAR or
> > using callbacks, the migration data is transported to the other end using
> > the BAR. As such we need to specify the BAR’s size during initialization.
> > 
> > In the case of the callbacks, the library translates the BAR access to callbacks."
> > 
> > The BAR and the migration region within it need a size but my
> > understanding is that VFIO migration is designed to stream the device
> > state, allowing it to be broken up into multiple reads/writes with
> > knowing the device state's size upfront. Here is the description from
> > <linux/vfio.h>:
> > 
> >  * The sequence to be followed while in pre-copy state and stop-and-copy state
> >  * is as follows:
> >  * a. Read pending_bytes, indicating the start of a new iteration to get device
> >  *    data. Repeated read on pending_bytes at this stage should have no side
> >  *    effects.
> >  *    If pending_bytes == 0, the user application should not iterate to get data
> >  *    for that device.
> >  *    If pending_bytes > 0, perform the following steps.
> >  * b. Read data_offset, indicating that the vendor driver should make data
> >  *    available through the data section. The vendor driver should return this
> >  *    read operation only after data is available from (region + data_offset)
> >  *    to (region + data_offset + data_size).
> >  * c. Read data_size, which is the amount of data in bytes available through
> >  *    the migration region.
> >  *    Read on data_offset and data_size should return the offset and size of
> >  *    the current buffer if the user application reads data_offset and
> >  *    data_size more than once here.
> >  * d. Read data_size bytes of data from (region + data_offset) from the
> >  *    migration region.
> >  * e. Process the data.
> >  * f. Read pending_bytes, which indicates that the data from the previous
> >  *    iteration has been read. If pending_bytes > 0, go to step b.
> >  *
> >  * The user application can transition from the _SAVING|_RUNNING
> >  * (pre-copy state) to the _SAVING (stop-and-copy) state regardless of the
> >  * number of pending bytes. The user application should iterate in _SAVING
> >  * (stop-and-copy) until pending_bytes is 0.
> > 
> > This means you can report pending_bytes > 0 until the entire vmstate has
> > been read and can pick a fixed chunk size like 64KB for the migration
> > region. There's no need to size the migration region to fit the entire
> > vmstate.
> 
> Thank you for the pointer to generic VFIO migration, Stefan! Makes sense.
> 
> So I understand that the VFIO migration region carves out a section to
> stream/shuttle device data between the app (QEMU client in this case) and the
> driver (QEMU server). This section starts at data_offset within the region and spans
> data_size bytes.
> 
> We could change the server to stream the data as outlined above. Do you have a
> preference for the section size? Does qemu_target_page_size() work? I just tested
> and am able to stream with a fixed BAR size such as qemu_target_page_size().

The VFIO migration API requires that data is written in the same chunk
sizes as it was read, so there is no way to merge or split chunks for
performance reasons once they have been read.

4KB may result in lots of chunks and that means more network traffic and
read()/write() calls. I think it's too small.

Something large like 1MB might create issues with responsiveness because
a 1MB chunk hogs the migration stream and read()/write() latency could
hog the event loop.

I'd go for 64KB. Dave and Juan might also have a suggestion for the size.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH v5 03/18] pci: isolated address space for PCI bus
  2022-01-27 21:22                   ` Alex Williamson
  2022-01-28  8:19                     ` Stefan Hajnoczi
  2022-01-28  9:18                     ` Stefan Hajnoczi
@ 2022-02-01 10:42                     ` Dr. David Alan Gilbert
  2 siblings, 0 replies; 99+ messages in thread
From: Dr. David Alan Gilbert @ 2022-02-01 10:42 UTC (permalink / raw)
  To: Alex Williamson
  Cc: eduardo, Elena Ufimtseva, Jag Raman, Beraldo Leal, John Johnson,
	Michael S. Tsirkin, qemu-devel, armbru, quintela, Paolo Bonzini,
	Marc-André Lureau, Stefan Hajnoczi, thanos.makatos,
	Daniel P. Berrangé,
	Eric Blake, john.levon, Philippe Mathieu-Daudé

* Alex Williamson (alex.williamson@redhat.com) wrote:
> On Thu, 27 Jan 2022 08:30:13 +0000
> Stefan Hajnoczi <stefanha@redhat.com> wrote:
> 
> > On Wed, Jan 26, 2022 at 04:13:33PM -0500, Michael S. Tsirkin wrote:
> > > On Wed, Jan 26, 2022 at 08:07:36PM +0000, Dr. David Alan Gilbert wrote:  
> > > > * Stefan Hajnoczi (stefanha@redhat.com) wrote:  
> > > > > On Wed, Jan 26, 2022 at 05:27:32AM +0000, Jag Raman wrote:  
> > > > > > 
> > > > > >   
> > > > > > > On Jan 25, 2022, at 1:38 PM, Dr. David Alan Gilbert <dgilbert@redhat.com> wrote:
> > > > > > > 
> > > > > > > * Jag Raman (jag.raman@oracle.com) wrote:  
> > > > > > >> 
> > > > > > >>   
> > > > > > >>> On Jan 19, 2022, at 7:12 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> > > > > > >>> 
> > > > > > >>> On Wed, Jan 19, 2022 at 04:41:52PM -0500, Jagannathan Raman wrote:  
> > > > > > >>>> Allow PCI buses to be part of isolated CPU address spaces. This has a
> > > > > > >>>> niche usage.
> > > > > > >>>> 
> > > > > > >>>> TYPE_REMOTE_MACHINE allows multiple VMs to house their PCI devices in
> > > > > > >>>> the same machine/server. This would cause address space collision as
> > > > > > >>>> well as be a security vulnerability. Having separate address spaces for
> > > > > > >>>> each PCI bus would solve this problem.  
> > > > > > >>> 
> > > > > > >>> Fascinating, but I am not sure I understand. any examples?  
> > > > > > >> 
> > > > > > >> Hi Michael!
> > > > > > >> 
> > > > > > >> multiprocess QEMU and vfio-user implement a client-server model to allow
> > > > > > >> out-of-process emulation of devices. The client QEMU, which makes ioctls
> > > > > > >> to the kernel and runs VCPUs, could attach devices running in a server
> > > > > > >> QEMU. The server QEMU needs access to parts of the client’s RAM to
> > > > > > >> perform DMA.  
> > > > > > > 
> > > > > > > Do you ever have the opposite problem? i.e. when an emulated PCI device  
> > > > > > 
> > > > > > That’s an interesting question.
> > > > > >   
> > > > > > > exposes a chunk of RAM-like space (frame buffer, or maybe a mapped file)
> > > > > > > that the client can see.  What happens if two emulated devices need to
> > > > > > > access each others emulated address space?  
> > > > > > 
> > > > > > In this case, the kernel driver would map the destination’s chunk of internal RAM into
> > > > > > the DMA space of the source device. Then the source device could write to that
> > > > > > mapped address range, and the IOMMU should direct those writes to the
> > > > > > destination device.
> > > > > > 
> > > > > > I would like to take a closer look at the IOMMU implementation on how to achieve
> > > > > > this, and get back to you. I think the IOMMU would handle this. Could you please
> > > > > > point me to the IOMMU implementation you have in mind?  
> > > > > 
> > > > > I don't know if the current vfio-user client/server patches already
> > > > > implement device-to-device DMA, but the functionality is supported by
> > > > > the vfio-user protocol.
> > > > > 
> > > > > Basically: if the DMA regions lookup inside the vfio-user server fails,
> > > > > fall back to VFIO_USER_DMA_READ/WRITE messages instead.
> > > > > https://github.com/nutanix/libvfio-user/blob/master/docs/vfio-user.rst#vfio-user-dma-read
> > > > > 
> > > > > Here is the flow:
> > > > > 1. The vfio-user server with device A sends a DMA read to QEMU.
> > > > > 2. QEMU finds the MemoryRegion associated with the DMA address and sees
> > > > >    it's a device.
> > > > >    a. If it's emulated inside the QEMU process then the normal
> > > > >       device emulation code kicks in.
> > > > >    b. If it's another vfio-user PCI device then the vfio-user PCI proxy
> > > > >       device forwards the DMA to the second vfio-user server's device B.  
> > > > 
> > > > I'm starting to be curious if there's a way to persuade the guest kernel
> > > > to do it for us; in general is there a way to say to PCI devices that
> > > > they can only DMA to the host and not other PCI devices?  
> > > 
> > > 
> > > But of course - this is how e.g. VFIO protects host PCI devices from
> > > each other when one of them is passed through to a VM.  
> > 
> > Michael: Are you saying just turn on vIOMMU? :)
> > 
> > Devices in different VFIO groups have their own IOMMU context, so their
> > IOVA space is isolated. Just don't map other devices into the IOVA space
> > and those other devices will be inaccessible.
> 
> Devices in different VFIO *containers* have their own IOMMU context.
> Based on the group attachment to a container, groups can either have
> shared or isolated IOVA space.  That determination is made by looking
> at the address space of the bus, which is governed by the presence of a
> vIOMMU.
> 
> If the goal here is to restrict DMA between devices, ie. peer-to-peer
> (p2p), why are we trying to re-invent what an IOMMU already does?

That was what I was curious about - is it possible to get an IOMMU to do
that, and how? (Not knowing much about IOMMUs).
In my DAX/virtiofs case, I want the device to be able to DMA to guest
RAM but for other devices not to try to DMA to it and in particular for
it not to have to DMA to other devices.

>  In
> fact, it seems like an IOMMU does this better in providing an IOVA
> address space per BDF.  Is the dynamic mapping overhead too much?  What
> physical hardware properties or specifications could we leverage to
> restrict p2p mappings to a device?  Should it be governed by machine
> type to provide consistency between devices?  Should each "isolated"
> bus be in a separate root complex?  Thanks,

Dave

> Alex
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH v5 03/18] pci: isolated address space for PCI bus
  2022-02-01  9:30                         ` Stefan Hajnoczi
@ 2022-02-01 15:24                           ` Alex Williamson
  2022-02-01 21:24                             ` Jag Raman
  0 siblings, 1 reply; 99+ messages in thread
From: Alex Williamson @ 2022-02-01 15:24 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: eduardo, Elena Ufimtseva, Jag Raman, Beraldo Leal, John Johnson,
	Michael S. Tsirkin, qemu-devel, armbru, quintela, Paolo Bonzini,
	Marc-André Lureau, Dr. David Alan Gilbert, thanos.makatos,
	Daniel P. Berrangé,
	Eric Blake, john.levon, Philippe Mathieu-Daudé

On Tue, 1 Feb 2022 09:30:35 +0000
Stefan Hajnoczi <stefanha@redhat.com> wrote:

> On Mon, Jan 31, 2022 at 09:16:23AM -0700, Alex Williamson wrote:
> > On Fri, 28 Jan 2022 09:18:08 +0000
> > Stefan Hajnoczi <stefanha@redhat.com> wrote:
> >   
> > > On Thu, Jan 27, 2022 at 02:22:53PM -0700, Alex Williamson wrote:  
> > > > If the goal here is to restrict DMA between devices, ie. peer-to-peer
> > > > (p2p), why are we trying to re-invent what an IOMMU already does?    
> > > 
> > > The issue Dave raised is that vfio-user servers run in separate
> > > processses from QEMU with shared memory access to RAM but no direct
> > > access to non-RAM MemoryRegions. The virtiofs DAX Window BAR is one
> > > example of a non-RAM MemoryRegion that can be the source/target of DMA
> > > requests.
> > > 
> > > I don't think IOMMUs solve this problem but luckily the vfio-user
> > > protocol already has messages that vfio-user servers can use as a
> > > fallback when DMA cannot be completed through the shared memory RAM
> > > accesses.
> > >   
> > > > In
> > > > fact, it seems like an IOMMU does this better in providing an IOVA
> > > > address space per BDF.  Is the dynamic mapping overhead too much?  What
> > > > physical hardware properties or specifications could we leverage to
> > > > restrict p2p mappings to a device?  Should it be governed by machine
> > > > type to provide consistency between devices?  Should each "isolated"
> > > > bus be in a separate root complex?  Thanks,    
> > > 
> > > There is a separate issue in this patch series regarding isolating the
> > > address space where BAR accesses are made (i.e. the global
> > > address_space_memory/io). When one process hosts multiple vfio-user
> > > server instances (e.g. a software-defined network switch with multiple
> > > ethernet devices) then each instance needs isolated memory and io address
> > > spaces so that vfio-user clients don't cause collisions when they map
> > > BARs to the same address.
> > > 
> > > I think the the separate root complex idea is a good solution. This
> > > patch series takes a different approach by adding the concept of
> > > isolated address spaces into hw/pci/.  
> > 
> > This all still seems pretty sketchy, BARs cannot overlap within the
> > same vCPU address space, perhaps with the exception of when they're
> > being sized, but DMA should be disabled during sizing.
> > 
> > Devices within the same VM context with identical BARs would need to
> > operate in different address spaces.  For example a translation offset
> > in the vCPU address space would allow unique addressing to the devices,
> > perhaps using the translation offset bits to address a root complex and
> > masking those bits for downstream transactions.
> > 
> > In general, the device simply operates in an address space, ie. an
> > IOVA.  When a mapping is made within that address space, we perform a
> > translation as necessary to generate a guest physical address.  The
> > IOVA itself is only meaningful within the context of the address space,
> > there is no requirement or expectation for it to be globally unique.
> > 
> > If the vfio-user server is making some sort of requirement that IOVAs
> > are unique across all devices, that seems very, very wrong.  Thanks,  
> 
> Yes, BARs and IOVAs don't need to be unique across all devices.
> 
> The issue is that there can be as many guest physical address spaces as
> there are vfio-user clients connected, so per-client isolated address
> spaces are required. This patch series has a solution to that problem
> with the new pci_isol_as_mem/io() API.

Sorry, this still doesn't follow for me.  A server that hosts multiple
devices across many VMs (I'm not sure if you're referring to the device
or the VM as a client) needs to deal with different address spaces per
device.  The server needs to be able to uniquely identify every DMA,
which must be part of the interface protocol.  But I don't see how that
imposes a requirement of an isolated address space.  If we want the
device isolated because we don't trust the server, that's where an IOMMU
provides per device isolation.  What is the restriction of the
per-client isolated address space and why do we need it?  The server
needing to support multiple clients is not a sufficient answer to
impose new PCI bus types with an implicit restriction on the VM.
 
> What I find strange about this approach is that exported PCI devices are
> on PCI root ports that are connected to the machine's main PCI bus. The
> PCI devices don't interact with the main bus's IOVA space, guest
> physical memory space, or interrupts. It seems hacky to graft isolated
> devices onto a parent bus that provides nothing to its children. I
> wonder if it would be cleaner for every vfio-user server to have its own
> PCIHost. Then it may be possible to drop the new pci_isol_as_mem/io()
> API.

This is getting a bit ridiculous, if vfio-user devices require this
degree of manipulation of the VM topology into things that don't exist
on bare metal, we've done something very wrong.  Thanks,

Alex



^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH v5 03/18] pci: isolated address space for PCI bus
  2022-02-01 15:24                           ` Alex Williamson
@ 2022-02-01 21:24                             ` Jag Raman
  2022-02-01 22:47                               ` Alex Williamson
  0 siblings, 1 reply; 99+ messages in thread
From: Jag Raman @ 2022-02-01 21:24 UTC (permalink / raw)
  To: Alex Williamson
  Cc: eduardo, Elena Ufimtseva, Daniel P. Berrangé,
	Beraldo Leal, John Johnson, Michael S. Tsirkin, qemu-devel,
	armbru, quintela, Marc-André Lureau, Dr. David Alan Gilbert,
	Stefan Hajnoczi, Paolo Bonzini, thanos.makatos, Eric Blake,
	john.levon, Philippe Mathieu-Daudé



> On Feb 1, 2022, at 10:24 AM, Alex Williamson <alex.williamson@redhat.com> wrote:
> 
> On Tue, 1 Feb 2022 09:30:35 +0000
> Stefan Hajnoczi <stefanha@redhat.com> wrote:
> 
>> On Mon, Jan 31, 2022 at 09:16:23AM -0700, Alex Williamson wrote:
>>> On Fri, 28 Jan 2022 09:18:08 +0000
>>> Stefan Hajnoczi <stefanha@redhat.com> wrote:
>>> 
>>>> On Thu, Jan 27, 2022 at 02:22:53PM -0700, Alex Williamson wrote:  
>>>>> If the goal here is to restrict DMA between devices, ie. peer-to-peer
>>>>> (p2p), why are we trying to re-invent what an IOMMU already does?    
>>>> 
>>>> The issue Dave raised is that vfio-user servers run in separate
>>>> processses from QEMU with shared memory access to RAM but no direct
>>>> access to non-RAM MemoryRegions. The virtiofs DAX Window BAR is one
>>>> example of a non-RAM MemoryRegion that can be the source/target of DMA
>>>> requests.
>>>> 
>>>> I don't think IOMMUs solve this problem but luckily the vfio-user
>>>> protocol already has messages that vfio-user servers can use as a
>>>> fallback when DMA cannot be completed through the shared memory RAM
>>>> accesses.
>>>> 
>>>>> In
>>>>> fact, it seems like an IOMMU does this better in providing an IOVA
>>>>> address space per BDF.  Is the dynamic mapping overhead too much?  What
>>>>> physical hardware properties or specifications could we leverage to
>>>>> restrict p2p mappings to a device?  Should it be governed by machine
>>>>> type to provide consistency between devices?  Should each "isolated"
>>>>> bus be in a separate root complex?  Thanks,    
>>>> 
>>>> There is a separate issue in this patch series regarding isolating the
>>>> address space where BAR accesses are made (i.e. the global
>>>> address_space_memory/io). When one process hosts multiple vfio-user
>>>> server instances (e.g. a software-defined network switch with multiple
>>>> ethernet devices) then each instance needs isolated memory and io address
>>>> spaces so that vfio-user clients don't cause collisions when they map
>>>> BARs to the same address.
>>>> 
>>>> I think the the separate root complex idea is a good solution. This
>>>> patch series takes a different approach by adding the concept of
>>>> isolated address spaces into hw/pci/.  
>>> 
>>> This all still seems pretty sketchy, BARs cannot overlap within the
>>> same vCPU address space, perhaps with the exception of when they're
>>> being sized, but DMA should be disabled during sizing.
>>> 
>>> Devices within the same VM context with identical BARs would need to
>>> operate in different address spaces.  For example a translation offset
>>> in the vCPU address space would allow unique addressing to the devices,
>>> perhaps using the translation offset bits to address a root complex and
>>> masking those bits for downstream transactions.
>>> 
>>> In general, the device simply operates in an address space, ie. an
>>> IOVA.  When a mapping is made within that address space, we perform a
>>> translation as necessary to generate a guest physical address.  The
>>> IOVA itself is only meaningful within the context of the address space,
>>> there is no requirement or expectation for it to be globally unique.
>>> 
>>> If the vfio-user server is making some sort of requirement that IOVAs
>>> are unique across all devices, that seems very, very wrong.  Thanks,  
>> 
>> Yes, BARs and IOVAs don't need to be unique across all devices.
>> 
>> The issue is that there can be as many guest physical address spaces as
>> there are vfio-user clients connected, so per-client isolated address
>> spaces are required. This patch series has a solution to that problem
>> with the new pci_isol_as_mem/io() API.
> 
> Sorry, this still doesn't follow for me.  A server that hosts multiple
> devices across many VMs (I'm not sure if you're referring to the device
> or the VM as a client) needs to deal with different address spaces per
> device.  The server needs to be able to uniquely identify every DMA,
> which must be part of the interface protocol.  But I don't see how that
> imposes a requirement of an isolated address space.  If we want the
> device isolated because we don't trust the server, that's where an IOMMU
> provides per device isolation.  What is the restriction of the
> per-client isolated address space and why do we need it?  The server
> needing to support multiple clients is not a sufficient answer to
> impose new PCI bus types with an implicit restriction on the VM.

Hi Alex,

I believe there are two separate problems with running PCI devices in
the vfio-user server. The first one is concerning memory isolation and
second one is vectoring of BAR accesses (as explained below).

In our previous patches (v3), we used an IOMMU to isolate memory
spaces. But we still had trouble with the vectoring. So we implemented
separate address spaces for each PCIBus to tackle both problems
simultaneously, based on the feedback we got.

The following gives an overview of issues concerning vectoring of
BAR accesses.

The device’s BAR regions are mapped into the guest physical address
space. The guest writes the guest PA of each BAR into the device’s BAR
registers. To access the BAR regions of the device, QEMU uses
address_space_rw() which vectors the physical address access to the
device BAR region handlers.

The PCIBus data structure already has address_space_mem and
address_space_io to contain the BAR regions of devices attached
to it. I understand that these two PCIBus members form the
PCI address space.

Typically, the machines map the PCI address space into the system address
space. For example, pc_pci_as_mapping_init() does this for ‘pc' machine types.
As such, there is a 1:1 mapping between system address space and PCI address
space of the root bus. Since all the PCI devices in the machine are assigned to
the same VM, we could map the PCI address space of all PCI buses to the same
system address space.

Whereas in the case of vfio-user, the devices running in the server could
belong to different VMs. Therefore, along with the physical address, we would
need to know the address space that the device belongs for
address_space_rw() to successfully vector BAR accesses into the PCI device.

Thank you!
--
Jag

> 
>> What I find strange about this approach is that exported PCI devices are
>> on PCI root ports that are connected to the machine's main PCI bus. The
>> PCI devices don't interact with the main bus's IOVA space, guest
>> physical memory space, or interrupts. It seems hacky to graft isolated
>> devices onto a parent bus that provides nothing to its children. I
>> wonder if it would be cleaner for every vfio-user server to have its own
>> PCIHost. Then it may be possible to drop the new pci_isol_as_mem/io()
>> API.
> 
> This is getting a bit ridiculous, if vfio-user devices require this
> degree of manipulation of the VM topology into things that don't exist
> on bare metal, we've done something very wrong.  Thanks,
> 
> Alex
> 


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH v5 03/18] pci: isolated address space for PCI bus
  2022-02-01 21:24                             ` Jag Raman
@ 2022-02-01 22:47                               ` Alex Williamson
  2022-02-02  1:13                                 ` Jag Raman
  2022-02-02  9:30                                 ` Peter Maydell
  0 siblings, 2 replies; 99+ messages in thread
From: Alex Williamson @ 2022-02-01 22:47 UTC (permalink / raw)
  To: Jag Raman
  Cc: eduardo, Elena Ufimtseva, Daniel P. Berrangé,
	Beraldo Leal, John Johnson, Michael S. Tsirkin, qemu-devel,
	armbru, quintela, Marc-André Lureau, Dr. David Alan Gilbert,
	Stefan Hajnoczi, Paolo Bonzini, thanos.makatos, Eric Blake,
	john.levon, Philippe Mathieu-Daudé

On Tue, 1 Feb 2022 21:24:08 +0000
Jag Raman <jag.raman@oracle.com> wrote:

> > On Feb 1, 2022, at 10:24 AM, Alex Williamson <alex.williamson@redhat.com> wrote:
> > 
> > On Tue, 1 Feb 2022 09:30:35 +0000
> > Stefan Hajnoczi <stefanha@redhat.com> wrote:
> >   
> >> On Mon, Jan 31, 2022 at 09:16:23AM -0700, Alex Williamson wrote:  
> >>> On Fri, 28 Jan 2022 09:18:08 +0000
> >>> Stefan Hajnoczi <stefanha@redhat.com> wrote:
> >>>   
> >>>> On Thu, Jan 27, 2022 at 02:22:53PM -0700, Alex Williamson wrote:    
> >>>>> If the goal here is to restrict DMA between devices, ie. peer-to-peer
> >>>>> (p2p), why are we trying to re-invent what an IOMMU already does?      
> >>>> 
> >>>> The issue Dave raised is that vfio-user servers run in separate
> >>>> processses from QEMU with shared memory access to RAM but no direct
> >>>> access to non-RAM MemoryRegions. The virtiofs DAX Window BAR is one
> >>>> example of a non-RAM MemoryRegion that can be the source/target of DMA
> >>>> requests.
> >>>> 
> >>>> I don't think IOMMUs solve this problem but luckily the vfio-user
> >>>> protocol already has messages that vfio-user servers can use as a
> >>>> fallback when DMA cannot be completed through the shared memory RAM
> >>>> accesses.
> >>>>   
> >>>>> In
> >>>>> fact, it seems like an IOMMU does this better in providing an IOVA
> >>>>> address space per BDF.  Is the dynamic mapping overhead too much?  What
> >>>>> physical hardware properties or specifications could we leverage to
> >>>>> restrict p2p mappings to a device?  Should it be governed by machine
> >>>>> type to provide consistency between devices?  Should each "isolated"
> >>>>> bus be in a separate root complex?  Thanks,      
> >>>> 
> >>>> There is a separate issue in this patch series regarding isolating the
> >>>> address space where BAR accesses are made (i.e. the global
> >>>> address_space_memory/io). When one process hosts multiple vfio-user
> >>>> server instances (e.g. a software-defined network switch with multiple
> >>>> ethernet devices) then each instance needs isolated memory and io address
> >>>> spaces so that vfio-user clients don't cause collisions when they map
> >>>> BARs to the same address.
> >>>> 
> >>>> I think the the separate root complex idea is a good solution. This
> >>>> patch series takes a different approach by adding the concept of
> >>>> isolated address spaces into hw/pci/.    
> >>> 
> >>> This all still seems pretty sketchy, BARs cannot overlap within the
> >>> same vCPU address space, perhaps with the exception of when they're
> >>> being sized, but DMA should be disabled during sizing.
> >>> 
> >>> Devices within the same VM context with identical BARs would need to
> >>> operate in different address spaces.  For example a translation offset
> >>> in the vCPU address space would allow unique addressing to the devices,
> >>> perhaps using the translation offset bits to address a root complex and
> >>> masking those bits for downstream transactions.
> >>> 
> >>> In general, the device simply operates in an address space, ie. an
> >>> IOVA.  When a mapping is made within that address space, we perform a
> >>> translation as necessary to generate a guest physical address.  The
> >>> IOVA itself is only meaningful within the context of the address space,
> >>> there is no requirement or expectation for it to be globally unique.
> >>> 
> >>> If the vfio-user server is making some sort of requirement that IOVAs
> >>> are unique across all devices, that seems very, very wrong.  Thanks,    
> >> 
> >> Yes, BARs and IOVAs don't need to be unique across all devices.
> >> 
> >> The issue is that there can be as many guest physical address spaces as
> >> there are vfio-user clients connected, so per-client isolated address
> >> spaces are required. This patch series has a solution to that problem
> >> with the new pci_isol_as_mem/io() API.  
> > 
> > Sorry, this still doesn't follow for me.  A server that hosts multiple
> > devices across many VMs (I'm not sure if you're referring to the device
> > or the VM as a client) needs to deal with different address spaces per
> > device.  The server needs to be able to uniquely identify every DMA,
> > which must be part of the interface protocol.  But I don't see how that
> > imposes a requirement of an isolated address space.  If we want the
> > device isolated because we don't trust the server, that's where an IOMMU
> > provides per device isolation.  What is the restriction of the
> > per-client isolated address space and why do we need it?  The server
> > needing to support multiple clients is not a sufficient answer to
> > impose new PCI bus types with an implicit restriction on the VM.  
> 
> Hi Alex,
> 
> I believe there are two separate problems with running PCI devices in
> the vfio-user server. The first one is concerning memory isolation and
> second one is vectoring of BAR accesses (as explained below).
> 
> In our previous patches (v3), we used an IOMMU to isolate memory
> spaces. But we still had trouble with the vectoring. So we implemented
> separate address spaces for each PCIBus to tackle both problems
> simultaneously, based on the feedback we got.
> 
> The following gives an overview of issues concerning vectoring of
> BAR accesses.
> 
> The device’s BAR regions are mapped into the guest physical address
> space. The guest writes the guest PA of each BAR into the device’s BAR
> registers. To access the BAR regions of the device, QEMU uses
> address_space_rw() which vectors the physical address access to the
> device BAR region handlers.

The guest physical address written to the BAR is irrelevant from the
device perspective, this only serves to assign the BAR an offset within
the address_space_mem, which is used by the vCPU (and possibly other
devices depending on their address space).  There is no reason for the
device itself to care about this address.
 
> The PCIBus data structure already has address_space_mem and
> address_space_io to contain the BAR regions of devices attached
> to it. I understand that these two PCIBus members form the
> PCI address space.

These are the CPU address spaces.  When there's no IOMMU, the PCI bus is
identity mapped to the CPU address space.  When there is an IOMMU, the
device address space is determined by the granularity of the IOMMU and
may be entirely separate from address_space_mem.

I/O port space is always the identity mapped CPU address space unless
sparse translations are used to create multiple I/O port spaces (not
implemented).  I/O port space is only accessed by the CPU, there are no
device initiated I/O port transactions, so the address space relative
to the device is irrelevant.

> Typically, the machines map the PCI address space into the system address
> space. For example, pc_pci_as_mapping_init() does this for ‘pc' machine types.
> As such, there is a 1:1 mapping between system address space and PCI address
> space of the root bus. Since all the PCI devices in the machine are assigned to
> the same VM, we could map the PCI address space of all PCI buses to the same
> system address space.

"Typically" only if we're restricted to the "pc", ie. i440FX, machine
type since it doesn't support a vIOMMU.  There's no reason to focus on
the identity map case versus the vIOMMU case.

> Whereas in the case of vfio-user, the devices running in the server could
> belong to different VMs. Therefore, along with the physical address, we would
> need to know the address space that the device belongs for
> address_space_rw() to successfully vector BAR accesses into the PCI device.

But as far as device initiated transactions, there is only one address
space for a given device, it's either address_space_mem or one provided
by the vIOMMU and pci_device_iommu_address_space() tells us that
address space.  Furthermore, the device never operates on a "physical
address", it only ever operates on an IOVA, ie. an offset within the
address space assigned to the device.  The IOVA should be considered
arbitrary relative to mappings in any other address spaces.

Device initiated transactions operate on an IOVA within the (single)
address space to which the device is assigned.  Any attempt to do
otherwise violates the isolation put in place by things like vIOMMUs
and ought to be considered a security concern, especially for a device
serviced by an external process.  Thanks,

Alex



^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH v5 03/18] pci: isolated address space for PCI bus
  2022-02-01 22:47                               ` Alex Williamson
@ 2022-02-02  1:13                                 ` Jag Raman
  2022-02-02  5:34                                   ` Alex Williamson
  2022-02-02  9:30                                 ` Peter Maydell
  1 sibling, 1 reply; 99+ messages in thread
From: Jag Raman @ 2022-02-02  1:13 UTC (permalink / raw)
  To: Alex Williamson
  Cc: eduardo, Elena Ufimtseva, Daniel P. Berrangé,
	Beraldo Leal, John Johnson, Michael S. Tsirkin, qemu-devel,
	armbru, quintela, Marc-André Lureau, Dr. David Alan Gilbert,
	Stefan Hajnoczi, Paolo Bonzini, thanos.makatos, Eric Blake,
	john.levon, Philippe Mathieu-Daudé



> On Feb 1, 2022, at 5:47 PM, Alex Williamson <alex.williamson@redhat.com> wrote:
> 
> On Tue, 1 Feb 2022 21:24:08 +0000
> Jag Raman <jag.raman@oracle.com> wrote:
> 
>>> On Feb 1, 2022, at 10:24 AM, Alex Williamson <alex.williamson@redhat.com> wrote:
>>> 
>>> On Tue, 1 Feb 2022 09:30:35 +0000
>>> Stefan Hajnoczi <stefanha@redhat.com> wrote:
>>> 
>>>> On Mon, Jan 31, 2022 at 09:16:23AM -0700, Alex Williamson wrote:  
>>>>> On Fri, 28 Jan 2022 09:18:08 +0000
>>>>> Stefan Hajnoczi <stefanha@redhat.com> wrote:
>>>>> 
>>>>>> On Thu, Jan 27, 2022 at 02:22:53PM -0700, Alex Williamson wrote:    
>>>>>>> If the goal here is to restrict DMA between devices, ie. peer-to-peer
>>>>>>> (p2p), why are we trying to re-invent what an IOMMU already does?      
>>>>>> 
>>>>>> The issue Dave raised is that vfio-user servers run in separate
>>>>>> processses from QEMU with shared memory access to RAM but no direct
>>>>>> access to non-RAM MemoryRegions. The virtiofs DAX Window BAR is one
>>>>>> example of a non-RAM MemoryRegion that can be the source/target of DMA
>>>>>> requests.
>>>>>> 
>>>>>> I don't think IOMMUs solve this problem but luckily the vfio-user
>>>>>> protocol already has messages that vfio-user servers can use as a
>>>>>> fallback when DMA cannot be completed through the shared memory RAM
>>>>>> accesses.
>>>>>> 
>>>>>>> In
>>>>>>> fact, it seems like an IOMMU does this better in providing an IOVA
>>>>>>> address space per BDF.  Is the dynamic mapping overhead too much?  What
>>>>>>> physical hardware properties or specifications could we leverage to
>>>>>>> restrict p2p mappings to a device?  Should it be governed by machine
>>>>>>> type to provide consistency between devices?  Should each "isolated"
>>>>>>> bus be in a separate root complex?  Thanks,      
>>>>>> 
>>>>>> There is a separate issue in this patch series regarding isolating the
>>>>>> address space where BAR accesses are made (i.e. the global
>>>>>> address_space_memory/io). When one process hosts multiple vfio-user
>>>>>> server instances (e.g. a software-defined network switch with multiple
>>>>>> ethernet devices) then each instance needs isolated memory and io address
>>>>>> spaces so that vfio-user clients don't cause collisions when they map
>>>>>> BARs to the same address.
>>>>>> 
>>>>>> I think the the separate root complex idea is a good solution. This
>>>>>> patch series takes a different approach by adding the concept of
>>>>>> isolated address spaces into hw/pci/.    
>>>>> 
>>>>> This all still seems pretty sketchy, BARs cannot overlap within the
>>>>> same vCPU address space, perhaps with the exception of when they're
>>>>> being sized, but DMA should be disabled during sizing.
>>>>> 
>>>>> Devices within the same VM context with identical BARs would need to
>>>>> operate in different address spaces.  For example a translation offset
>>>>> in the vCPU address space would allow unique addressing to the devices,
>>>>> perhaps using the translation offset bits to address a root complex and
>>>>> masking those bits for downstream transactions.
>>>>> 
>>>>> In general, the device simply operates in an address space, ie. an
>>>>> IOVA.  When a mapping is made within that address space, we perform a
>>>>> translation as necessary to generate a guest physical address.  The
>>>>> IOVA itself is only meaningful within the context of the address space,
>>>>> there is no requirement or expectation for it to be globally unique.
>>>>> 
>>>>> If the vfio-user server is making some sort of requirement that IOVAs
>>>>> are unique across all devices, that seems very, very wrong.  Thanks,    
>>>> 
>>>> Yes, BARs and IOVAs don't need to be unique across all devices.
>>>> 
>>>> The issue is that there can be as many guest physical address spaces as
>>>> there are vfio-user clients connected, so per-client isolated address
>>>> spaces are required. This patch series has a solution to that problem
>>>> with the new pci_isol_as_mem/io() API.  
>>> 
>>> Sorry, this still doesn't follow for me.  A server that hosts multiple
>>> devices across many VMs (I'm not sure if you're referring to the device
>>> or the VM as a client) needs to deal with different address spaces per
>>> device.  The server needs to be able to uniquely identify every DMA,
>>> which must be part of the interface protocol.  But I don't see how that
>>> imposes a requirement of an isolated address space.  If we want the
>>> device isolated because we don't trust the server, that's where an IOMMU
>>> provides per device isolation.  What is the restriction of the
>>> per-client isolated address space and why do we need it?  The server
>>> needing to support multiple clients is not a sufficient answer to
>>> impose new PCI bus types with an implicit restriction on the VM.  
>> 
>> Hi Alex,
>> 
>> I believe there are two separate problems with running PCI devices in
>> the vfio-user server. The first one is concerning memory isolation and
>> second one is vectoring of BAR accesses (as explained below).
>> 
>> In our previous patches (v3), we used an IOMMU to isolate memory
>> spaces. But we still had trouble with the vectoring. So we implemented
>> separate address spaces for each PCIBus to tackle both problems
>> simultaneously, based on the feedback we got.
>> 
>> The following gives an overview of issues concerning vectoring of
>> BAR accesses.
>> 
>> The device’s BAR regions are mapped into the guest physical address
>> space. The guest writes the guest PA of each BAR into the device’s BAR
>> registers. To access the BAR regions of the device, QEMU uses
>> address_space_rw() which vectors the physical address access to the
>> device BAR region handlers.
> 
> The guest physical address written to the BAR is irrelevant from the
> device perspective, this only serves to assign the BAR an offset within
> the address_space_mem, which is used by the vCPU (and possibly other
> devices depending on their address space).  There is no reason for the
> device itself to care about this address.

Thank you for the explanation, Alex!

The confusion at my part is whether we are inside the device already when
the server receives a request to access BAR region of a device. Based on
your explanation, I get that your view is the BAR access request has
propagated into the device already, whereas I was under the impression
that the request is still on the CPU side of the PCI root complex.

Your view makes sense to me - once the BAR access request reaches the
client (on the other side), we could consider that the request has reached
the device.

On a separate note, if devices don’t care about the values in BAR
registers, why do the default PCI config handlers intercept and map
the BAR region into address_space_mem?
(pci_default_write_config() -> pci_update_mappings())

Thank you!
--
Jag

> 
>> The PCIBus data structure already has address_space_mem and
>> address_space_io to contain the BAR regions of devices attached
>> to it. I understand that these two PCIBus members form the
>> PCI address space.
> 
> These are the CPU address spaces.  When there's no IOMMU, the PCI bus is
> identity mapped to the CPU address space.  When there is an IOMMU, the
> device address space is determined by the granularity of the IOMMU and
> may be entirely separate from address_space_mem.
> 
> I/O port space is always the identity mapped CPU address space unless
> sparse translations are used to create multiple I/O port spaces (not
> implemented).  I/O port space is only accessed by the CPU, there are no
> device initiated I/O port transactions, so the address space relative
> to the device is irrelevant.
> 
>> Typically, the machines map the PCI address space into the system address
>> space. For example, pc_pci_as_mapping_init() does this for ‘pc' machine types.
>> As such, there is a 1:1 mapping between system address space and PCI address
>> space of the root bus. Since all the PCI devices in the machine are assigned to
>> the same VM, we could map the PCI address space of all PCI buses to the same
>> system address space.
> 
> "Typically" only if we're restricted to the "pc", ie. i440FX, machine
> type since it doesn't support a vIOMMU.  There's no reason to focus on
> the identity map case versus the vIOMMU case.
> 
>> Whereas in the case of vfio-user, the devices running in the server could
>> belong to different VMs. Therefore, along with the physical address, we would
>> need to know the address space that the device belongs for
>> address_space_rw() to successfully vector BAR accesses into the PCI device.
> 
> But as far as device initiated transactions, there is only one address
> space for a given device, it's either address_space_mem or one provided
> by the vIOMMU and pci_device_iommu_address_space() tells us that
> address space.  Furthermore, the device never operates on a "physical
> address", it only ever operates on an IOVA, ie. an offset within the
> address space assigned to the device.  The IOVA should be considered
> arbitrary relative to mappings in any other address spaces.
> 
> Device initiated transactions operate on an IOVA within the (single)
> address space to which the device is assigned.  Any attempt to do
> otherwise violates the isolation put in place by things like vIOMMUs
> and ought to be considered a security concern, especially for a device
> serviced by an external process.  Thanks,
> 
> Alex
> 


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH v5 03/18] pci: isolated address space for PCI bus
  2022-02-02  1:13                                 ` Jag Raman
@ 2022-02-02  5:34                                   ` Alex Williamson
  2022-02-02  9:22                                     ` Stefan Hajnoczi
  2022-02-10  0:08                                     ` Jag Raman
  0 siblings, 2 replies; 99+ messages in thread
From: Alex Williamson @ 2022-02-02  5:34 UTC (permalink / raw)
  To: Jag Raman
  Cc: eduardo, Elena Ufimtseva, Daniel P. Berrangé,
	Beraldo Leal, John Johnson, Michael S. Tsirkin, qemu-devel,
	armbru, quintela, Marc-André Lureau, Dr. David Alan Gilbert,
	Stefan Hajnoczi, Paolo Bonzini, thanos.makatos, Eric Blake,
	john.levon, Philippe Mathieu-Daudé

On Wed, 2 Feb 2022 01:13:22 +0000
Jag Raman <jag.raman@oracle.com> wrote:

> > On Feb 1, 2022, at 5:47 PM, Alex Williamson <alex.williamson@redhat.com> wrote:
> > 
> > On Tue, 1 Feb 2022 21:24:08 +0000
> > Jag Raman <jag.raman@oracle.com> wrote:
> >   
> >>> On Feb 1, 2022, at 10:24 AM, Alex Williamson <alex.williamson@redhat.com> wrote:
> >>> 
> >>> On Tue, 1 Feb 2022 09:30:35 +0000
> >>> Stefan Hajnoczi <stefanha@redhat.com> wrote:
> >>>   
> >>>> On Mon, Jan 31, 2022 at 09:16:23AM -0700, Alex Williamson wrote:    
> >>>>> On Fri, 28 Jan 2022 09:18:08 +0000
> >>>>> Stefan Hajnoczi <stefanha@redhat.com> wrote:
> >>>>>   
> >>>>>> On Thu, Jan 27, 2022 at 02:22:53PM -0700, Alex Williamson wrote:      
> >>>>>>> If the goal here is to restrict DMA between devices, ie. peer-to-peer
> >>>>>>> (p2p), why are we trying to re-invent what an IOMMU already does?        
> >>>>>> 
> >>>>>> The issue Dave raised is that vfio-user servers run in separate
> >>>>>> processses from QEMU with shared memory access to RAM but no direct
> >>>>>> access to non-RAM MemoryRegions. The virtiofs DAX Window BAR is one
> >>>>>> example of a non-RAM MemoryRegion that can be the source/target of DMA
> >>>>>> requests.
> >>>>>> 
> >>>>>> I don't think IOMMUs solve this problem but luckily the vfio-user
> >>>>>> protocol already has messages that vfio-user servers can use as a
> >>>>>> fallback when DMA cannot be completed through the shared memory RAM
> >>>>>> accesses.
> >>>>>>   
> >>>>>>> In
> >>>>>>> fact, it seems like an IOMMU does this better in providing an IOVA
> >>>>>>> address space per BDF.  Is the dynamic mapping overhead too much?  What
> >>>>>>> physical hardware properties or specifications could we leverage to
> >>>>>>> restrict p2p mappings to a device?  Should it be governed by machine
> >>>>>>> type to provide consistency between devices?  Should each "isolated"
> >>>>>>> bus be in a separate root complex?  Thanks,        
> >>>>>> 
> >>>>>> There is a separate issue in this patch series regarding isolating the
> >>>>>> address space where BAR accesses are made (i.e. the global
> >>>>>> address_space_memory/io). When one process hosts multiple vfio-user
> >>>>>> server instances (e.g. a software-defined network switch with multiple
> >>>>>> ethernet devices) then each instance needs isolated memory and io address
> >>>>>> spaces so that vfio-user clients don't cause collisions when they map
> >>>>>> BARs to the same address.
> >>>>>> 
> >>>>>> I think the the separate root complex idea is a good solution. This
> >>>>>> patch series takes a different approach by adding the concept of
> >>>>>> isolated address spaces into hw/pci/.      
> >>>>> 
> >>>>> This all still seems pretty sketchy, BARs cannot overlap within the
> >>>>> same vCPU address space, perhaps with the exception of when they're
> >>>>> being sized, but DMA should be disabled during sizing.
> >>>>> 
> >>>>> Devices within the same VM context with identical BARs would need to
> >>>>> operate in different address spaces.  For example a translation offset
> >>>>> in the vCPU address space would allow unique addressing to the devices,
> >>>>> perhaps using the translation offset bits to address a root complex and
> >>>>> masking those bits for downstream transactions.
> >>>>> 
> >>>>> In general, the device simply operates in an address space, ie. an
> >>>>> IOVA.  When a mapping is made within that address space, we perform a
> >>>>> translation as necessary to generate a guest physical address.  The
> >>>>> IOVA itself is only meaningful within the context of the address space,
> >>>>> there is no requirement or expectation for it to be globally unique.
> >>>>> 
> >>>>> If the vfio-user server is making some sort of requirement that IOVAs
> >>>>> are unique across all devices, that seems very, very wrong.  Thanks,      
> >>>> 
> >>>> Yes, BARs and IOVAs don't need to be unique across all devices.
> >>>> 
> >>>> The issue is that there can be as many guest physical address spaces as
> >>>> there are vfio-user clients connected, so per-client isolated address
> >>>> spaces are required. This patch series has a solution to that problem
> >>>> with the new pci_isol_as_mem/io() API.    
> >>> 
> >>> Sorry, this still doesn't follow for me.  A server that hosts multiple
> >>> devices across many VMs (I'm not sure if you're referring to the device
> >>> or the VM as a client) needs to deal with different address spaces per
> >>> device.  The server needs to be able to uniquely identify every DMA,
> >>> which must be part of the interface protocol.  But I don't see how that
> >>> imposes a requirement of an isolated address space.  If we want the
> >>> device isolated because we don't trust the server, that's where an IOMMU
> >>> provides per device isolation.  What is the restriction of the
> >>> per-client isolated address space and why do we need it?  The server
> >>> needing to support multiple clients is not a sufficient answer to
> >>> impose new PCI bus types with an implicit restriction on the VM.    
> >> 
> >> Hi Alex,
> >> 
> >> I believe there are two separate problems with running PCI devices in
> >> the vfio-user server. The first one is concerning memory isolation and
> >> second one is vectoring of BAR accesses (as explained below).
> >> 
> >> In our previous patches (v3), we used an IOMMU to isolate memory
> >> spaces. But we still had trouble with the vectoring. So we implemented
> >> separate address spaces for each PCIBus to tackle both problems
> >> simultaneously, based on the feedback we got.
> >> 
> >> The following gives an overview of issues concerning vectoring of
> >> BAR accesses.
> >> 
> >> The device’s BAR regions are mapped into the guest physical address
> >> space. The guest writes the guest PA of each BAR into the device’s BAR
> >> registers. To access the BAR regions of the device, QEMU uses
> >> address_space_rw() which vectors the physical address access to the
> >> device BAR region handlers.  
> > 
> > The guest physical address written to the BAR is irrelevant from the
> > device perspective, this only serves to assign the BAR an offset within
> > the address_space_mem, which is used by the vCPU (and possibly other
> > devices depending on their address space).  There is no reason for the
> > device itself to care about this address.  
> 
> Thank you for the explanation, Alex!
> 
> The confusion at my part is whether we are inside the device already when
> the server receives a request to access BAR region of a device. Based on
> your explanation, I get that your view is the BAR access request has
> propagated into the device already, whereas I was under the impression
> that the request is still on the CPU side of the PCI root complex.

If you are getting an access through your MemoryRegionOps, all the
translations have been made, you simply need to use the hwaddr as the
offset into the MemoryRegion for the access.  Perform the read/write to
your device, no further translations required.
 
> Your view makes sense to me - once the BAR access request reaches the
> client (on the other side), we could consider that the request has reached
> the device.
> 
> On a separate note, if devices don’t care about the values in BAR
> registers, why do the default PCI config handlers intercept and map
> the BAR region into address_space_mem?
> (pci_default_write_config() -> pci_update_mappings())

This is the part that's actually placing the BAR MemoryRegion as a
sub-region into the vCPU address space.  I think if you track it,
you'll see PCIDevice.io_regions[i].address_space is actually
system_memory, which is used to initialize address_space_system.

The machine assembles PCI devices onto buses as instructed by the
command line or hot plug operations.  It's the responsibility of the
guest firmware and guest OS to probe those devices, size the BARs, and
place the BARs into the memory hierarchy of the PCI bus, ie. system
memory.  The BARs are necessarily in the "guest physical memory" for
vCPU access, but it's essentially only coincidental that PCI devices
might be in an address space that provides a mapping to their own BAR.
There's no reason to ever use it.

In the vIOMMU case, we can't know that the device address space
includes those BAR mappings or if they do, that they're identity mapped
to the physical address.  Devices really need to not infer anything
about an address.  Think about real hardware, a device is told by
driver programming to perform a DMA operation.  The device doesn't know
the target of that operation, it's the guest driver's responsibility to
make sure the IOVA within the device address space is valid and maps to
the desired target.  Thanks,

Alex



^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH v5 03/18] pci: isolated address space for PCI bus
  2022-02-02  5:34                                   ` Alex Williamson
@ 2022-02-02  9:22                                     ` Stefan Hajnoczi
  2022-02-10  0:08                                     ` Jag Raman
  1 sibling, 0 replies; 99+ messages in thread
From: Stefan Hajnoczi @ 2022-02-02  9:22 UTC (permalink / raw)
  To: Alex Williamson
  Cc: eduardo, Elena Ufimtseva, Jag Raman, Beraldo Leal, John Johnson,
	Michael S. Tsirkin, qemu-devel, armbru, quintela, Paolo Bonzini,
	Marc-André Lureau, Dr. David Alan Gilbert, thanos.makatos,
	Daniel P. Berrangé,
	Eric Blake, john.levon, Philippe Mathieu-Daudé

[-- Attachment #1: Type: text/plain, Size: 9651 bytes --]

On Tue, Feb 01, 2022 at 10:34:32PM -0700, Alex Williamson wrote:
> On Wed, 2 Feb 2022 01:13:22 +0000
> Jag Raman <jag.raman@oracle.com> wrote:
> 
> > > On Feb 1, 2022, at 5:47 PM, Alex Williamson <alex.williamson@redhat.com> wrote:
> > > 
> > > On Tue, 1 Feb 2022 21:24:08 +0000
> > > Jag Raman <jag.raman@oracle.com> wrote:
> > >   
> > >>> On Feb 1, 2022, at 10:24 AM, Alex Williamson <alex.williamson@redhat.com> wrote:
> > >>> 
> > >>> On Tue, 1 Feb 2022 09:30:35 +0000
> > >>> Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > >>>   
> > >>>> On Mon, Jan 31, 2022 at 09:16:23AM -0700, Alex Williamson wrote:    
> > >>>>> On Fri, 28 Jan 2022 09:18:08 +0000
> > >>>>> Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > >>>>>   
> > >>>>>> On Thu, Jan 27, 2022 at 02:22:53PM -0700, Alex Williamson wrote:      
> > >>>>>>> If the goal here is to restrict DMA between devices, ie. peer-to-peer
> > >>>>>>> (p2p), why are we trying to re-invent what an IOMMU already does?        
> > >>>>>> 
> > >>>>>> The issue Dave raised is that vfio-user servers run in separate
> > >>>>>> processses from QEMU with shared memory access to RAM but no direct
> > >>>>>> access to non-RAM MemoryRegions. The virtiofs DAX Window BAR is one
> > >>>>>> example of a non-RAM MemoryRegion that can be the source/target of DMA
> > >>>>>> requests.
> > >>>>>> 
> > >>>>>> I don't think IOMMUs solve this problem but luckily the vfio-user
> > >>>>>> protocol already has messages that vfio-user servers can use as a
> > >>>>>> fallback when DMA cannot be completed through the shared memory RAM
> > >>>>>> accesses.
> > >>>>>>   
> > >>>>>>> In
> > >>>>>>> fact, it seems like an IOMMU does this better in providing an IOVA
> > >>>>>>> address space per BDF.  Is the dynamic mapping overhead too much?  What
> > >>>>>>> physical hardware properties or specifications could we leverage to
> > >>>>>>> restrict p2p mappings to a device?  Should it be governed by machine
> > >>>>>>> type to provide consistency between devices?  Should each "isolated"
> > >>>>>>> bus be in a separate root complex?  Thanks,        
> > >>>>>> 
> > >>>>>> There is a separate issue in this patch series regarding isolating the
> > >>>>>> address space where BAR accesses are made (i.e. the global
> > >>>>>> address_space_memory/io). When one process hosts multiple vfio-user
> > >>>>>> server instances (e.g. a software-defined network switch with multiple
> > >>>>>> ethernet devices) then each instance needs isolated memory and io address
> > >>>>>> spaces so that vfio-user clients don't cause collisions when they map
> > >>>>>> BARs to the same address.
> > >>>>>> 
> > >>>>>> I think the the separate root complex idea is a good solution. This
> > >>>>>> patch series takes a different approach by adding the concept of
> > >>>>>> isolated address spaces into hw/pci/.      
> > >>>>> 
> > >>>>> This all still seems pretty sketchy, BARs cannot overlap within the
> > >>>>> same vCPU address space, perhaps with the exception of when they're
> > >>>>> being sized, but DMA should be disabled during sizing.
> > >>>>> 
> > >>>>> Devices within the same VM context with identical BARs would need to
> > >>>>> operate in different address spaces.  For example a translation offset
> > >>>>> in the vCPU address space would allow unique addressing to the devices,
> > >>>>> perhaps using the translation offset bits to address a root complex and
> > >>>>> masking those bits for downstream transactions.
> > >>>>> 
> > >>>>> In general, the device simply operates in an address space, ie. an
> > >>>>> IOVA.  When a mapping is made within that address space, we perform a
> > >>>>> translation as necessary to generate a guest physical address.  The
> > >>>>> IOVA itself is only meaningful within the context of the address space,
> > >>>>> there is no requirement or expectation for it to be globally unique.
> > >>>>> 
> > >>>>> If the vfio-user server is making some sort of requirement that IOVAs
> > >>>>> are unique across all devices, that seems very, very wrong.  Thanks,      
> > >>>> 
> > >>>> Yes, BARs and IOVAs don't need to be unique across all devices.
> > >>>> 
> > >>>> The issue is that there can be as many guest physical address spaces as
> > >>>> there are vfio-user clients connected, so per-client isolated address
> > >>>> spaces are required. This patch series has a solution to that problem
> > >>>> with the new pci_isol_as_mem/io() API.    
> > >>> 
> > >>> Sorry, this still doesn't follow for me.  A server that hosts multiple
> > >>> devices across many VMs (I'm not sure if you're referring to the device
> > >>> or the VM as a client) needs to deal with different address spaces per
> > >>> device.  The server needs to be able to uniquely identify every DMA,
> > >>> which must be part of the interface protocol.  But I don't see how that
> > >>> imposes a requirement of an isolated address space.  If we want the
> > >>> device isolated because we don't trust the server, that's where an IOMMU
> > >>> provides per device isolation.  What is the restriction of the
> > >>> per-client isolated address space and why do we need it?  The server
> > >>> needing to support multiple clients is not a sufficient answer to
> > >>> impose new PCI bus types with an implicit restriction on the VM.    
> > >> 
> > >> Hi Alex,
> > >> 
> > >> I believe there are two separate problems with running PCI devices in
> > >> the vfio-user server. The first one is concerning memory isolation and
> > >> second one is vectoring of BAR accesses (as explained below).
> > >> 
> > >> In our previous patches (v3), we used an IOMMU to isolate memory
> > >> spaces. But we still had trouble with the vectoring. So we implemented
> > >> separate address spaces for each PCIBus to tackle both problems
> > >> simultaneously, based on the feedback we got.
> > >> 
> > >> The following gives an overview of issues concerning vectoring of
> > >> BAR accesses.
> > >> 
> > >> The device’s BAR regions are mapped into the guest physical address
> > >> space. The guest writes the guest PA of each BAR into the device’s BAR
> > >> registers. To access the BAR regions of the device, QEMU uses
> > >> address_space_rw() which vectors the physical address access to the
> > >> device BAR region handlers.  
> > > 
> > > The guest physical address written to the BAR is irrelevant from the
> > > device perspective, this only serves to assign the BAR an offset within
> > > the address_space_mem, which is used by the vCPU (and possibly other
> > > devices depending on their address space).  There is no reason for the
> > > device itself to care about this address.  
> > 
> > Thank you for the explanation, Alex!
> > 
> > The confusion at my part is whether we are inside the device already when
> > the server receives a request to access BAR region of a device. Based on
> > your explanation, I get that your view is the BAR access request has
> > propagated into the device already, whereas I was under the impression
> > that the request is still on the CPU side of the PCI root complex.
> 
> If you are getting an access through your MemoryRegionOps, all the
> translations have been made, you simply need to use the hwaddr as the
> offset into the MemoryRegion for the access.  Perform the read/write to
> your device, no further translations required.

The access comes via libvfio-user's vfu_region_access_cb_t callback, not
via MemoryRegionOps. The callback is currently implemented by calling
address_space_rw() on the pci_isol_as_mem/io() address space, depending
on the BAR type. The code in "[PATCH v5 15/18] vfio-user: handle PCI BAR
accesses".

It's possible to reimplement the patch to directly call
memory_region_dispatch_read/write() on r->memory instead of
address_space_rw() as you've described.

> > Your view makes sense to me - once the BAR access request reaches the
> > client (on the other side), we could consider that the request has reached
> > the device.
> > 
> > On a separate note, if devices don’t care about the values in BAR
> > registers, why do the default PCI config handlers intercept and map
> > the BAR region into address_space_mem?
> > (pci_default_write_config() -> pci_update_mappings())
> 
> This is the part that's actually placing the BAR MemoryRegion as a
> sub-region into the vCPU address space.  I think if you track it,
> you'll see PCIDevice.io_regions[i].address_space is actually
> system_memory, which is used to initialize address_space_system.
> 
> The machine assembles PCI devices onto buses as instructed by the
> command line or hot plug operations.  It's the responsibility of the
> guest firmware and guest OS to probe those devices, size the BARs, and
> place the BARs into the memory hierarchy of the PCI bus, ie. system
> memory.  The BARs are necessarily in the "guest physical memory" for
> vCPU access, but it's essentially only coincidental that PCI devices
> might be in an address space that provides a mapping to their own BAR.
> There's no reason to ever use it.

Good, I think nothing uses address_space_system/io when BAR dispatch is
implemented with memory_region_dispatch_read/write() as you suggested.

It would be nice if there was a way to poison address_space_system/io to
abort on dispatch - nothing should use them.

We now have the option of dropping pci_isol_as_mem/io() again and using
->iommu_fn() to return an isolated memory address space containing the
vfio-user client's VFIO_USER_DMA_MAP regions.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH v5 03/18] pci: isolated address space for PCI bus
  2022-02-01 22:47                               ` Alex Williamson
  2022-02-02  1:13                                 ` Jag Raman
@ 2022-02-02  9:30                                 ` Peter Maydell
  2022-02-02 10:06                                   ` Michael S. Tsirkin
  2022-02-02 17:12                                   ` Alex Williamson
  1 sibling, 2 replies; 99+ messages in thread
From: Peter Maydell @ 2022-02-02  9:30 UTC (permalink / raw)
  To: Alex Williamson
  Cc: eduardo, Elena Ufimtseva, Jag Raman, Beraldo Leal, john.levon,
	John Johnson, Michael S. Tsirkin, qemu-devel, armbru, quintela,
	thanos.makatos, Marc-André Lureau, Stefan Hajnoczi,
	Paolo Bonzini, Daniel P. Berrangé,
	Eric Blake, Dr. David Alan Gilbert, Philippe Mathieu-Daudé

On Tue, 1 Feb 2022 at 23:51, Alex Williamson <alex.williamson@redhat.com> wrote:
>
> On Tue, 1 Feb 2022 21:24:08 +0000
> Jag Raman <jag.raman@oracle.com> wrote:
> > The PCIBus data structure already has address_space_mem and
> > address_space_io to contain the BAR regions of devices attached
> > to it. I understand that these two PCIBus members form the
> > PCI address space.
>
> These are the CPU address spaces.  When there's no IOMMU, the PCI bus is
> identity mapped to the CPU address space.  When there is an IOMMU, the
> device address space is determined by the granularity of the IOMMU and
> may be entirely separate from address_space_mem.

Note that those fields in PCIBus are just whatever MemoryRegions
the pci controller model passed in to the call to pci_root_bus_init()
or equivalent. They may or may not be specifically the CPU's view
of anything. (For instance on the versatilepb board, the PCI controller
is visible to the CPU via several MMIO "windows" at known addresses,
which let the CPU access into the PCI address space at a programmable
offset. We model that by creating a couple of container MRs which
we pass to pci_root_bus_init() to be the PCI memory and IO spaces,
and then using alias MRs to provide the view into those at the
guest-programmed offset. The CPU sees those windows, and doesn't
have direct access to the whole PCIBus::address_space_mem.)
I guess you could say they're the PCI controller's view of the PCI
address space ?

We have a tendency to be a bit sloppy with use of AddressSpaces
within QEMU where it happens that the view of the world that a
DMA-capable device matches that of the CPU, but conceptually
they can definitely be different, especially in the non-x86 world.
(Linux also confuses matters here by preferring to program a 1:1
mapping even if the hardware is more flexible and can do other things.
The model of the h/w in QEMU should support the other cases too, not
just 1:1.)

> I/O port space is always the identity mapped CPU address space unless
> sparse translations are used to create multiple I/O port spaces (not
> implemented).  I/O port space is only accessed by the CPU, there are no
> device initiated I/O port transactions, so the address space relative
> to the device is irrelevant.

Does the PCI spec actually forbid any master except the CPU from
issuing I/O port transactions, or is it just that in practice nobody
makes a PCI device that does weird stuff like that ?

thanks
-- PMM


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH v5 03/18] pci: isolated address space for PCI bus
  2022-02-02  9:30                                 ` Peter Maydell
@ 2022-02-02 10:06                                   ` Michael S. Tsirkin
  2022-02-02 15:49                                     ` Alex Williamson
  2022-02-02 17:12                                   ` Alex Williamson
  1 sibling, 1 reply; 99+ messages in thread
From: Michael S. Tsirkin @ 2022-02-02 10:06 UTC (permalink / raw)
  To: Peter Maydell
  Cc: eduardo, Elena Ufimtseva, Jag Raman, Beraldo Leal, john.levon,
	John Johnson, quintela, qemu-devel, armbru, Alex Williamson,
	thanos.makatos, Marc-André Lureau, Stefan Hajnoczi,
	Paolo Bonzini, Daniel P. Berrangé,
	Eric Blake, Dr. David Alan Gilbert, Philippe Mathieu-Daudé

On Wed, Feb 02, 2022 at 09:30:42AM +0000, Peter Maydell wrote:
> > I/O port space is always the identity mapped CPU address space unless
> > sparse translations are used to create multiple I/O port spaces (not
> > implemented).  I/O port space is only accessed by the CPU, there are no
> > device initiated I/O port transactions, so the address space relative
> > to the device is irrelevant.
> 
> Does the PCI spec actually forbid any master except the CPU from
> issuing I/O port transactions, or is it just that in practice nobody
> makes a PCI device that does weird stuff like that ?
> 
> thanks
> -- PMM

Hmm, the only thing vaguely related in the spec that I know of is this:

	PCI Express supports I/O Space for compatibility with legacy devices which require their use.
	Future revisions of this specification may deprecate the use of I/O Space.

Alex, what did you refer to?

-- 
MST



^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH v5 03/18] pci: isolated address space for PCI bus
  2022-02-02 10:06                                   ` Michael S. Tsirkin
@ 2022-02-02 15:49                                     ` Alex Williamson
  2022-02-02 16:53                                       ` Michael S. Tsirkin
  0 siblings, 1 reply; 99+ messages in thread
From: Alex Williamson @ 2022-02-02 15:49 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: eduardo, Peter Maydell, Jag Raman, Beraldo Leal, john.levon,
	John Johnson, quintela, qemu-devel, Elena Ufimtseva, armbru,
	thanos.makatos, Marc-André Lureau, Stefan Hajnoczi,
	Paolo Bonzini, Daniel P. Berrangé ,
	Eric Blake, Dr. David Alan Gilbert, Philippe Mathieu-Daudé

On Wed, 2 Feb 2022 05:06:49 -0500
"Michael S. Tsirkin" <mst@redhat.com> wrote:

> On Wed, Feb 02, 2022 at 09:30:42AM +0000, Peter Maydell wrote:
> > > I/O port space is always the identity mapped CPU address space unless
> > > sparse translations are used to create multiple I/O port spaces (not
> > > implemented).  I/O port space is only accessed by the CPU, there are no
> > > device initiated I/O port transactions, so the address space relative
> > > to the device is irrelevant.  
> > 
> > Does the PCI spec actually forbid any master except the CPU from
> > issuing I/O port transactions, or is it just that in practice nobody
> > makes a PCI device that does weird stuff like that ?
> > 
> > thanks
> > -- PMM  
> 
> Hmm, the only thing vaguely related in the spec that I know of is this:
> 
> 	PCI Express supports I/O Space for compatibility with legacy devices which require their use.
> 	Future revisions of this specification may deprecate the use of I/O Space.
> 
> Alex, what did you refer to?

My evidence is largely by omission, but that might be that in practice
it's not used rather than explicitly forbidden.  I note that the bus
master enable bit specifies:

	Bus Master Enable - Controls the ability of a Function to issue
		Memory and I/O Read/Write Requests, and the ability of
		a Port to forward Memory and I/O Read/Write Requests in
		the Upstream direction.

That would suggest it's possible, but for PCI device assignment, I'm
not aware of any means through which we could support this.  There is
no support in the IOMMU core for mapping I/O port space, nor could we
trap such device initiated transactions to emulate them.  I can't spot
any mention of I/O port space in the VT-d spec, however the AMD-Vi spec
does include a field in the device table:

	controlIoCtl: port I/O control. Specifies whether
	device-initiated port I/O space transactions are blocked,
	forwarded, or translated.

	00b=Device-initiated port I/O is not allowed. The IOMMU target
	aborts the transaction if a port I/O space transaction is
	received. Translation requests are target aborted.
	
	01b=Device-initiated port I/O space transactions are allowed.
	The IOMMU must pass port I/O accesses untranslated. Translation
	requests are target aborted.
	
	10b=Transactions in the port I/O space address range are
	translated by the IOMMU page tables as memory transactions.

	11b=Reserved.

I don't see this field among the macros used by the Linux driver in
configuring these device entries, so I assume it's left to the default
value, ie. zero, blocking device initiated I/O port transactions.

So yes, I suppose device initiated I/O port transactions are possible,
but we have no support or reason to support them, so I'm going to go
ahead and continue believing any I/O port address space from the device
perspective is largely irrelevant ;)  Thanks,

Alex



^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH v5 03/18] pci: isolated address space for PCI bus
  2022-02-02 15:49                                     ` Alex Williamson
@ 2022-02-02 16:53                                       ` Michael S. Tsirkin
  0 siblings, 0 replies; 99+ messages in thread
From: Michael S. Tsirkin @ 2022-02-02 16:53 UTC (permalink / raw)
  To: Alex Williamson
  Cc: eduardo, Peter Maydell, Jag Raman, Beraldo Leal, john.levon,
	John Johnson, quintela, qemu-devel, Elena Ufimtseva, armbru,
	thanos.makatos, Marc-André Lureau, Stefan Hajnoczi,
	Paolo Bonzini, Daniel P. Berrangé,
	Eric Blake, Dr. David Alan Gilbert, Philippe Mathieu-Daudé

On Wed, Feb 02, 2022 at 08:49:33AM -0700, Alex Williamson wrote:
> > Alex, what did you refer to?
> 
> My evidence is largely by omission, but that might be that in practice
> it's not used rather than explicitly forbidden.  I note that the bus
> master enable bit specifies:
> 
> 	Bus Master Enable - Controls the ability of a Function to issue
> 		Memory and I/O Read/Write Requests, and the ability of
> 		a Port to forward Memory and I/O Read/Write Requests in
> 		the Upstream direction.
> 
> That would suggest it's possible, but for PCI device assignment, I'm
> not aware of any means through which we could support this.  There is
> no support in the IOMMU core for mapping I/O port space, nor could we
> trap such device initiated transactions to emulate them.  I can't spot
> any mention of I/O port space in the VT-d spec, however the AMD-Vi spec
> does include a field in the device table:
> 
> 	controlIoCtl: port I/O control. Specifies whether
> 	device-initiated port I/O space transactions are blocked,
> 	forwarded, or translated.
> 
> 	00b=Device-initiated port I/O is not allowed. The IOMMU target
> 	aborts the transaction if a port I/O space transaction is
> 	received. Translation requests are target aborted.
> 	
> 	01b=Device-initiated port I/O space transactions are allowed.
> 	The IOMMU must pass port I/O accesses untranslated. Translation
> 	requests are target aborted.
> 	
> 	10b=Transactions in the port I/O space address range are
> 	translated by the IOMMU page tables as memory transactions.
> 
> 	11b=Reserved.
> 
> I don't see this field among the macros used by the Linux driver in
> configuring these device entries, so I assume it's left to the default
> value, ie. zero, blocking device initiated I/O port transactions.
> 
> So yes, I suppose device initiated I/O port transactions are possible,
> but we have no support or reason to support them, so I'm going to go
> ahead and continue believing any I/O port address space from the device
> perspective is largely irrelevant ;)  Thanks,
> 
> Alex

Right, it would seem devices can initiate I/O space transactions but IOMMUs
don't support virtualizing them and so neither does VFIO.


-- 
MST



^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH v5 03/18] pci: isolated address space for PCI bus
  2022-02-02  9:30                                 ` Peter Maydell
  2022-02-02 10:06                                   ` Michael S. Tsirkin
@ 2022-02-02 17:12                                   ` Alex Williamson
  1 sibling, 0 replies; 99+ messages in thread
From: Alex Williamson @ 2022-02-02 17:12 UTC (permalink / raw)
  To: Peter Maydell
  Cc: eduardo, Elena Ufimtseva, Jag Raman, Beraldo Leal, john.levon,
	John Johnson, Michael S. Tsirkin, qemu-devel, armbru, quintela,
	thanos.makatos, Marc-André Lureau, Stefan Hajnoczi,
	Paolo Bonzini, Daniel P. Berrangé,
	Eric Blake, Dr. David Alan Gilbert, Philippe Mathieu-Daudé

On Wed, 2 Feb 2022 09:30:42 +0000
Peter Maydell <peter.maydell@linaro.org> wrote:

> On Tue, 1 Feb 2022 at 23:51, Alex Williamson <alex.williamson@redhat.com> wrote:
> >
> > On Tue, 1 Feb 2022 21:24:08 +0000
> > Jag Raman <jag.raman@oracle.com> wrote:  
> > > The PCIBus data structure already has address_space_mem and
> > > address_space_io to contain the BAR regions of devices attached
> > > to it. I understand that these two PCIBus members form the
> > > PCI address space.  
> >
> > These are the CPU address spaces.  When there's no IOMMU, the PCI bus is
> > identity mapped to the CPU address space.  When there is an IOMMU, the
> > device address space is determined by the granularity of the IOMMU and
> > may be entirely separate from address_space_mem.  
> 
> Note that those fields in PCIBus are just whatever MemoryRegions
> the pci controller model passed in to the call to pci_root_bus_init()
> or equivalent. They may or may not be specifically the CPU's view
> of anything. (For instance on the versatilepb board, the PCI controller
> is visible to the CPU via several MMIO "windows" at known addresses,
> which let the CPU access into the PCI address space at a programmable
> offset. We model that by creating a couple of container MRs which
> we pass to pci_root_bus_init() to be the PCI memory and IO spaces,
> and then using alias MRs to provide the view into those at the
> guest-programmed offset. The CPU sees those windows, and doesn't
> have direct access to the whole PCIBus::address_space_mem.)
> I guess you could say they're the PCI controller's view of the PCI
> address space ?

Sure, that's fair.

> We have a tendency to be a bit sloppy with use of AddressSpaces
> within QEMU where it happens that the view of the world that a
> DMA-capable device matches that of the CPU, but conceptually
> they can definitely be different, especially in the non-x86 world.
> (Linux also confuses matters here by preferring to program a 1:1
> mapping even if the hardware is more flexible and can do other things.
> The model of the h/w in QEMU should support the other cases too, not
> just 1:1.)

Right, this is why I prefer to look at the device address space as
simply an IOVA.  The IOVA might be a direct physical address or
coincidental identity mapped physical address via an IOMMU, but none of
that should be the concern of the device.
 
> > I/O port space is always the identity mapped CPU address space unless
> > sparse translations are used to create multiple I/O port spaces (not
> > implemented).  I/O port space is only accessed by the CPU, there are no
> > device initiated I/O port transactions, so the address space relative
> > to the device is irrelevant.  
> 
> Does the PCI spec actually forbid any master except the CPU from
> issuing I/O port transactions, or is it just that in practice nobody
> makes a PCI device that does weird stuff like that ?

As realized in reply to MST, more the latter.  Not used, no point to
enabling, no means to enable depending on the physical IOMMU
implementation.  Thanks,

Alex



^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH v5 03/18] pci: isolated address space for PCI bus
  2022-02-02  5:34                                   ` Alex Williamson
  2022-02-02  9:22                                     ` Stefan Hajnoczi
@ 2022-02-10  0:08                                     ` Jag Raman
  2022-02-10  8:02                                       ` Michael S. Tsirkin
  1 sibling, 1 reply; 99+ messages in thread
From: Jag Raman @ 2022-02-10  0:08 UTC (permalink / raw)
  To: Alex Williamson
  Cc: eduardo, Elena Ufimtseva, Daniel P. Berrangé,
	Beraldo Leal, John Johnson, Michael S. Tsirkin, qemu-devel,
	armbru, quintela, Marc-André Lureau, Dr. David Alan Gilbert,
	Stefan Hajnoczi, Paolo Bonzini, thanos.makatos, Eric Blake,
	john.levon, Philippe Mathieu-Daudé



> On Feb 2, 2022, at 12:34 AM, Alex Williamson <alex.williamson@redhat.com> wrote:
> 
> On Wed, 2 Feb 2022 01:13:22 +0000
> Jag Raman <jag.raman@oracle.com> wrote:
> 
>>> On Feb 1, 2022, at 5:47 PM, Alex Williamson <alex.williamson@redhat.com> wrote:
>>> 
>>> On Tue, 1 Feb 2022 21:24:08 +0000
>>> Jag Raman <jag.raman@oracle.com> wrote:
>>> 
>>>>> On Feb 1, 2022, at 10:24 AM, Alex Williamson <alex.williamson@redhat.com> wrote:
>>>>> 
>>>>> On Tue, 1 Feb 2022 09:30:35 +0000
>>>>> Stefan Hajnoczi <stefanha@redhat.com> wrote:
>>>>> 
>>>>>> On Mon, Jan 31, 2022 at 09:16:23AM -0700, Alex Williamson wrote:    
>>>>>>> On Fri, 28 Jan 2022 09:18:08 +0000
>>>>>>> Stefan Hajnoczi <stefanha@redhat.com> wrote:
>>>>>>> 
>>>>>>>> On Thu, Jan 27, 2022 at 02:22:53PM -0700, Alex Williamson wrote:      
>>>>>>>>> If the goal here is to restrict DMA between devices, ie. peer-to-peer
>>>>>>>>> (p2p), why are we trying to re-invent what an IOMMU already does?        
>>>>>>>> 
>>>>>>>> The issue Dave raised is that vfio-user servers run in separate
>>>>>>>> processses from QEMU with shared memory access to RAM but no direct
>>>>>>>> access to non-RAM MemoryRegions. The virtiofs DAX Window BAR is one
>>>>>>>> example of a non-RAM MemoryRegion that can be the source/target of DMA
>>>>>>>> requests.
>>>>>>>> 
>>>>>>>> I don't think IOMMUs solve this problem but luckily the vfio-user
>>>>>>>> protocol already has messages that vfio-user servers can use as a
>>>>>>>> fallback when DMA cannot be completed through the shared memory RAM
>>>>>>>> accesses.
>>>>>>>> 
>>>>>>>>> In
>>>>>>>>> fact, it seems like an IOMMU does this better in providing an IOVA
>>>>>>>>> address space per BDF.  Is the dynamic mapping overhead too much?  What
>>>>>>>>> physical hardware properties or specifications could we leverage to
>>>>>>>>> restrict p2p mappings to a device?  Should it be governed by machine
>>>>>>>>> type to provide consistency between devices?  Should each "isolated"
>>>>>>>>> bus be in a separate root complex?  Thanks,        
>>>>>>>> 
>>>>>>>> There is a separate issue in this patch series regarding isolating the
>>>>>>>> address space where BAR accesses are made (i.e. the global
>>>>>>>> address_space_memory/io). When one process hosts multiple vfio-user
>>>>>>>> server instances (e.g. a software-defined network switch with multiple
>>>>>>>> ethernet devices) then each instance needs isolated memory and io address
>>>>>>>> spaces so that vfio-user clients don't cause collisions when they map
>>>>>>>> BARs to the same address.
>>>>>>>> 
>>>>>>>> I think the the separate root complex idea is a good solution. This
>>>>>>>> patch series takes a different approach by adding the concept of
>>>>>>>> isolated address spaces into hw/pci/.      
>>>>>>> 
>>>>>>> This all still seems pretty sketchy, BARs cannot overlap within the
>>>>>>> same vCPU address space, perhaps with the exception of when they're
>>>>>>> being sized, but DMA should be disabled during sizing.
>>>>>>> 
>>>>>>> Devices within the same VM context with identical BARs would need to
>>>>>>> operate in different address spaces.  For example a translation offset
>>>>>>> in the vCPU address space would allow unique addressing to the devices,
>>>>>>> perhaps using the translation offset bits to address a root complex and
>>>>>>> masking those bits for downstream transactions.
>>>>>>> 
>>>>>>> In general, the device simply operates in an address space, ie. an
>>>>>>> IOVA.  When a mapping is made within that address space, we perform a
>>>>>>> translation as necessary to generate a guest physical address.  The
>>>>>>> IOVA itself is only meaningful within the context of the address space,
>>>>>>> there is no requirement or expectation for it to be globally unique.
>>>>>>> 
>>>>>>> If the vfio-user server is making some sort of requirement that IOVAs
>>>>>>> are unique across all devices, that seems very, very wrong.  Thanks,      
>>>>>> 
>>>>>> Yes, BARs and IOVAs don't need to be unique across all devices.
>>>>>> 
>>>>>> The issue is that there can be as many guest physical address spaces as
>>>>>> there are vfio-user clients connected, so per-client isolated address
>>>>>> spaces are required. This patch series has a solution to that problem
>>>>>> with the new pci_isol_as_mem/io() API.    
>>>>> 
>>>>> Sorry, this still doesn't follow for me.  A server that hosts multiple
>>>>> devices across many VMs (I'm not sure if you're referring to the device
>>>>> or the VM as a client) needs to deal with different address spaces per
>>>>> device.  The server needs to be able to uniquely identify every DMA,
>>>>> which must be part of the interface protocol.  But I don't see how that
>>>>> imposes a requirement of an isolated address space.  If we want the
>>>>> device isolated because we don't trust the server, that's where an IOMMU
>>>>> provides per device isolation.  What is the restriction of the
>>>>> per-client isolated address space and why do we need it?  The server
>>>>> needing to support multiple clients is not a sufficient answer to
>>>>> impose new PCI bus types with an implicit restriction on the VM.    
>>>> 
>>>> Hi Alex,
>>>> 
>>>> I believe there are two separate problems with running PCI devices in
>>>> the vfio-user server. The first one is concerning memory isolation and
>>>> second one is vectoring of BAR accesses (as explained below).
>>>> 
>>>> In our previous patches (v3), we used an IOMMU to isolate memory
>>>> spaces. But we still had trouble with the vectoring. So we implemented
>>>> separate address spaces for each PCIBus to tackle both problems
>>>> simultaneously, based on the feedback we got.
>>>> 
>>>> The following gives an overview of issues concerning vectoring of
>>>> BAR accesses.
>>>> 
>>>> The device’s BAR regions are mapped into the guest physical address
>>>> space. The guest writes the guest PA of each BAR into the device’s BAR
>>>> registers. To access the BAR regions of the device, QEMU uses
>>>> address_space_rw() which vectors the physical address access to the
>>>> device BAR region handlers.  
>>> 
>>> The guest physical address written to the BAR is irrelevant from the
>>> device perspective, this only serves to assign the BAR an offset within
>>> the address_space_mem, which is used by the vCPU (and possibly other
>>> devices depending on their address space).  There is no reason for the
>>> device itself to care about this address.  
>> 
>> Thank you for the explanation, Alex!
>> 
>> The confusion at my part is whether we are inside the device already when
>> the server receives a request to access BAR region of a device. Based on
>> your explanation, I get that your view is the BAR access request has
>> propagated into the device already, whereas I was under the impression
>> that the request is still on the CPU side of the PCI root complex.
> 
> If you are getting an access through your MemoryRegionOps, all the
> translations have been made, you simply need to use the hwaddr as the
> offset into the MemoryRegion for the access.  Perform the read/write to
> your device, no further translations required.
> 
>> Your view makes sense to me - once the BAR access request reaches the
>> client (on the other side), we could consider that the request has reached
>> the device.
>> 
>> On a separate note, if devices don’t care about the values in BAR
>> registers, why do the default PCI config handlers intercept and map
>> the BAR region into address_space_mem?
>> (pci_default_write_config() -> pci_update_mappings())
> 
> This is the part that's actually placing the BAR MemoryRegion as a
> sub-region into the vCPU address space.  I think if you track it,
> you'll see PCIDevice.io_regions[i].address_space is actually
> system_memory, which is used to initialize address_space_system.
> 
> The machine assembles PCI devices onto buses as instructed by the
> command line or hot plug operations.  It's the responsibility of the
> guest firmware and guest OS to probe those devices, size the BARs, and
> place the BARs into the memory hierarchy of the PCI bus, ie. system
> memory.  The BARs are necessarily in the "guest physical memory" for
> vCPU access, but it's essentially only coincidental that PCI devices
> might be in an address space that provides a mapping to their own BAR.
> There's no reason to ever use it.
> 
> In the vIOMMU case, we can't know that the device address space
> includes those BAR mappings or if they do, that they're identity mapped
> to the physical address.  Devices really need to not infer anything
> about an address.  Think about real hardware, a device is told by
> driver programming to perform a DMA operation.  The device doesn't know
> the target of that operation, it's the guest driver's responsibility to
> make sure the IOVA within the device address space is valid and maps to
> the desired target.  Thanks,

Thanks for the explanation, Alex. Thanks to everyone else in the thread who
helped to clarify this problem.

We have implemented the memory isolation based on the discussion in the
thread. We will send the patches out shortly.

Devices such as “name" and “e1000” worked fine. But I’d like to note that
the LSI device (TYPE_LSI53C895A) had some problems - it doesn’t seem
to be IOMMU aware. In LSI’s case, the kernel driver is asking the device to
read instructions from the CPU VA (lsi_execute_script() -> read_dword()),
which is forbidden when IOMMU is enabled. Specifically, the driver is asking
the device to access other BAR regions by using the BAR address programmed
in the PCI config space. This happens even without vfio-user patches. For example,
we could enable IOMMU using “-device intel-iommu” QEMU option and also
adding the following to the kernel command-line: “intel_iommu=on iommu=nopt”.
In this case, we could see an IOMMU fault.

Unfortunately, we started off our project with the LSI device. So that lead to all the
confusion about what is expected at the server end in-terms of
vectoring/address-translation. It gave an impression as if the request was still on
the CPU side of the PCI root complex, but the actual problem was with the
device driver itself.

I’m wondering how to deal with this problem. Would it be OK if we mapped the
device’s BAR into the IOVA, at the same CPU VA programmed in the BAR registers?
This would help devices such as LSI to circumvent this problem. One problem
with this approach is that it has the potential to collide with another legitimate
IOVA address. Kindly share your thought on this.

Thank you!
--
Jag

> 
> Alex
> 


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH v5 03/18] pci: isolated address space for PCI bus
  2022-02-10  0:08                                     ` Jag Raman
@ 2022-02-10  8:02                                       ` Michael S. Tsirkin
  2022-02-10 22:23                                         ` Jag Raman
  0 siblings, 1 reply; 99+ messages in thread
From: Michael S. Tsirkin @ 2022-02-10  8:02 UTC (permalink / raw)
  To: Jag Raman
  Cc: eduardo, Elena Ufimtseva, Daniel P. Berrangé,
	Beraldo Leal, John Johnson, quintela, qemu-devel, armbru,
	Alex Williamson, Marc-André Lureau, Dr. David Alan Gilbert,
	Stefan Hajnoczi, Paolo Bonzini, thanos.makatos, Eric Blake,
	john.levon, Philippe Mathieu-Daudé

On Thu, Feb 10, 2022 at 12:08:27AM +0000, Jag Raman wrote:
> 
> 
> > On Feb 2, 2022, at 12:34 AM, Alex Williamson <alex.williamson@redhat.com> wrote:
> > 
> > On Wed, 2 Feb 2022 01:13:22 +0000
> > Jag Raman <jag.raman@oracle.com> wrote:
> > 
> >>> On Feb 1, 2022, at 5:47 PM, Alex Williamson <alex.williamson@redhat.com> wrote:
> >>> 
> >>> On Tue, 1 Feb 2022 21:24:08 +0000
> >>> Jag Raman <jag.raman@oracle.com> wrote:
> >>> 
> >>>>> On Feb 1, 2022, at 10:24 AM, Alex Williamson <alex.williamson@redhat.com> wrote:
> >>>>> 
> >>>>> On Tue, 1 Feb 2022 09:30:35 +0000
> >>>>> Stefan Hajnoczi <stefanha@redhat.com> wrote:
> >>>>> 
> >>>>>> On Mon, Jan 31, 2022 at 09:16:23AM -0700, Alex Williamson wrote:    
> >>>>>>> On Fri, 28 Jan 2022 09:18:08 +0000
> >>>>>>> Stefan Hajnoczi <stefanha@redhat.com> wrote:
> >>>>>>> 
> >>>>>>>> On Thu, Jan 27, 2022 at 02:22:53PM -0700, Alex Williamson wrote:      
> >>>>>>>>> If the goal here is to restrict DMA between devices, ie. peer-to-peer
> >>>>>>>>> (p2p), why are we trying to re-invent what an IOMMU already does?        
> >>>>>>>> 
> >>>>>>>> The issue Dave raised is that vfio-user servers run in separate
> >>>>>>>> processses from QEMU with shared memory access to RAM but no direct
> >>>>>>>> access to non-RAM MemoryRegions. The virtiofs DAX Window BAR is one
> >>>>>>>> example of a non-RAM MemoryRegion that can be the source/target of DMA
> >>>>>>>> requests.
> >>>>>>>> 
> >>>>>>>> I don't think IOMMUs solve this problem but luckily the vfio-user
> >>>>>>>> protocol already has messages that vfio-user servers can use as a
> >>>>>>>> fallback when DMA cannot be completed through the shared memory RAM
> >>>>>>>> accesses.
> >>>>>>>> 
> >>>>>>>>> In
> >>>>>>>>> fact, it seems like an IOMMU does this better in providing an IOVA
> >>>>>>>>> address space per BDF.  Is the dynamic mapping overhead too much?  What
> >>>>>>>>> physical hardware properties or specifications could we leverage to
> >>>>>>>>> restrict p2p mappings to a device?  Should it be governed by machine
> >>>>>>>>> type to provide consistency between devices?  Should each "isolated"
> >>>>>>>>> bus be in a separate root complex?  Thanks,        
> >>>>>>>> 
> >>>>>>>> There is a separate issue in this patch series regarding isolating the
> >>>>>>>> address space where BAR accesses are made (i.e. the global
> >>>>>>>> address_space_memory/io). When one process hosts multiple vfio-user
> >>>>>>>> server instances (e.g. a software-defined network switch with multiple
> >>>>>>>> ethernet devices) then each instance needs isolated memory and io address
> >>>>>>>> spaces so that vfio-user clients don't cause collisions when they map
> >>>>>>>> BARs to the same address.
> >>>>>>>> 
> >>>>>>>> I think the the separate root complex idea is a good solution. This
> >>>>>>>> patch series takes a different approach by adding the concept of
> >>>>>>>> isolated address spaces into hw/pci/.      
> >>>>>>> 
> >>>>>>> This all still seems pretty sketchy, BARs cannot overlap within the
> >>>>>>> same vCPU address space, perhaps with the exception of when they're
> >>>>>>> being sized, but DMA should be disabled during sizing.
> >>>>>>> 
> >>>>>>> Devices within the same VM context with identical BARs would need to
> >>>>>>> operate in different address spaces.  For example a translation offset
> >>>>>>> in the vCPU address space would allow unique addressing to the devices,
> >>>>>>> perhaps using the translation offset bits to address a root complex and
> >>>>>>> masking those bits for downstream transactions.
> >>>>>>> 
> >>>>>>> In general, the device simply operates in an address space, ie. an
> >>>>>>> IOVA.  When a mapping is made within that address space, we perform a
> >>>>>>> translation as necessary to generate a guest physical address.  The
> >>>>>>> IOVA itself is only meaningful within the context of the address space,
> >>>>>>> there is no requirement or expectation for it to be globally unique.
> >>>>>>> 
> >>>>>>> If the vfio-user server is making some sort of requirement that IOVAs
> >>>>>>> are unique across all devices, that seems very, very wrong.  Thanks,      
> >>>>>> 
> >>>>>> Yes, BARs and IOVAs don't need to be unique across all devices.
> >>>>>> 
> >>>>>> The issue is that there can be as many guest physical address spaces as
> >>>>>> there are vfio-user clients connected, so per-client isolated address
> >>>>>> spaces are required. This patch series has a solution to that problem
> >>>>>> with the new pci_isol_as_mem/io() API.    
> >>>>> 
> >>>>> Sorry, this still doesn't follow for me.  A server that hosts multiple
> >>>>> devices across many VMs (I'm not sure if you're referring to the device
> >>>>> or the VM as a client) needs to deal with different address spaces per
> >>>>> device.  The server needs to be able to uniquely identify every DMA,
> >>>>> which must be part of the interface protocol.  But I don't see how that
> >>>>> imposes a requirement of an isolated address space.  If we want the
> >>>>> device isolated because we don't trust the server, that's where an IOMMU
> >>>>> provides per device isolation.  What is the restriction of the
> >>>>> per-client isolated address space and why do we need it?  The server
> >>>>> needing to support multiple clients is not a sufficient answer to
> >>>>> impose new PCI bus types with an implicit restriction on the VM.    
> >>>> 
> >>>> Hi Alex,
> >>>> 
> >>>> I believe there are two separate problems with running PCI devices in
> >>>> the vfio-user server. The first one is concerning memory isolation and
> >>>> second one is vectoring of BAR accesses (as explained below).
> >>>> 
> >>>> In our previous patches (v3), we used an IOMMU to isolate memory
> >>>> spaces. But we still had trouble with the vectoring. So we implemented
> >>>> separate address spaces for each PCIBus to tackle both problems
> >>>> simultaneously, based on the feedback we got.
> >>>> 
> >>>> The following gives an overview of issues concerning vectoring of
> >>>> BAR accesses.
> >>>> 
> >>>> The device’s BAR regions are mapped into the guest physical address
> >>>> space. The guest writes the guest PA of each BAR into the device’s BAR
> >>>> registers. To access the BAR regions of the device, QEMU uses
> >>>> address_space_rw() which vectors the physical address access to the
> >>>> device BAR region handlers.  
> >>> 
> >>> The guest physical address written to the BAR is irrelevant from the
> >>> device perspective, this only serves to assign the BAR an offset within
> >>> the address_space_mem, which is used by the vCPU (and possibly other
> >>> devices depending on their address space).  There is no reason for the
> >>> device itself to care about this address.  
> >> 
> >> Thank you for the explanation, Alex!
> >> 
> >> The confusion at my part is whether we are inside the device already when
> >> the server receives a request to access BAR region of a device. Based on
> >> your explanation, I get that your view is the BAR access request has
> >> propagated into the device already, whereas I was under the impression
> >> that the request is still on the CPU side of the PCI root complex.
> > 
> > If you are getting an access through your MemoryRegionOps, all the
> > translations have been made, you simply need to use the hwaddr as the
> > offset into the MemoryRegion for the access.  Perform the read/write to
> > your device, no further translations required.
> > 
> >> Your view makes sense to me - once the BAR access request reaches the
> >> client (on the other side), we could consider that the request has reached
> >> the device.
> >> 
> >> On a separate note, if devices don’t care about the values in BAR
> >> registers, why do the default PCI config handlers intercept and map
> >> the BAR region into address_space_mem?
> >> (pci_default_write_config() -> pci_update_mappings())
> > 
> > This is the part that's actually placing the BAR MemoryRegion as a
> > sub-region into the vCPU address space.  I think if you track it,
> > you'll see PCIDevice.io_regions[i].address_space is actually
> > system_memory, which is used to initialize address_space_system.
> > 
> > The machine assembles PCI devices onto buses as instructed by the
> > command line or hot plug operations.  It's the responsibility of the
> > guest firmware and guest OS to probe those devices, size the BARs, and
> > place the BARs into the memory hierarchy of the PCI bus, ie. system
> > memory.  The BARs are necessarily in the "guest physical memory" for
> > vCPU access, but it's essentially only coincidental that PCI devices
> > might be in an address space that provides a mapping to their own BAR.
> > There's no reason to ever use it.
> > 
> > In the vIOMMU case, we can't know that the device address space
> > includes those BAR mappings or if they do, that they're identity mapped
> > to the physical address.  Devices really need to not infer anything
> > about an address.  Think about real hardware, a device is told by
> > driver programming to perform a DMA operation.  The device doesn't know
> > the target of that operation, it's the guest driver's responsibility to
> > make sure the IOVA within the device address space is valid and maps to
> > the desired target.  Thanks,
> 
> Thanks for the explanation, Alex. Thanks to everyone else in the thread who
> helped to clarify this problem.
> 
> We have implemented the memory isolation based on the discussion in the
> thread. We will send the patches out shortly.
> 
> Devices such as “name" and “e1000” worked fine. But I’d like to note that
> the LSI device (TYPE_LSI53C895A) had some problems - it doesn’t seem
> to be IOMMU aware. In LSI’s case, the kernel driver is asking the device to
> read instructions from the CPU VA (lsi_execute_script() -> read_dword()),
> which is forbidden when IOMMU is enabled. Specifically, the driver is asking
> the device to access other BAR regions by using the BAR address programmed
> in the PCI config space. This happens even without vfio-user patches. For example,
> we could enable IOMMU using “-device intel-iommu” QEMU option and also
> adding the following to the kernel command-line: “intel_iommu=on iommu=nopt”.
> In this case, we could see an IOMMU fault.

So, device accessing its own BAR is different. Basically, these
transactions never go on the bus at all, never mind get to the IOMMU.
I think it's just used as a handle to address internal device memory.
This kind of trick is not universal, but not terribly unusual.


> Unfortunately, we started off our project with the LSI device. So that lead to all the
> confusion about what is expected at the server end in-terms of
> vectoring/address-translation. It gave an impression as if the request was still on
> the CPU side of the PCI root complex, but the actual problem was with the
> device driver itself.
> 
> I’m wondering how to deal with this problem. Would it be OK if we mapped the
> device’s BAR into the IOVA, at the same CPU VA programmed in the BAR registers?
> This would help devices such as LSI to circumvent this problem. One problem
> with this approach is that it has the potential to collide with another legitimate
> IOVA address. Kindly share your thought on this.
> 
> Thank you!

I am not 100% sure what do you plan to do but it sounds fine since even
if it collides, with traditional PCI device must never initiate cycles
within their own BAR range, and PCIe is software-compatible with PCI. So
devices won't be able to access this IOVA even if it was programmed in
the IOMMU.

As was mentioned elsewhere on this thread, devices accessing each
other's BAR is a different matter.

I do not remember which rules apply to multiple functions of a
multi-function device though. I think in a traditional PCI
they will never go out on the bus, but with e.g. SRIOV they
would probably do go out? Alex, any idea?


> --
> Jag
> 
> > 
> > Alex
> > 
> 



^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH v5 03/18] pci: isolated address space for PCI bus
  2022-02-10  8:02                                       ` Michael S. Tsirkin
@ 2022-02-10 22:23                                         ` Jag Raman
  2022-02-10 22:53                                           ` Michael S. Tsirkin
  2022-02-10 23:17                                           ` Alex Williamson
  0 siblings, 2 replies; 99+ messages in thread
From: Jag Raman @ 2022-02-10 22:23 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: eduardo, Elena Ufimtseva, Daniel P. Berrangé,
	Beraldo Leal, John Johnson, quintela, qemu-devel, armbru,
	Alex Williamson, Marc-André Lureau, Dr. David Alan Gilbert,
	Stefan Hajnoczi, Paolo Bonzini, thanos.makatos, Eric Blake,
	john.levon, Philippe Mathieu-Daudé



> On Feb 10, 2022, at 3:02 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
> 
> On Thu, Feb 10, 2022 at 12:08:27AM +0000, Jag Raman wrote:
>> 
>> 
>>> On Feb 2, 2022, at 12:34 AM, Alex Williamson <alex.williamson@redhat.com> wrote:
>>> 
>>> On Wed, 2 Feb 2022 01:13:22 +0000
>>> Jag Raman <jag.raman@oracle.com> wrote:
>>> 
>>>>> On Feb 1, 2022, at 5:47 PM, Alex Williamson <alex.williamson@redhat.com> wrote:
>>>>> 
>>>>> On Tue, 1 Feb 2022 21:24:08 +0000
>>>>> Jag Raman <jag.raman@oracle.com> wrote:
>>>>> 
>>>>>>> On Feb 1, 2022, at 10:24 AM, Alex Williamson <alex.williamson@redhat.com> wrote:
>>>>>>> 
>>>>>>> On Tue, 1 Feb 2022 09:30:35 +0000
>>>>>>> Stefan Hajnoczi <stefanha@redhat.com> wrote:
>>>>>>> 
>>>>>>>> On Mon, Jan 31, 2022 at 09:16:23AM -0700, Alex Williamson wrote:    
>>>>>>>>> On Fri, 28 Jan 2022 09:18:08 +0000
>>>>>>>>> Stefan Hajnoczi <stefanha@redhat.com> wrote:
>>>>>>>>> 
>>>>>>>>>> On Thu, Jan 27, 2022 at 02:22:53PM -0700, Alex Williamson wrote:      
>>>>>>>>>>> If the goal here is to restrict DMA between devices, ie. peer-to-peer
>>>>>>>>>>> (p2p), why are we trying to re-invent what an IOMMU already does?        
>>>>>>>>>> 
>>>>>>>>>> The issue Dave raised is that vfio-user servers run in separate
>>>>>>>>>> processses from QEMU with shared memory access to RAM but no direct
>>>>>>>>>> access to non-RAM MemoryRegions. The virtiofs DAX Window BAR is one
>>>>>>>>>> example of a non-RAM MemoryRegion that can be the source/target of DMA
>>>>>>>>>> requests.
>>>>>>>>>> 
>>>>>>>>>> I don't think IOMMUs solve this problem but luckily the vfio-user
>>>>>>>>>> protocol already has messages that vfio-user servers can use as a
>>>>>>>>>> fallback when DMA cannot be completed through the shared memory RAM
>>>>>>>>>> accesses.
>>>>>>>>>> 
>>>>>>>>>>> In
>>>>>>>>>>> fact, it seems like an IOMMU does this better in providing an IOVA
>>>>>>>>>>> address space per BDF.  Is the dynamic mapping overhead too much?  What
>>>>>>>>>>> physical hardware properties or specifications could we leverage to
>>>>>>>>>>> restrict p2p mappings to a device?  Should it be governed by machine
>>>>>>>>>>> type to provide consistency between devices?  Should each "isolated"
>>>>>>>>>>> bus be in a separate root complex?  Thanks,        
>>>>>>>>>> 
>>>>>>>>>> There is a separate issue in this patch series regarding isolating the
>>>>>>>>>> address space where BAR accesses are made (i.e. the global
>>>>>>>>>> address_space_memory/io). When one process hosts multiple vfio-user
>>>>>>>>>> server instances (e.g. a software-defined network switch with multiple
>>>>>>>>>> ethernet devices) then each instance needs isolated memory and io address
>>>>>>>>>> spaces so that vfio-user clients don't cause collisions when they map
>>>>>>>>>> BARs to the same address.
>>>>>>>>>> 
>>>>>>>>>> I think the the separate root complex idea is a good solution. This
>>>>>>>>>> patch series takes a different approach by adding the concept of
>>>>>>>>>> isolated address spaces into hw/pci/.      
>>>>>>>>> 
>>>>>>>>> This all still seems pretty sketchy, BARs cannot overlap within the
>>>>>>>>> same vCPU address space, perhaps with the exception of when they're
>>>>>>>>> being sized, but DMA should be disabled during sizing.
>>>>>>>>> 
>>>>>>>>> Devices within the same VM context with identical BARs would need to
>>>>>>>>> operate in different address spaces.  For example a translation offset
>>>>>>>>> in the vCPU address space would allow unique addressing to the devices,
>>>>>>>>> perhaps using the translation offset bits to address a root complex and
>>>>>>>>> masking those bits for downstream transactions.
>>>>>>>>> 
>>>>>>>>> In general, the device simply operates in an address space, ie. an
>>>>>>>>> IOVA.  When a mapping is made within that address space, we perform a
>>>>>>>>> translation as necessary to generate a guest physical address.  The
>>>>>>>>> IOVA itself is only meaningful within the context of the address space,
>>>>>>>>> there is no requirement or expectation for it to be globally unique.
>>>>>>>>> 
>>>>>>>>> If the vfio-user server is making some sort of requirement that IOVAs
>>>>>>>>> are unique across all devices, that seems very, very wrong.  Thanks,      
>>>>>>>> 
>>>>>>>> Yes, BARs and IOVAs don't need to be unique across all devices.
>>>>>>>> 
>>>>>>>> The issue is that there can be as many guest physical address spaces as
>>>>>>>> there are vfio-user clients connected, so per-client isolated address
>>>>>>>> spaces are required. This patch series has a solution to that problem
>>>>>>>> with the new pci_isol_as_mem/io() API.    
>>>>>>> 
>>>>>>> Sorry, this still doesn't follow for me.  A server that hosts multiple
>>>>>>> devices across many VMs (I'm not sure if you're referring to the device
>>>>>>> or the VM as a client) needs to deal with different address spaces per
>>>>>>> device.  The server needs to be able to uniquely identify every DMA,
>>>>>>> which must be part of the interface protocol.  But I don't see how that
>>>>>>> imposes a requirement of an isolated address space.  If we want the
>>>>>>> device isolated because we don't trust the server, that's where an IOMMU
>>>>>>> provides per device isolation.  What is the restriction of the
>>>>>>> per-client isolated address space and why do we need it?  The server
>>>>>>> needing to support multiple clients is not a sufficient answer to
>>>>>>> impose new PCI bus types with an implicit restriction on the VM.    
>>>>>> 
>>>>>> Hi Alex,
>>>>>> 
>>>>>> I believe there are two separate problems with running PCI devices in
>>>>>> the vfio-user server. The first one is concerning memory isolation and
>>>>>> second one is vectoring of BAR accesses (as explained below).
>>>>>> 
>>>>>> In our previous patches (v3), we used an IOMMU to isolate memory
>>>>>> spaces. But we still had trouble with the vectoring. So we implemented
>>>>>> separate address spaces for each PCIBus to tackle both problems
>>>>>> simultaneously, based on the feedback we got.
>>>>>> 
>>>>>> The following gives an overview of issues concerning vectoring of
>>>>>> BAR accesses.
>>>>>> 
>>>>>> The device’s BAR regions are mapped into the guest physical address
>>>>>> space. The guest writes the guest PA of each BAR into the device’s BAR
>>>>>> registers. To access the BAR regions of the device, QEMU uses
>>>>>> address_space_rw() which vectors the physical address access to the
>>>>>> device BAR region handlers.  
>>>>> 
>>>>> The guest physical address written to the BAR is irrelevant from the
>>>>> device perspective, this only serves to assign the BAR an offset within
>>>>> the address_space_mem, which is used by the vCPU (and possibly other
>>>>> devices depending on their address space).  There is no reason for the
>>>>> device itself to care about this address.  
>>>> 
>>>> Thank you for the explanation, Alex!
>>>> 
>>>> The confusion at my part is whether we are inside the device already when
>>>> the server receives a request to access BAR region of a device. Based on
>>>> your explanation, I get that your view is the BAR access request has
>>>> propagated into the device already, whereas I was under the impression
>>>> that the request is still on the CPU side of the PCI root complex.
>>> 
>>> If you are getting an access through your MemoryRegionOps, all the
>>> translations have been made, you simply need to use the hwaddr as the
>>> offset into the MemoryRegion for the access.  Perform the read/write to
>>> your device, no further translations required.
>>> 
>>>> Your view makes sense to me - once the BAR access request reaches the
>>>> client (on the other side), we could consider that the request has reached
>>>> the device.
>>>> 
>>>> On a separate note, if devices don’t care about the values in BAR
>>>> registers, why do the default PCI config handlers intercept and map
>>>> the BAR region into address_space_mem?
>>>> (pci_default_write_config() -> pci_update_mappings())
>>> 
>>> This is the part that's actually placing the BAR MemoryRegion as a
>>> sub-region into the vCPU address space.  I think if you track it,
>>> you'll see PCIDevice.io_regions[i].address_space is actually
>>> system_memory, which is used to initialize address_space_system.
>>> 
>>> The machine assembles PCI devices onto buses as instructed by the
>>> command line or hot plug operations.  It's the responsibility of the
>>> guest firmware and guest OS to probe those devices, size the BARs, and
>>> place the BARs into the memory hierarchy of the PCI bus, ie. system
>>> memory.  The BARs are necessarily in the "guest physical memory" for
>>> vCPU access, but it's essentially only coincidental that PCI devices
>>> might be in an address space that provides a mapping to their own BAR.
>>> There's no reason to ever use it.
>>> 
>>> In the vIOMMU case, we can't know that the device address space
>>> includes those BAR mappings or if they do, that they're identity mapped
>>> to the physical address.  Devices really need to not infer anything
>>> about an address.  Think about real hardware, a device is told by
>>> driver programming to perform a DMA operation.  The device doesn't know
>>> the target of that operation, it's the guest driver's responsibility to
>>> make sure the IOVA within the device address space is valid and maps to
>>> the desired target.  Thanks,
>> 
>> Thanks for the explanation, Alex. Thanks to everyone else in the thread who
>> helped to clarify this problem.
>> 
>> We have implemented the memory isolation based on the discussion in the
>> thread. We will send the patches out shortly.
>> 
>> Devices such as “name" and “e1000” worked fine. But I’d like to note that
>> the LSI device (TYPE_LSI53C895A) had some problems - it doesn’t seem
>> to be IOMMU aware. In LSI’s case, the kernel driver is asking the device to
>> read instructions from the CPU VA (lsi_execute_script() -> read_dword()),
>> which is forbidden when IOMMU is enabled. Specifically, the driver is asking
>> the device to access other BAR regions by using the BAR address programmed
>> in the PCI config space. This happens even without vfio-user patches. For example,
>> we could enable IOMMU using “-device intel-iommu” QEMU option and also
>> adding the following to the kernel command-line: “intel_iommu=on iommu=nopt”.
>> In this case, we could see an IOMMU fault.
> 
> So, device accessing its own BAR is different. Basically, these
> transactions never go on the bus at all, never mind get to the IOMMU.

Hi Michael,

In LSI case, I did notice that it went to the IOMMU. The device is reading the BAR
address as if it was a DMA address.

> I think it's just used as a handle to address internal device memory.
> This kind of trick is not universal, but not terribly unusual.
> 
> 
>> Unfortunately, we started off our project with the LSI device. So that lead to all the
>> confusion about what is expected at the server end in-terms of
>> vectoring/address-translation. It gave an impression as if the request was still on
>> the CPU side of the PCI root complex, but the actual problem was with the
>> device driver itself.
>> 
>> I’m wondering how to deal with this problem. Would it be OK if we mapped the
>> device’s BAR into the IOVA, at the same CPU VA programmed in the BAR registers?
>> This would help devices such as LSI to circumvent this problem. One problem
>> with this approach is that it has the potential to collide with another legitimate
>> IOVA address. Kindly share your thought on this.
>> 
>> Thank you!
> 
> I am not 100% sure what do you plan to do but it sounds fine since even
> if it collides, with traditional PCI device must never initiate cycles

OK sounds good, I’ll create a mapping of the device BARs in the IOVA.

Thank you!
--
Jag

> within their own BAR range, and PCIe is software-compatible with PCI. So
> devices won't be able to access this IOVA even if it was programmed in
> the IOMMU.
> 
> As was mentioned elsewhere on this thread, devices accessing each
> other's BAR is a different matter.
> 
> I do not remember which rules apply to multiple functions of a
> multi-function device though. I think in a traditional PCI
> they will never go out on the bus, but with e.g. SRIOV they
> would probably do go out? Alex, any idea?
> 
> 
>> --
>> Jag
>> 
>>> 
>>> Alex
>>> 
>> 
> 


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH v5 03/18] pci: isolated address space for PCI bus
  2022-02-10 22:23                                         ` Jag Raman
@ 2022-02-10 22:53                                           ` Michael S. Tsirkin
  2022-02-10 23:46                                             ` Jag Raman
  2022-02-10 23:17                                           ` Alex Williamson
  1 sibling, 1 reply; 99+ messages in thread
From: Michael S. Tsirkin @ 2022-02-10 22:53 UTC (permalink / raw)
  To: Jag Raman
  Cc: eduardo, Elena Ufimtseva, Daniel P. Berrangé,
	Beraldo Leal, John Johnson, quintela, qemu-devel, armbru,
	Alex Williamson, Marc-André Lureau, Dr. David Alan Gilbert,
	Stefan Hajnoczi, Paolo Bonzini, thanos.makatos, Eric Blake,
	john.levon, Philippe Mathieu-Daudé

On Thu, Feb 10, 2022 at 10:23:01PM +0000, Jag Raman wrote:
> 
> 
> > On Feb 10, 2022, at 3:02 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
> > 
> > On Thu, Feb 10, 2022 at 12:08:27AM +0000, Jag Raman wrote:
> >> 
> >> 
> >>> On Feb 2, 2022, at 12:34 AM, Alex Williamson <alex.williamson@redhat.com> wrote:
> >>> 
> >>> On Wed, 2 Feb 2022 01:13:22 +0000
> >>> Jag Raman <jag.raman@oracle.com> wrote:
> >>> 
> >>>>> On Feb 1, 2022, at 5:47 PM, Alex Williamson <alex.williamson@redhat.com> wrote:
> >>>>> 
> >>>>> On Tue, 1 Feb 2022 21:24:08 +0000
> >>>>> Jag Raman <jag.raman@oracle.com> wrote:
> >>>>> 
> >>>>>>> On Feb 1, 2022, at 10:24 AM, Alex Williamson <alex.williamson@redhat.com> wrote:
> >>>>>>> 
> >>>>>>> On Tue, 1 Feb 2022 09:30:35 +0000
> >>>>>>> Stefan Hajnoczi <stefanha@redhat.com> wrote:
> >>>>>>> 
> >>>>>>>> On Mon, Jan 31, 2022 at 09:16:23AM -0700, Alex Williamson wrote:    
> >>>>>>>>> On Fri, 28 Jan 2022 09:18:08 +0000
> >>>>>>>>> Stefan Hajnoczi <stefanha@redhat.com> wrote:
> >>>>>>>>> 
> >>>>>>>>>> On Thu, Jan 27, 2022 at 02:22:53PM -0700, Alex Williamson wrote:      
> >>>>>>>>>>> If the goal here is to restrict DMA between devices, ie. peer-to-peer
> >>>>>>>>>>> (p2p), why are we trying to re-invent what an IOMMU already does?        
> >>>>>>>>>> 
> >>>>>>>>>> The issue Dave raised is that vfio-user servers run in separate
> >>>>>>>>>> processses from QEMU with shared memory access to RAM but no direct
> >>>>>>>>>> access to non-RAM MemoryRegions. The virtiofs DAX Window BAR is one
> >>>>>>>>>> example of a non-RAM MemoryRegion that can be the source/target of DMA
> >>>>>>>>>> requests.
> >>>>>>>>>> 
> >>>>>>>>>> I don't think IOMMUs solve this problem but luckily the vfio-user
> >>>>>>>>>> protocol already has messages that vfio-user servers can use as a
> >>>>>>>>>> fallback when DMA cannot be completed through the shared memory RAM
> >>>>>>>>>> accesses.
> >>>>>>>>>> 
> >>>>>>>>>>> In
> >>>>>>>>>>> fact, it seems like an IOMMU does this better in providing an IOVA
> >>>>>>>>>>> address space per BDF.  Is the dynamic mapping overhead too much?  What
> >>>>>>>>>>> physical hardware properties or specifications could we leverage to
> >>>>>>>>>>> restrict p2p mappings to a device?  Should it be governed by machine
> >>>>>>>>>>> type to provide consistency between devices?  Should each "isolated"
> >>>>>>>>>>> bus be in a separate root complex?  Thanks,        
> >>>>>>>>>> 
> >>>>>>>>>> There is a separate issue in this patch series regarding isolating the
> >>>>>>>>>> address space where BAR accesses are made (i.e. the global
> >>>>>>>>>> address_space_memory/io). When one process hosts multiple vfio-user
> >>>>>>>>>> server instances (e.g. a software-defined network switch with multiple
> >>>>>>>>>> ethernet devices) then each instance needs isolated memory and io address
> >>>>>>>>>> spaces so that vfio-user clients don't cause collisions when they map
> >>>>>>>>>> BARs to the same address.
> >>>>>>>>>> 
> >>>>>>>>>> I think the the separate root complex idea is a good solution. This
> >>>>>>>>>> patch series takes a different approach by adding the concept of
> >>>>>>>>>> isolated address spaces into hw/pci/.      
> >>>>>>>>> 
> >>>>>>>>> This all still seems pretty sketchy, BARs cannot overlap within the
> >>>>>>>>> same vCPU address space, perhaps with the exception of when they're
> >>>>>>>>> being sized, but DMA should be disabled during sizing.
> >>>>>>>>> 
> >>>>>>>>> Devices within the same VM context with identical BARs would need to
> >>>>>>>>> operate in different address spaces.  For example a translation offset
> >>>>>>>>> in the vCPU address space would allow unique addressing to the devices,
> >>>>>>>>> perhaps using the translation offset bits to address a root complex and
> >>>>>>>>> masking those bits for downstream transactions.
> >>>>>>>>> 
> >>>>>>>>> In general, the device simply operates in an address space, ie. an
> >>>>>>>>> IOVA.  When a mapping is made within that address space, we perform a
> >>>>>>>>> translation as necessary to generate a guest physical address.  The
> >>>>>>>>> IOVA itself is only meaningful within the context of the address space,
> >>>>>>>>> there is no requirement or expectation for it to be globally unique.
> >>>>>>>>> 
> >>>>>>>>> If the vfio-user server is making some sort of requirement that IOVAs
> >>>>>>>>> are unique across all devices, that seems very, very wrong.  Thanks,      
> >>>>>>>> 
> >>>>>>>> Yes, BARs and IOVAs don't need to be unique across all devices.
> >>>>>>>> 
> >>>>>>>> The issue is that there can be as many guest physical address spaces as
> >>>>>>>> there are vfio-user clients connected, so per-client isolated address
> >>>>>>>> spaces are required. This patch series has a solution to that problem
> >>>>>>>> with the new pci_isol_as_mem/io() API.    
> >>>>>>> 
> >>>>>>> Sorry, this still doesn't follow for me.  A server that hosts multiple
> >>>>>>> devices across many VMs (I'm not sure if you're referring to the device
> >>>>>>> or the VM as a client) needs to deal with different address spaces per
> >>>>>>> device.  The server needs to be able to uniquely identify every DMA,
> >>>>>>> which must be part of the interface protocol.  But I don't see how that
> >>>>>>> imposes a requirement of an isolated address space.  If we want the
> >>>>>>> device isolated because we don't trust the server, that's where an IOMMU
> >>>>>>> provides per device isolation.  What is the restriction of the
> >>>>>>> per-client isolated address space and why do we need it?  The server
> >>>>>>> needing to support multiple clients is not a sufficient answer to
> >>>>>>> impose new PCI bus types with an implicit restriction on the VM.    
> >>>>>> 
> >>>>>> Hi Alex,
> >>>>>> 
> >>>>>> I believe there are two separate problems with running PCI devices in
> >>>>>> the vfio-user server. The first one is concerning memory isolation and
> >>>>>> second one is vectoring of BAR accesses (as explained below).
> >>>>>> 
> >>>>>> In our previous patches (v3), we used an IOMMU to isolate memory
> >>>>>> spaces. But we still had trouble with the vectoring. So we implemented
> >>>>>> separate address spaces for each PCIBus to tackle both problems
> >>>>>> simultaneously, based on the feedback we got.
> >>>>>> 
> >>>>>> The following gives an overview of issues concerning vectoring of
> >>>>>> BAR accesses.
> >>>>>> 
> >>>>>> The device’s BAR regions are mapped into the guest physical address
> >>>>>> space. The guest writes the guest PA of each BAR into the device’s BAR
> >>>>>> registers. To access the BAR regions of the device, QEMU uses
> >>>>>> address_space_rw() which vectors the physical address access to the
> >>>>>> device BAR region handlers.  
> >>>>> 
> >>>>> The guest physical address written to the BAR is irrelevant from the
> >>>>> device perspective, this only serves to assign the BAR an offset within
> >>>>> the address_space_mem, which is used by the vCPU (and possibly other
> >>>>> devices depending on their address space).  There is no reason for the
> >>>>> device itself to care about this address.  
> >>>> 
> >>>> Thank you for the explanation, Alex!
> >>>> 
> >>>> The confusion at my part is whether we are inside the device already when
> >>>> the server receives a request to access BAR region of a device. Based on
> >>>> your explanation, I get that your view is the BAR access request has
> >>>> propagated into the device already, whereas I was under the impression
> >>>> that the request is still on the CPU side of the PCI root complex.
> >>> 
> >>> If you are getting an access through your MemoryRegionOps, all the
> >>> translations have been made, you simply need to use the hwaddr as the
> >>> offset into the MemoryRegion for the access.  Perform the read/write to
> >>> your device, no further translations required.
> >>> 
> >>>> Your view makes sense to me - once the BAR access request reaches the
> >>>> client (on the other side), we could consider that the request has reached
> >>>> the device.
> >>>> 
> >>>> On a separate note, if devices don’t care about the values in BAR
> >>>> registers, why do the default PCI config handlers intercept and map
> >>>> the BAR region into address_space_mem?
> >>>> (pci_default_write_config() -> pci_update_mappings())
> >>> 
> >>> This is the part that's actually placing the BAR MemoryRegion as a
> >>> sub-region into the vCPU address space.  I think if you track it,
> >>> you'll see PCIDevice.io_regions[i].address_space is actually
> >>> system_memory, which is used to initialize address_space_system.
> >>> 
> >>> The machine assembles PCI devices onto buses as instructed by the
> >>> command line or hot plug operations.  It's the responsibility of the
> >>> guest firmware and guest OS to probe those devices, size the BARs, and
> >>> place the BARs into the memory hierarchy of the PCI bus, ie. system
> >>> memory.  The BARs are necessarily in the "guest physical memory" for
> >>> vCPU access, but it's essentially only coincidental that PCI devices
> >>> might be in an address space that provides a mapping to their own BAR.
> >>> There's no reason to ever use it.
> >>> 
> >>> In the vIOMMU case, we can't know that the device address space
> >>> includes those BAR mappings or if they do, that they're identity mapped
> >>> to the physical address.  Devices really need to not infer anything
> >>> about an address.  Think about real hardware, a device is told by
> >>> driver programming to perform a DMA operation.  The device doesn't know
> >>> the target of that operation, it's the guest driver's responsibility to
> >>> make sure the IOVA within the device address space is valid and maps to
> >>> the desired target.  Thanks,
> >> 
> >> Thanks for the explanation, Alex. Thanks to everyone else in the thread who
> >> helped to clarify this problem.
> >> 
> >> We have implemented the memory isolation based on the discussion in the
> >> thread. We will send the patches out shortly.
> >> 
> >> Devices such as “name" and “e1000” worked fine. But I’d like to note that
> >> the LSI device (TYPE_LSI53C895A) had some problems - it doesn’t seem
> >> to be IOMMU aware. In LSI’s case, the kernel driver is asking the device to
> >> read instructions from the CPU VA (lsi_execute_script() -> read_dword()),
> >> which is forbidden when IOMMU is enabled. Specifically, the driver is asking
> >> the device to access other BAR regions by using the BAR address programmed
> >> in the PCI config space. This happens even without vfio-user patches. For example,
> >> we could enable IOMMU using “-device intel-iommu” QEMU option and also
> >> adding the following to the kernel command-line: “intel_iommu=on iommu=nopt”.
> >> In this case, we could see an IOMMU fault.
> > 
> > So, device accessing its own BAR is different. Basically, these
> > transactions never go on the bus at all, never mind get to the IOMMU.
> 
> Hi Michael,
> 
> In LSI case, I did notice that it went to the IOMMU.

Hmm do you mean you analyzed how a physical device works?
Or do you mean in QEMU?

> The device is reading the BAR
> address as if it was a DMA address.

I got that, my understanding of PCI was that a device can
not be both a master and a target of a transaction at
the same time though. Could not find this in the spec though,
maybe I remember incorrectly.

> > I think it's just used as a handle to address internal device memory.
> > This kind of trick is not universal, but not terribly unusual.
> > 
> > 
> >> Unfortunately, we started off our project with the LSI device. So that lead to all the
> >> confusion about what is expected at the server end in-terms of
> >> vectoring/address-translation. It gave an impression as if the request was still on
> >> the CPU side of the PCI root complex, but the actual problem was with the
> >> device driver itself.
> >> 
> >> I’m wondering how to deal with this problem. Would it be OK if we mapped the
> >> device’s BAR into the IOVA, at the same CPU VA programmed in the BAR registers?
> >> This would help devices such as LSI to circumvent this problem. One problem
> >> with this approach is that it has the potential to collide with another legitimate
> >> IOVA address. Kindly share your thought on this.
> >> 
> >> Thank you!
> > 
> > I am not 100% sure what do you plan to do but it sounds fine since even
> > if it collides, with traditional PCI device must never initiate cycles
> 
> OK sounds good, I’ll create a mapping of the device BARs in the IOVA.
> 
> Thank you!
> --
> Jag
> 
> > within their own BAR range, and PCIe is software-compatible with PCI. So
> > devices won't be able to access this IOVA even if it was programmed in
> > the IOMMU.
> > 
> > As was mentioned elsewhere on this thread, devices accessing each
> > other's BAR is a different matter.
> > 
> > I do not remember which rules apply to multiple functions of a
> > multi-function device though. I think in a traditional PCI
> > they will never go out on the bus, but with e.g. SRIOV they
> > would probably do go out? Alex, any idea?
> > 
> > 
> >> --
> >> Jag
> >> 
> >>> 
> >>> Alex
> >>> 
> >> 
> > 
> 



^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH v5 03/18] pci: isolated address space for PCI bus
  2022-02-10 22:23                                         ` Jag Raman
  2022-02-10 22:53                                           ` Michael S. Tsirkin
@ 2022-02-10 23:17                                           ` Alex Williamson
  2022-02-10 23:28                                             ` Michael S. Tsirkin
  2022-02-11  0:10                                             ` Jag Raman
  1 sibling, 2 replies; 99+ messages in thread
From: Alex Williamson @ 2022-02-10 23:17 UTC (permalink / raw)
  To: Jag Raman
  Cc: eduardo, Elena Ufimtseva, Daniel P. Berrangé,
	Beraldo Leal, John Johnson, Michael S. Tsirkin, qemu-devel,
	armbru, quintela, Marc-André Lureau, Dr. David Alan Gilbert,
	Stefan Hajnoczi, Paolo Bonzini, thanos.makatos, Eric Blake,
	john.levon, Philippe Mathieu-Daudé

On Thu, 10 Feb 2022 22:23:01 +0000
Jag Raman <jag.raman@oracle.com> wrote:

> > On Feb 10, 2022, at 3:02 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
> > 
> > On Thu, Feb 10, 2022 at 12:08:27AM +0000, Jag Raman wrote:  
> >> 
> >> Thanks for the explanation, Alex. Thanks to everyone else in the thread who
> >> helped to clarify this problem.
> >> 
> >> We have implemented the memory isolation based on the discussion in the
> >> thread. We will send the patches out shortly.
> >> 
> >> Devices such as “name" and “e1000” worked fine. But I’d like to note that
> >> the LSI device (TYPE_LSI53C895A) had some problems - it doesn’t seem
> >> to be IOMMU aware. In LSI’s case, the kernel driver is asking the device to
> >> read instructions from the CPU VA (lsi_execute_script() -> read_dword()),
> >> which is forbidden when IOMMU is enabled. Specifically, the driver is asking
> >> the device to access other BAR regions by using the BAR address programmed
> >> in the PCI config space. This happens even without vfio-user patches. For example,
> >> we could enable IOMMU using “-device intel-iommu” QEMU option and also
> >> adding the following to the kernel command-line: “intel_iommu=on iommu=nopt”.
> >> In this case, we could see an IOMMU fault.  
> > 
> > So, device accessing its own BAR is different. Basically, these
> > transactions never go on the bus at all, never mind get to the IOMMU.  
> 
> Hi Michael,
> 
> In LSI case, I did notice that it went to the IOMMU. The device is reading the BAR
> address as if it was a DMA address.
> 
> > I think it's just used as a handle to address internal device memory.
> > This kind of trick is not universal, but not terribly unusual.
> > 
> >   
> >> Unfortunately, we started off our project with the LSI device. So that lead to all the
> >> confusion about what is expected at the server end in-terms of
> >> vectoring/address-translation. It gave an impression as if the request was still on
> >> the CPU side of the PCI root complex, but the actual problem was with the
> >> device driver itself.
> >> 
> >> I’m wondering how to deal with this problem. Would it be OK if we mapped the
> >> device’s BAR into the IOVA, at the same CPU VA programmed in the BAR registers?
> >> This would help devices such as LSI to circumvent this problem. One problem
> >> with this approach is that it has the potential to collide with another legitimate
> >> IOVA address. Kindly share your thought on this.
> >> 
> >> Thank you!  
> > 
> > I am not 100% sure what do you plan to do but it sounds fine since even
> > if it collides, with traditional PCI device must never initiate cycles  
> 
> OK sounds good, I’ll create a mapping of the device BARs in the IOVA.

I don't think this is correct.  Look for instance at ACPI _TRA support
where a system can specify a translation offset such that, for example,
a CPU access to a device is required to add the provided offset to the
bus address of the device.  A system using this could have multiple
root bridges, where each is given the same, overlapping MMIO aperture.
From the processor perspective, each MMIO range is unique and possibly
none of those devices have a zero _TRA, there could be system memory at
the equivalent flat memory address.

So if the transaction actually hits this bus, which I think is what
making use of the device AddressSpace implies, I don't think it can
assume that it's simply reflected back at itself.  Conventional PCI and
PCI Express may be software compatible, but there's a reason we don't
see IOMMUs that provide both translation and isolation in conventional
topologies.

Is this more a bug in the LSI device emulation model?  For instance in
vfio-pci, if I want to access an offset into a BAR from within QEMU, I
don't care what address is programmed into that BAR, I perform an
access relative to the vfio file descriptor region representing that
BAR space.  I'd expect that any viable device emulation model does the
same, an access to device memory uses an offset from an internal
resource, irrespective of the BAR address.

It would seem strange if the driver is actually programming the device
to DMA to itself and if that's actually happening, I'd wonder if this
driver is actually compatible with an IOMMU on bare metal.

> > within their own BAR range, and PCIe is software-compatible with PCI. So
> > devices won't be able to access this IOVA even if it was programmed in
> > the IOMMU.
> > 
> > As was mentioned elsewhere on this thread, devices accessing each
> > other's BAR is a different matter.
> > 
> > I do not remember which rules apply to multiple functions of a
> > multi-function device though. I think in a traditional PCI
> > they will never go out on the bus, but with e.g. SRIOV they
> > would probably do go out? Alex, any idea?

This falls under implementation specific behavior in the spec, IIRC.
This is actually why IOMMU grouping requires ACS support on
multi-function devices to clarify the behavior of p2p between functions
in the same slot.  Thanks,

Alex



^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH v5 03/18] pci: isolated address space for PCI bus
  2022-02-10 23:17                                           ` Alex Williamson
@ 2022-02-10 23:28                                             ` Michael S. Tsirkin
  2022-02-10 23:49                                               ` Alex Williamson
  2022-02-11  0:10                                             ` Jag Raman
  1 sibling, 1 reply; 99+ messages in thread
From: Michael S. Tsirkin @ 2022-02-10 23:28 UTC (permalink / raw)
  To: Alex Williamson
  Cc: eduardo, Elena Ufimtseva, Jag Raman, Beraldo Leal, John Johnson,
	quintela, qemu-devel, armbru, Paolo Bonzini,
	Marc-André Lureau, Dr. David Alan Gilbert, Stefan Hajnoczi,
	thanos.makatos, Daniel P. Berrangé,
	Eric Blake, john.levon, Philippe Mathieu-Daudé

On Thu, Feb 10, 2022 at 04:17:34PM -0700, Alex Williamson wrote:
> On Thu, 10 Feb 2022 22:23:01 +0000
> Jag Raman <jag.raman@oracle.com> wrote:
> 
> > > On Feb 10, 2022, at 3:02 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
> > > 
> > > On Thu, Feb 10, 2022 at 12:08:27AM +0000, Jag Raman wrote:  
> > >> 
> > >> Thanks for the explanation, Alex. Thanks to everyone else in the thread who
> > >> helped to clarify this problem.
> > >> 
> > >> We have implemented the memory isolation based on the discussion in the
> > >> thread. We will send the patches out shortly.
> > >> 
> > >> Devices such as “name" and “e1000” worked fine. But I’d like to note that
> > >> the LSI device (TYPE_LSI53C895A) had some problems - it doesn’t seem
> > >> to be IOMMU aware. In LSI’s case, the kernel driver is asking the device to
> > >> read instructions from the CPU VA (lsi_execute_script() -> read_dword()),
> > >> which is forbidden when IOMMU is enabled. Specifically, the driver is asking
> > >> the device to access other BAR regions by using the BAR address programmed
> > >> in the PCI config space. This happens even without vfio-user patches. For example,
> > >> we could enable IOMMU using “-device intel-iommu” QEMU option and also
> > >> adding the following to the kernel command-line: “intel_iommu=on iommu=nopt”.
> > >> In this case, we could see an IOMMU fault.  
> > > 
> > > So, device accessing its own BAR is different. Basically, these
> > > transactions never go on the bus at all, never mind get to the IOMMU.  
> > 
> > Hi Michael,
> > 
> > In LSI case, I did notice that it went to the IOMMU. The device is reading the BAR
> > address as if it was a DMA address.
> > 
> > > I think it's just used as a handle to address internal device memory.
> > > This kind of trick is not universal, but not terribly unusual.
> > > 
> > >   
> > >> Unfortunately, we started off our project with the LSI device. So that lead to all the
> > >> confusion about what is expected at the server end in-terms of
> > >> vectoring/address-translation. It gave an impression as if the request was still on
> > >> the CPU side of the PCI root complex, but the actual problem was with the
> > >> device driver itself.
> > >> 
> > >> I’m wondering how to deal with this problem. Would it be OK if we mapped the
> > >> device’s BAR into the IOVA, at the same CPU VA programmed in the BAR registers?
> > >> This would help devices such as LSI to circumvent this problem. One problem
> > >> with this approach is that it has the potential to collide with another legitimate
> > >> IOVA address. Kindly share your thought on this.
> > >> 
> > >> Thank you!  
> > > 
> > > I am not 100% sure what do you plan to do but it sounds fine since even
> > > if it collides, with traditional PCI device must never initiate cycles  
> > 
> > OK sounds good, I’ll create a mapping of the device BARs in the IOVA.
> 
> I don't think this is correct.  Look for instance at ACPI _TRA support
> where a system can specify a translation offset such that, for example,
> a CPU access to a device is required to add the provided offset to the
> bus address of the device.  A system using this could have multiple
> root bridges, where each is given the same, overlapping MMIO aperture.
> >From the processor perspective, each MMIO range is unique and possibly
> none of those devices have a zero _TRA, there could be system memory at
> the equivalent flat memory address.

I am guessing there are reasons to have these in acpi besides firmware
vendors wanting to find corner cases in device implementations though
:). E.g. it's possible something else is tweaking DMA in similar ways. I
can't say for sure and I wonder why do we care as long as QEMU does not
have _TRA.


> So if the transaction actually hits this bus, which I think is what
> making use of the device AddressSpace implies, I don't think it can
> assume that it's simply reflected back at itself.  Conventional PCI and
> PCI Express may be software compatible, but there's a reason we don't
> see IOMMUs that provide both translation and isolation in conventional
> topologies.
> 
> Is this more a bug in the LSI device emulation model?  For instance in
> vfio-pci, if I want to access an offset into a BAR from within QEMU, I
> don't care what address is programmed into that BAR, I perform an
> access relative to the vfio file descriptor region representing that
> BAR space.  I'd expect that any viable device emulation model does the
> same, an access to device memory uses an offset from an internal
> resource, irrespective of the BAR address.

However, using BAR seems like a reasonable shortcut allowing
device to use the same 64 bit address to refer to system
and device RAM interchangeably.

> It would seem strange if the driver is actually programming the device
> to DMA to itself and if that's actually happening, I'd wonder if this
> driver is actually compatible with an IOMMU on bare metal.

You know, it's hardware after all. As long as things work vendors
happily ship the device. They don't really worry about theoretical issues ;).

> > > within their own BAR range, and PCIe is software-compatible with PCI. So
> > > devices won't be able to access this IOVA even if it was programmed in
> > > the IOMMU.
> > > 
> > > As was mentioned elsewhere on this thread, devices accessing each
> > > other's BAR is a different matter.
> > > 
> > > I do not remember which rules apply to multiple functions of a
> > > multi-function device though. I think in a traditional PCI
> > > they will never go out on the bus, but with e.g. SRIOV they
> > > would probably do go out? Alex, any idea?
> 
> This falls under implementation specific behavior in the spec, IIRC.
> This is actually why IOMMU grouping requires ACS support on
> multi-function devices to clarify the behavior of p2p between functions
> in the same slot.  Thanks,
> 
> Alex

thanks!

-- 
MST



^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH v5 03/18] pci: isolated address space for PCI bus
  2022-02-10 22:53                                           ` Michael S. Tsirkin
@ 2022-02-10 23:46                                             ` Jag Raman
  0 siblings, 0 replies; 99+ messages in thread
From: Jag Raman @ 2022-02-10 23:46 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: eduardo, Elena Ufimtseva, Daniel P. Berrangé,
	Beraldo Leal, John Johnson, quintela, qemu-devel, armbru,
	Alex Williamson, Marc-André Lureau, Dr. David Alan Gilbert,
	Stefan Hajnoczi, Paolo Bonzini, thanos.makatos, Eric Blake,
	john.levon, Philippe Mathieu-Daudé



> On Feb 10, 2022, at 5:53 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> 
> On Thu, Feb 10, 2022 at 10:23:01PM +0000, Jag Raman wrote:
>> 
>> 
>>> On Feb 10, 2022, at 3:02 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
>>> 
>>> On Thu, Feb 10, 2022 at 12:08:27AM +0000, Jag Raman wrote:
>>>> 
>>>> 
>>>>> On Feb 2, 2022, at 12:34 AM, Alex Williamson <alex.williamson@redhat.com> wrote:
>>>>> 
>>>>> On Wed, 2 Feb 2022 01:13:22 +0000
>>>>> Jag Raman <jag.raman@oracle.com> wrote:
>>>>> 
>>>>>>> On Feb 1, 2022, at 5:47 PM, Alex Williamson <alex.williamson@redhat.com> wrote:
>>>>>>> 
>>>>>>> On Tue, 1 Feb 2022 21:24:08 +0000
>>>>>>> Jag Raman <jag.raman@oracle.com> wrote:
>>>>>>> 
>>>>>>>>> On Feb 1, 2022, at 10:24 AM, Alex Williamson <alex.williamson@redhat.com> wrote:
>>>>>>>>> 
>>>>>>>>> On Tue, 1 Feb 2022 09:30:35 +0000
>>>>>>>>> Stefan Hajnoczi <stefanha@redhat.com> wrote:
>>>>>>>>> 
>>>>>>>>>> On Mon, Jan 31, 2022 at 09:16:23AM -0700, Alex Williamson wrote:    
>>>>>>>>>>> On Fri, 28 Jan 2022 09:18:08 +0000
>>>>>>>>>>> Stefan Hajnoczi <stefanha@redhat.com> wrote:
>>>>>>>>>>> 
>>>>>>>>>>>> On Thu, Jan 27, 2022 at 02:22:53PM -0700, Alex Williamson wrote:      
>>>>>>>>>>>>> If the goal here is to restrict DMA between devices, ie. peer-to-peer
>>>>>>>>>>>>> (p2p), why are we trying to re-invent what an IOMMU already does?        
>>>>>>>>>>>> 
>>>>>>>>>>>> The issue Dave raised is that vfio-user servers run in separate
>>>>>>>>>>>> processses from QEMU with shared memory access to RAM but no direct
>>>>>>>>>>>> access to non-RAM MemoryRegions. The virtiofs DAX Window BAR is one
>>>>>>>>>>>> example of a non-RAM MemoryRegion that can be the source/target of DMA
>>>>>>>>>>>> requests.
>>>>>>>>>>>> 
>>>>>>>>>>>> I don't think IOMMUs solve this problem but luckily the vfio-user
>>>>>>>>>>>> protocol already has messages that vfio-user servers can use as a
>>>>>>>>>>>> fallback when DMA cannot be completed through the shared memory RAM
>>>>>>>>>>>> accesses.
>>>>>>>>>>>> 
>>>>>>>>>>>>> In
>>>>>>>>>>>>> fact, it seems like an IOMMU does this better in providing an IOVA
>>>>>>>>>>>>> address space per BDF.  Is the dynamic mapping overhead too much?  What
>>>>>>>>>>>>> physical hardware properties or specifications could we leverage to
>>>>>>>>>>>>> restrict p2p mappings to a device?  Should it be governed by machine
>>>>>>>>>>>>> type to provide consistency between devices?  Should each "isolated"
>>>>>>>>>>>>> bus be in a separate root complex?  Thanks,        
>>>>>>>>>>>> 
>>>>>>>>>>>> There is a separate issue in this patch series regarding isolating the
>>>>>>>>>>>> address space where BAR accesses are made (i.e. the global
>>>>>>>>>>>> address_space_memory/io). When one process hosts multiple vfio-user
>>>>>>>>>>>> server instances (e.g. a software-defined network switch with multiple
>>>>>>>>>>>> ethernet devices) then each instance needs isolated memory and io address
>>>>>>>>>>>> spaces so that vfio-user clients don't cause collisions when they map
>>>>>>>>>>>> BARs to the same address.
>>>>>>>>>>>> 
>>>>>>>>>>>> I think the the separate root complex idea is a good solution. This
>>>>>>>>>>>> patch series takes a different approach by adding the concept of
>>>>>>>>>>>> isolated address spaces into hw/pci/.      
>>>>>>>>>>> 
>>>>>>>>>>> This all still seems pretty sketchy, BARs cannot overlap within the
>>>>>>>>>>> same vCPU address space, perhaps with the exception of when they're
>>>>>>>>>>> being sized, but DMA should be disabled during sizing.
>>>>>>>>>>> 
>>>>>>>>>>> Devices within the same VM context with identical BARs would need to
>>>>>>>>>>> operate in different address spaces.  For example a translation offset
>>>>>>>>>>> in the vCPU address space would allow unique addressing to the devices,
>>>>>>>>>>> perhaps using the translation offset bits to address a root complex and
>>>>>>>>>>> masking those bits for downstream transactions.
>>>>>>>>>>> 
>>>>>>>>>>> In general, the device simply operates in an address space, ie. an
>>>>>>>>>>> IOVA.  When a mapping is made within that address space, we perform a
>>>>>>>>>>> translation as necessary to generate a guest physical address.  The
>>>>>>>>>>> IOVA itself is only meaningful within the context of the address space,
>>>>>>>>>>> there is no requirement or expectation for it to be globally unique.
>>>>>>>>>>> 
>>>>>>>>>>> If the vfio-user server is making some sort of requirement that IOVAs
>>>>>>>>>>> are unique across all devices, that seems very, very wrong.  Thanks,      
>>>>>>>>>> 
>>>>>>>>>> Yes, BARs and IOVAs don't need to be unique across all devices.
>>>>>>>>>> 
>>>>>>>>>> The issue is that there can be as many guest physical address spaces as
>>>>>>>>>> there are vfio-user clients connected, so per-client isolated address
>>>>>>>>>> spaces are required. This patch series has a solution to that problem
>>>>>>>>>> with the new pci_isol_as_mem/io() API.    
>>>>>>>>> 
>>>>>>>>> Sorry, this still doesn't follow for me.  A server that hosts multiple
>>>>>>>>> devices across many VMs (I'm not sure if you're referring to the device
>>>>>>>>> or the VM as a client) needs to deal with different address spaces per
>>>>>>>>> device.  The server needs to be able to uniquely identify every DMA,
>>>>>>>>> which must be part of the interface protocol.  But I don't see how that
>>>>>>>>> imposes a requirement of an isolated address space.  If we want the
>>>>>>>>> device isolated because we don't trust the server, that's where an IOMMU
>>>>>>>>> provides per device isolation.  What is the restriction of the
>>>>>>>>> per-client isolated address space and why do we need it?  The server
>>>>>>>>> needing to support multiple clients is not a sufficient answer to
>>>>>>>>> impose new PCI bus types with an implicit restriction on the VM.    
>>>>>>>> 
>>>>>>>> Hi Alex,
>>>>>>>> 
>>>>>>>> I believe there are two separate problems with running PCI devices in
>>>>>>>> the vfio-user server. The first one is concerning memory isolation and
>>>>>>>> second one is vectoring of BAR accesses (as explained below).
>>>>>>>> 
>>>>>>>> In our previous patches (v3), we used an IOMMU to isolate memory
>>>>>>>> spaces. But we still had trouble with the vectoring. So we implemented
>>>>>>>> separate address spaces for each PCIBus to tackle both problems
>>>>>>>> simultaneously, based on the feedback we got.
>>>>>>>> 
>>>>>>>> The following gives an overview of issues concerning vectoring of
>>>>>>>> BAR accesses.
>>>>>>>> 
>>>>>>>> The device’s BAR regions are mapped into the guest physical address
>>>>>>>> space. The guest writes the guest PA of each BAR into the device’s BAR
>>>>>>>> registers. To access the BAR regions of the device, QEMU uses
>>>>>>>> address_space_rw() which vectors the physical address access to the
>>>>>>>> device BAR region handlers.  
>>>>>>> 
>>>>>>> The guest physical address written to the BAR is irrelevant from the
>>>>>>> device perspective, this only serves to assign the BAR an offset within
>>>>>>> the address_space_mem, which is used by the vCPU (and possibly other
>>>>>>> devices depending on their address space).  There is no reason for the
>>>>>>> device itself to care about this address.  
>>>>>> 
>>>>>> Thank you for the explanation, Alex!
>>>>>> 
>>>>>> The confusion at my part is whether we are inside the device already when
>>>>>> the server receives a request to access BAR region of a device. Based on
>>>>>> your explanation, I get that your view is the BAR access request has
>>>>>> propagated into the device already, whereas I was under the impression
>>>>>> that the request is still on the CPU side of the PCI root complex.
>>>>> 
>>>>> If you are getting an access through your MemoryRegionOps, all the
>>>>> translations have been made, you simply need to use the hwaddr as the
>>>>> offset into the MemoryRegion for the access.  Perform the read/write to
>>>>> your device, no further translations required.
>>>>> 
>>>>>> Your view makes sense to me - once the BAR access request reaches the
>>>>>> client (on the other side), we could consider that the request has reached
>>>>>> the device.
>>>>>> 
>>>>>> On a separate note, if devices don’t care about the values in BAR
>>>>>> registers, why do the default PCI config handlers intercept and map
>>>>>> the BAR region into address_space_mem?
>>>>>> (pci_default_write_config() -> pci_update_mappings())
>>>>> 
>>>>> This is the part that's actually placing the BAR MemoryRegion as a
>>>>> sub-region into the vCPU address space.  I think if you track it,
>>>>> you'll see PCIDevice.io_regions[i].address_space is actually
>>>>> system_memory, which is used to initialize address_space_system.
>>>>> 
>>>>> The machine assembles PCI devices onto buses as instructed by the
>>>>> command line or hot plug operations.  It's the responsibility of the
>>>>> guest firmware and guest OS to probe those devices, size the BARs, and
>>>>> place the BARs into the memory hierarchy of the PCI bus, ie. system
>>>>> memory.  The BARs are necessarily in the "guest physical memory" for
>>>>> vCPU access, but it's essentially only coincidental that PCI devices
>>>>> might be in an address space that provides a mapping to their own BAR.
>>>>> There's no reason to ever use it.
>>>>> 
>>>>> In the vIOMMU case, we can't know that the device address space
>>>>> includes those BAR mappings or if they do, that they're identity mapped
>>>>> to the physical address.  Devices really need to not infer anything
>>>>> about an address.  Think about real hardware, a device is told by
>>>>> driver programming to perform a DMA operation.  The device doesn't know
>>>>> the target of that operation, it's the guest driver's responsibility to
>>>>> make sure the IOVA within the device address space is valid and maps to
>>>>> the desired target.  Thanks,
>>>> 
>>>> Thanks for the explanation, Alex. Thanks to everyone else in the thread who
>>>> helped to clarify this problem.
>>>> 
>>>> We have implemented the memory isolation based on the discussion in the
>>>> thread. We will send the patches out shortly.
>>>> 
>>>> Devices such as “name" and “e1000” worked fine. But I’d like to note that
>>>> the LSI device (TYPE_LSI53C895A) had some problems - it doesn’t seem
>>>> to be IOMMU aware. In LSI’s case, the kernel driver is asking the device to
>>>> read instructions from the CPU VA (lsi_execute_script() -> read_dword()),
>>>> which is forbidden when IOMMU is enabled. Specifically, the driver is asking
>>>> the device to access other BAR regions by using the BAR address programmed
>>>> in the PCI config space. This happens even without vfio-user patches. For example,
>>>> we could enable IOMMU using “-device intel-iommu” QEMU option and also
>>>> adding the following to the kernel command-line: “intel_iommu=on iommu=nopt”.
>>>> In this case, we could see an IOMMU fault.
>>> 
>>> So, device accessing its own BAR is different. Basically, these
>>> transactions never go on the bus at all, never mind get to the IOMMU.
>> 
>> Hi Michael,
>> 
>> In LSI case, I did notice that it went to the IOMMU.
> 
> Hmm do you mean you analyzed how a physical device works?
> Or do you mean in QEMU?

I mean in QEMU, I did not analyze a physical device.
> 
>> The device is reading the BAR
>> address as if it was a DMA address.
> 
> I got that, my understanding of PCI was that a device can
> not be both a master and a target of a transaction at
> the same time though. Could not find this in the spec though,
> maybe I remember incorrectly.

I see, OK. If this were to happen in a real device, PCI would raise
an error because the master and target of a transaction can’t be
the same. So you believe that this access is handled inside the
device, and doesn’t go out.

Thanks!
--
Jag

> 
>>> I think it's just used as a handle to address internal device memory.
>>> This kind of trick is not universal, but not terribly unusual.
>>> 
>>> 
>>>> Unfortunately, we started off our project with the LSI device. So that lead to all the
>>>> confusion about what is expected at the server end in-terms of
>>>> vectoring/address-translation. It gave an impression as if the request was still on
>>>> the CPU side of the PCI root complex, but the actual problem was with the
>>>> device driver itself.
>>>> 
>>>> I’m wondering how to deal with this problem. Would it be OK if we mapped the
>>>> device’s BAR into the IOVA, at the same CPU VA programmed in the BAR registers?
>>>> This would help devices such as LSI to circumvent this problem. One problem
>>>> with this approach is that it has the potential to collide with another legitimate
>>>> IOVA address. Kindly share your thought on this.
>>>> 
>>>> Thank you!
>>> 
>>> I am not 100% sure what do you plan to do but it sounds fine since even
>>> if it collides, with traditional PCI device must never initiate cycles
>> 
>> OK sounds good, I’ll create a mapping of the device BARs in the IOVA.
>> 
>> Thank you!
>> --
>> Jag
>> 
>>> within their own BAR range, and PCIe is software-compatible with PCI. So
>>> devices won't be able to access this IOVA even if it was programmed in
>>> the IOMMU.
>>> 
>>> As was mentioned elsewhere on this thread, devices accessing each
>>> other's BAR is a different matter.
>>> 
>>> I do not remember which rules apply to multiple functions of a
>>> multi-function device though. I think in a traditional PCI
>>> they will never go out on the bus, but with e.g. SRIOV they
>>> would probably do go out? Alex, any idea?
>>> 
>>> 
>>>> --
>>>> Jag
>>>> 
>>>>> 
>>>>> Alex


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH v5 03/18] pci: isolated address space for PCI bus
  2022-02-10 23:28                                             ` Michael S. Tsirkin
@ 2022-02-10 23:49                                               ` Alex Williamson
  2022-02-11  0:26                                                 ` Michael S. Tsirkin
  0 siblings, 1 reply; 99+ messages in thread
From: Alex Williamson @ 2022-02-10 23:49 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: eduardo, Elena Ufimtseva, Jag Raman, Beraldo Leal, John Johnson,
	quintela, qemu-devel, armbru, Paolo Bonzini,
	Marc-André Lureau, Dr. David Alan Gilbert, Stefan Hajnoczi,
	thanos.makatos, Daniel P. Berrangé,
	Eric Blake, john.levon, Philippe Mathieu-Daudé

On Thu, 10 Feb 2022 18:28:56 -0500
"Michael S. Tsirkin" <mst@redhat.com> wrote:

> On Thu, Feb 10, 2022 at 04:17:34PM -0700, Alex Williamson wrote:
> > On Thu, 10 Feb 2022 22:23:01 +0000
> > Jag Raman <jag.raman@oracle.com> wrote:
> >   
> > > > On Feb 10, 2022, at 3:02 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
> > > > 
> > > > On Thu, Feb 10, 2022 at 12:08:27AM +0000, Jag Raman wrote:    
> > > >> 
> > > >> Thanks for the explanation, Alex. Thanks to everyone else in the thread who
> > > >> helped to clarify this problem.
> > > >> 
> > > >> We have implemented the memory isolation based on the discussion in the
> > > >> thread. We will send the patches out shortly.
> > > >> 
> > > >> Devices such as “name" and “e1000” worked fine. But I’d like to note that
> > > >> the LSI device (TYPE_LSI53C895A) had some problems - it doesn’t seem
> > > >> to be IOMMU aware. In LSI’s case, the kernel driver is asking the device to
> > > >> read instructions from the CPU VA (lsi_execute_script() -> read_dword()),
> > > >> which is forbidden when IOMMU is enabled. Specifically, the driver is asking
> > > >> the device to access other BAR regions by using the BAR address programmed
> > > >> in the PCI config space. This happens even without vfio-user patches. For example,
> > > >> we could enable IOMMU using “-device intel-iommu” QEMU option and also
> > > >> adding the following to the kernel command-line: “intel_iommu=on iommu=nopt”.
> > > >> In this case, we could see an IOMMU fault.    
> > > > 
> > > > So, device accessing its own BAR is different. Basically, these
> > > > transactions never go on the bus at all, never mind get to the IOMMU.    
> > > 
> > > Hi Michael,
> > > 
> > > In LSI case, I did notice that it went to the IOMMU. The device is reading the BAR
> > > address as if it was a DMA address.
> > >   
> > > > I think it's just used as a handle to address internal device memory.
> > > > This kind of trick is not universal, but not terribly unusual.
> > > > 
> > > >     
> > > >> Unfortunately, we started off our project with the LSI device. So that lead to all the
> > > >> confusion about what is expected at the server end in-terms of
> > > >> vectoring/address-translation. It gave an impression as if the request was still on
> > > >> the CPU side of the PCI root complex, but the actual problem was with the
> > > >> device driver itself.
> > > >> 
> > > >> I’m wondering how to deal with this problem. Would it be OK if we mapped the
> > > >> device’s BAR into the IOVA, at the same CPU VA programmed in the BAR registers?
> > > >> This would help devices such as LSI to circumvent this problem. One problem
> > > >> with this approach is that it has the potential to collide with another legitimate
> > > >> IOVA address. Kindly share your thought on this.
> > > >> 
> > > >> Thank you!    
> > > > 
> > > > I am not 100% sure what do you plan to do but it sounds fine since even
> > > > if it collides, with traditional PCI device must never initiate cycles    
> > > 
> > > OK sounds good, I’ll create a mapping of the device BARs in the IOVA.  
> > 
> > I don't think this is correct.  Look for instance at ACPI _TRA support
> > where a system can specify a translation offset such that, for example,
> > a CPU access to a device is required to add the provided offset to the
> > bus address of the device.  A system using this could have multiple
> > root bridges, where each is given the same, overlapping MMIO aperture.  
> > >From the processor perspective, each MMIO range is unique and possibly  
> > none of those devices have a zero _TRA, there could be system memory at
> > the equivalent flat memory address.  
> 
> I am guessing there are reasons to have these in acpi besides firmware
> vendors wanting to find corner cases in device implementations though
> :). E.g. it's possible something else is tweaking DMA in similar ways. I
> can't say for sure and I wonder why do we care as long as QEMU does not
> have _TRA.

How many complaints do we get about running out of I/O port space on
q35 because we allow an arbitrary number of root ports?  What if we
used _TRA to provide the full I/O port range per root port?  32-bit
MMIO could be duplicated as well.

> > So if the transaction actually hits this bus, which I think is what
> > making use of the device AddressSpace implies, I don't think it can
> > assume that it's simply reflected back at itself.  Conventional PCI and
> > PCI Express may be software compatible, but there's a reason we don't
> > see IOMMUs that provide both translation and isolation in conventional
> > topologies.
> > 
> > Is this more a bug in the LSI device emulation model?  For instance in
> > vfio-pci, if I want to access an offset into a BAR from within QEMU, I
> > don't care what address is programmed into that BAR, I perform an
> > access relative to the vfio file descriptor region representing that
> > BAR space.  I'd expect that any viable device emulation model does the
> > same, an access to device memory uses an offset from an internal
> > resource, irrespective of the BAR address.  
> 
> However, using BAR seems like a reasonable shortcut allowing
> device to use the same 64 bit address to refer to system
> and device RAM interchangeably.

A shortcut that breaks when an IOMMU is involved.

> > It would seem strange if the driver is actually programming the device
> > to DMA to itself and if that's actually happening, I'd wonder if this
> > driver is actually compatible with an IOMMU on bare metal.  
> 
> You know, it's hardware after all. As long as things work vendors
> happily ship the device. They don't really worry about theoretical issues ;).

We're talking about a 32-bit conventional PCI device from the previous
century.  IOMMUs are no longer theoretical, but not all drivers have
kept up.  It's maybe not the best choice as the de facto standard
device, imo.  Thanks,

Alex



^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH v5 03/18] pci: isolated address space for PCI bus
  2022-02-10 23:17                                           ` Alex Williamson
  2022-02-10 23:28                                             ` Michael S. Tsirkin
@ 2022-02-11  0:10                                             ` Jag Raman
  1 sibling, 0 replies; 99+ messages in thread
From: Jag Raman @ 2022-02-11  0:10 UTC (permalink / raw)
  To: Alex Williamson
  Cc: eduardo, Elena Ufimtseva, Daniel P. Berrangé,
	Beraldo Leal, John Johnson, Michael S. Tsirkin, qemu-devel,
	armbru, quintela, Marc-André Lureau, Dr. David Alan Gilbert,
	Stefan Hajnoczi, Paolo Bonzini, thanos.makatos, Eric Blake,
	john.levon, Philippe Mathieu-Daudé



> On Feb 10, 2022, at 6:17 PM, Alex Williamson <alex.williamson@redhat.com> wrote:
> 
> On Thu, 10 Feb 2022 22:23:01 +0000
> Jag Raman <jag.raman@oracle.com> wrote:
> 
>>> On Feb 10, 2022, at 3:02 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
>>> 
>>> On Thu, Feb 10, 2022 at 12:08:27AM +0000, Jag Raman wrote:  
>>>> 
>>>> Thanks for the explanation, Alex. Thanks to everyone else in the thread who
>>>> helped to clarify this problem.
>>>> 
>>>> We have implemented the memory isolation based on the discussion in the
>>>> thread. We will send the patches out shortly.
>>>> 
>>>> Devices such as “name" and “e1000” worked fine. But I’d like to note that
>>>> the LSI device (TYPE_LSI53C895A) had some problems - it doesn’t seem
>>>> to be IOMMU aware. In LSI’s case, the kernel driver is asking the device to
>>>> read instructions from the CPU VA (lsi_execute_script() -> read_dword()),
>>>> which is forbidden when IOMMU is enabled. Specifically, the driver is asking
>>>> the device to access other BAR regions by using the BAR address programmed
>>>> in the PCI config space. This happens even without vfio-user patches. For example,
>>>> we could enable IOMMU using “-device intel-iommu” QEMU option and also
>>>> adding the following to the kernel command-line: “intel_iommu=on iommu=nopt”.
>>>> In this case, we could see an IOMMU fault.  
>>> 
>>> So, device accessing its own BAR is different. Basically, these
>>> transactions never go on the bus at all, never mind get to the IOMMU.  
>> 
>> Hi Michael,
>> 
>> In LSI case, I did notice that it went to the IOMMU. The device is reading the BAR
>> address as if it was a DMA address.
>> 
>>> I think it's just used as a handle to address internal device memory.
>>> This kind of trick is not universal, but not terribly unusual.
>>> 
>>> 
>>>> Unfortunately, we started off our project with the LSI device. So that lead to all the
>>>> confusion about what is expected at the server end in-terms of
>>>> vectoring/address-translation. It gave an impression as if the request was still on
>>>> the CPU side of the PCI root complex, but the actual problem was with the
>>>> device driver itself.
>>>> 
>>>> I’m wondering how to deal with this problem. Would it be OK if we mapped the
>>>> device’s BAR into the IOVA, at the same CPU VA programmed in the BAR registers?
>>>> This would help devices such as LSI to circumvent this problem. One problem
>>>> with this approach is that it has the potential to collide with another legitimate
>>>> IOVA address. Kindly share your thought on this.
>>>> 
>>>> Thank you!  
>>> 
>>> I am not 100% sure what do you plan to do but it sounds fine since even
>>> if it collides, with traditional PCI device must never initiate cycles  
>> 
>> OK sounds good, I’ll create a mapping of the device BARs in the IOVA.
> 
> I don't think this is correct.  Look for instance at ACPI _TRA support
> where a system can specify a translation offset such that, for example,
> a CPU access to a device is required to add the provided offset to the
> bus address of the device.  A system using this could have multiple
> root bridges, where each is given the same, overlapping MMIO aperture.
> From the processor perspective, each MMIO range is unique and possibly
> none of those devices have a zero _TRA, there could be system memory at
> the equivalent flat memory address.
> 
> So if the transaction actually hits this bus, which I think is what
> making use of the device AddressSpace implies, I don't think it can
> assume that it's simply reflected back at itself.  Conventional PCI and
> PCI Express may be software compatible, but there's a reason we don't
> see IOMMUs that provide both translation and isolation in conventional
> topologies.
> 
> Is this more a bug in the LSI device emulation model?  For instance in
> vfio-pci, if I want to access an offset into a BAR from within QEMU, I
> don't care what address is programmed into that BAR, I perform an
> access relative to the vfio file descriptor region representing that
> BAR space.  I'd expect that any viable device emulation model does the
> same, an access to device memory uses an offset from an internal
> resource, irrespective of the BAR address.
> 
> It would seem strange if the driver is actually programming the device
> to DMA to itself and if that's actually happening, I'd wonder if this

It does look like the driver is actually programming the device to DMA to itself.

The driver first programs the DSP (DMA Scripts Pointer) register with the BAR
address. It does so by performing a series of MMIO writes (lsi_mmio_write())
to offsets 0x2C - 0x2F. Immediately after programming this register, the device
fetches some instructions located at the programmed address.

Thank you!
--
Jag

> driver is actually compatible with an IOMMU on bare metal.
> 
>>> within their own BAR range, and PCIe is software-compatible with PCI. So
>>> devices won't be able to access this IOVA even if it was programmed in
>>> the IOMMU.
>>> 
>>> As was mentioned elsewhere on this thread, devices accessing each
>>> other's BAR is a different matter.
>>> 
>>> I do not remember which rules apply to multiple functions of a
>>> multi-function device though. I think in a traditional PCI
>>> they will never go out on the bus, but with e.g. SRIOV they
>>> would probably do go out? Alex, any idea?
> 
> This falls under implementation specific behavior in the spec, IIRC.
> This is actually why IOMMU grouping requires ACS support on
> multi-function devices to clarify the behavior of p2p between functions
> in the same slot.  Thanks,
> 
> Alex
> 


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH v5 03/18] pci: isolated address space for PCI bus
  2022-02-10 23:49                                               ` Alex Williamson
@ 2022-02-11  0:26                                                 ` Michael S. Tsirkin
  2022-02-11  0:54                                                   ` Jag Raman
  0 siblings, 1 reply; 99+ messages in thread
From: Michael S. Tsirkin @ 2022-02-11  0:26 UTC (permalink / raw)
  To: Alex Williamson
  Cc: eduardo, Elena Ufimtseva, Jag Raman, Beraldo Leal, John Johnson,
	quintela, qemu-devel, armbru, Paolo Bonzini,
	Marc-André Lureau, Dr. David Alan Gilbert, Stefan Hajnoczi,
	thanos.makatos, Daniel P. Berrangé,
	Eric Blake, john.levon, Philippe Mathieu-Daudé

On Thu, Feb 10, 2022 at 04:49:33PM -0700, Alex Williamson wrote:
> On Thu, 10 Feb 2022 18:28:56 -0500
> "Michael S. Tsirkin" <mst@redhat.com> wrote:
> 
> > On Thu, Feb 10, 2022 at 04:17:34PM -0700, Alex Williamson wrote:
> > > On Thu, 10 Feb 2022 22:23:01 +0000
> > > Jag Raman <jag.raman@oracle.com> wrote:
> > >   
> > > > > On Feb 10, 2022, at 3:02 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
> > > > > 
> > > > > On Thu, Feb 10, 2022 at 12:08:27AM +0000, Jag Raman wrote:    
> > > > >> 
> > > > >> Thanks for the explanation, Alex. Thanks to everyone else in the thread who
> > > > >> helped to clarify this problem.
> > > > >> 
> > > > >> We have implemented the memory isolation based on the discussion in the
> > > > >> thread. We will send the patches out shortly.
> > > > >> 
> > > > >> Devices such as “name" and “e1000” worked fine. But I’d like to note that
> > > > >> the LSI device (TYPE_LSI53C895A) had some problems - it doesn’t seem
> > > > >> to be IOMMU aware. In LSI’s case, the kernel driver is asking the device to
> > > > >> read instructions from the CPU VA (lsi_execute_script() -> read_dword()),
> > > > >> which is forbidden when IOMMU is enabled. Specifically, the driver is asking
> > > > >> the device to access other BAR regions by using the BAR address programmed
> > > > >> in the PCI config space. This happens even without vfio-user patches. For example,
> > > > >> we could enable IOMMU using “-device intel-iommu” QEMU option and also
> > > > >> adding the following to the kernel command-line: “intel_iommu=on iommu=nopt”.
> > > > >> In this case, we could see an IOMMU fault.    
> > > > > 
> > > > > So, device accessing its own BAR is different. Basically, these
> > > > > transactions never go on the bus at all, never mind get to the IOMMU.    
> > > > 
> > > > Hi Michael,
> > > > 
> > > > In LSI case, I did notice that it went to the IOMMU. The device is reading the BAR
> > > > address as if it was a DMA address.
> > > >   
> > > > > I think it's just used as a handle to address internal device memory.
> > > > > This kind of trick is not universal, but not terribly unusual.
> > > > > 
> > > > >     
> > > > >> Unfortunately, we started off our project with the LSI device. So that lead to all the
> > > > >> confusion about what is expected at the server end in-terms of
> > > > >> vectoring/address-translation. It gave an impression as if the request was still on
> > > > >> the CPU side of the PCI root complex, but the actual problem was with the
> > > > >> device driver itself.
> > > > >> 
> > > > >> I’m wondering how to deal with this problem. Would it be OK if we mapped the
> > > > >> device’s BAR into the IOVA, at the same CPU VA programmed in the BAR registers?
> > > > >> This would help devices such as LSI to circumvent this problem. One problem
> > > > >> with this approach is that it has the potential to collide with another legitimate
> > > > >> IOVA address. Kindly share your thought on this.
> > > > >> 
> > > > >> Thank you!    
> > > > > 
> > > > > I am not 100% sure what do you plan to do but it sounds fine since even
> > > > > if it collides, with traditional PCI device must never initiate cycles    
> > > > 
> > > > OK sounds good, I’ll create a mapping of the device BARs in the IOVA.  
> > > 
> > > I don't think this is correct.  Look for instance at ACPI _TRA support
> > > where a system can specify a translation offset such that, for example,
> > > a CPU access to a device is required to add the provided offset to the
> > > bus address of the device.  A system using this could have multiple
> > > root bridges, where each is given the same, overlapping MMIO aperture.  
> > > >From the processor perspective, each MMIO range is unique and possibly  
> > > none of those devices have a zero _TRA, there could be system memory at
> > > the equivalent flat memory address.  
> > 
> > I am guessing there are reasons to have these in acpi besides firmware
> > vendors wanting to find corner cases in device implementations though
> > :). E.g. it's possible something else is tweaking DMA in similar ways. I
> > can't say for sure and I wonder why do we care as long as QEMU does not
> > have _TRA.
> 
> How many complaints do we get about running out of I/O port space on
> q35 because we allow an arbitrary number of root ports?  What if we
> used _TRA to provide the full I/O port range per root port?  32-bit
> MMIO could be duplicated as well.

It's an interesting idea. To clarify what I said, I suspect some devices
are broken in presence of translating bridges unless DMA
is also translated to match.

I agree it's a mess though, in that some devices when given their own
BAR to DMA to will probably just satisfy the access from internal
memory, while others will ignore that and send it up as DMA
and both types are probably out there in the field.


> > > So if the transaction actually hits this bus, which I think is what
> > > making use of the device AddressSpace implies, I don't think it can
> > > assume that it's simply reflected back at itself.  Conventional PCI and
> > > PCI Express may be software compatible, but there's a reason we don't
> > > see IOMMUs that provide both translation and isolation in conventional
> > > topologies.
> > > 
> > > Is this more a bug in the LSI device emulation model?  For instance in
> > > vfio-pci, if I want to access an offset into a BAR from within QEMU, I
> > > don't care what address is programmed into that BAR, I perform an
> > > access relative to the vfio file descriptor region representing that
> > > BAR space.  I'd expect that any viable device emulation model does the
> > > same, an access to device memory uses an offset from an internal
> > > resource, irrespective of the BAR address.  
> > 
> > However, using BAR seems like a reasonable shortcut allowing
> > device to use the same 64 bit address to refer to system
> > and device RAM interchangeably.
> 
> A shortcut that breaks when an IOMMU is involved.

Maybe. But if that's how hardware behaves, we have little choice but
emulate it.

> > > It would seem strange if the driver is actually programming the device
> > > to DMA to itself and if that's actually happening, I'd wonder if this
> > > driver is actually compatible with an IOMMU on bare metal.  
> > 
> > You know, it's hardware after all. As long as things work vendors
> > happily ship the device. They don't really worry about theoretical issues ;).
> 
> We're talking about a 32-bit conventional PCI device from the previous
> century.  IOMMUs are no longer theoretical, but not all drivers have
> kept up.  It's maybe not the best choice as the de facto standard
> device, imo.  Thanks,
> 
> Alex

More importantly lots devices most likely don't support arbitrary
configurations and break if you try to create something that matches the
spec but never or even rarely occurs on bare metal.

-- 
MST



^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH v5 03/18] pci: isolated address space for PCI bus
  2022-02-11  0:26                                                 ` Michael S. Tsirkin
@ 2022-02-11  0:54                                                   ` Jag Raman
  0 siblings, 0 replies; 99+ messages in thread
From: Jag Raman @ 2022-02-11  0:54 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: eduardo, Elena Ufimtseva, Daniel P. Berrangé,
	Beraldo Leal, John Johnson, quintela, qemu-devel, armbru,
	Alex Williamson, Marc-André Lureau, Dr. David Alan Gilbert,
	Stefan Hajnoczi, Paolo Bonzini, thanos.makatos, Eric Blake,
	john.levon, Philippe Mathieu-Daudé



> On Feb 10, 2022, at 7:26 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> 
> On Thu, Feb 10, 2022 at 04:49:33PM -0700, Alex Williamson wrote:
>> On Thu, 10 Feb 2022 18:28:56 -0500
>> "Michael S. Tsirkin" <mst@redhat.com> wrote:
>> 
>>> On Thu, Feb 10, 2022 at 04:17:34PM -0700, Alex Williamson wrote:
>>>> On Thu, 10 Feb 2022 22:23:01 +0000
>>>> Jag Raman <jag.raman@oracle.com> wrote:
>>>> 
>>>>>> On Feb 10, 2022, at 3:02 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
>>>>>> 
>>>>>> On Thu, Feb 10, 2022 at 12:08:27AM +0000, Jag Raman wrote:    
>>>>>>> 
>>>>>>> Thanks for the explanation, Alex. Thanks to everyone else in the thread who
>>>>>>> helped to clarify this problem.
>>>>>>> 
>>>>>>> We have implemented the memory isolation based on the discussion in the
>>>>>>> thread. We will send the patches out shortly.
>>>>>>> 
>>>>>>> Devices such as “name" and “e1000” worked fine. But I’d like to note that
>>>>>>> the LSI device (TYPE_LSI53C895A) had some problems - it doesn’t seem
>>>>>>> to be IOMMU aware. In LSI’s case, the kernel driver is asking the device to
>>>>>>> read instructions from the CPU VA (lsi_execute_script() -> read_dword()),
>>>>>>> which is forbidden when IOMMU is enabled. Specifically, the driver is asking
>>>>>>> the device to access other BAR regions by using the BAR address programmed
>>>>>>> in the PCI config space. This happens even without vfio-user patches. For example,
>>>>>>> we could enable IOMMU using “-device intel-iommu” QEMU option and also
>>>>>>> adding the following to the kernel command-line: “intel_iommu=on iommu=nopt”.
>>>>>>> In this case, we could see an IOMMU fault.    
>>>>>> 
>>>>>> So, device accessing its own BAR is different. Basically, these
>>>>>> transactions never go on the bus at all, never mind get to the IOMMU.    
>>>>> 
>>>>> Hi Michael,
>>>>> 
>>>>> In LSI case, I did notice that it went to the IOMMU. The device is reading the BAR
>>>>> address as if it was a DMA address.
>>>>> 
>>>>>> I think it's just used as a handle to address internal device memory.
>>>>>> This kind of trick is not universal, but not terribly unusual.
>>>>>> 
>>>>>> 
>>>>>>> Unfortunately, we started off our project with the LSI device. So that lead to all the
>>>>>>> confusion about what is expected at the server end in-terms of
>>>>>>> vectoring/address-translation. It gave an impression as if the request was still on
>>>>>>> the CPU side of the PCI root complex, but the actual problem was with the
>>>>>>> device driver itself.
>>>>>>> 
>>>>>>> I’m wondering how to deal with this problem. Would it be OK if we mapped the
>>>>>>> device’s BAR into the IOVA, at the same CPU VA programmed in the BAR registers?
>>>>>>> This would help devices such as LSI to circumvent this problem. One problem
>>>>>>> with this approach is that it has the potential to collide with another legitimate
>>>>>>> IOVA address. Kindly share your thought on this.
>>>>>>> 
>>>>>>> Thank you!    
>>>>>> 
>>>>>> I am not 100% sure what do you plan to do but it sounds fine since even
>>>>>> if it collides, with traditional PCI device must never initiate cycles    
>>>>> 
>>>>> OK sounds good, I’ll create a mapping of the device BARs in the IOVA.  
>>>> 
>>>> I don't think this is correct.  Look for instance at ACPI _TRA support
>>>> where a system can specify a translation offset such that, for example,
>>>> a CPU access to a device is required to add the provided offset to the
>>>> bus address of the device.  A system using this could have multiple
>>>> root bridges, where each is given the same, overlapping MMIO aperture.  
>>>>> From the processor perspective, each MMIO range is unique and possibly  
>>>> none of those devices have a zero _TRA, there could be system memory at
>>>> the equivalent flat memory address.  
>>> 
>>> I am guessing there are reasons to have these in acpi besides firmware
>>> vendors wanting to find corner cases in device implementations though
>>> :). E.g. it's possible something else is tweaking DMA in similar ways. I
>>> can't say for sure and I wonder why do we care as long as QEMU does not
>>> have _TRA.
>> 
>> How many complaints do we get about running out of I/O port space on
>> q35 because we allow an arbitrary number of root ports?  What if we
>> used _TRA to provide the full I/O port range per root port?  32-bit
>> MMIO could be duplicated as well.
> 
> It's an interesting idea. To clarify what I said, I suspect some devices
> are broken in presence of translating bridges unless DMA
> is also translated to match.
> 
> I agree it's a mess though, in that some devices when given their own
> BAR to DMA to will probably just satisfy the access from internal
> memory, while others will ignore that and send it up as DMA
> and both types are probably out there in the field.
> 
> 
>>>> So if the transaction actually hits this bus, which I think is what
>>>> making use of the device AddressSpace implies, I don't think it can
>>>> assume that it's simply reflected back at itself.  Conventional PCI and
>>>> PCI Express may be software compatible, but there's a reason we don't
>>>> see IOMMUs that provide both translation and isolation in conventional
>>>> topologies.
>>>> 
>>>> Is this more a bug in the LSI device emulation model?  For instance in
>>>> vfio-pci, if I want to access an offset into a BAR from within QEMU, I
>>>> don't care what address is programmed into that BAR, I perform an
>>>> access relative to the vfio file descriptor region representing that
>>>> BAR space.  I'd expect that any viable device emulation model does the
>>>> same, an access to device memory uses an offset from an internal
>>>> resource, irrespective of the BAR address.  
>>> 
>>> However, using BAR seems like a reasonable shortcut allowing
>>> device to use the same 64 bit address to refer to system
>>> and device RAM interchangeably.
>> 
>> A shortcut that breaks when an IOMMU is involved.
> 
> Maybe. But if that's how hardware behaves, we have little choice but
> emulate it.

I was wondering if we could map the BARs into the IOVA for a limited set of
devices - the ones which are designed before IOMMU such as lsi53c895a.
This would ensure that we follow the spec to the best without breaking
existing devices?

--
Jag

> 
>>>> It would seem strange if the driver is actually programming the device
>>>> to DMA to itself and if that's actually happening, I'd wonder if this
>>>> driver is actually compatible with an IOMMU on bare metal.  
>>> 
>>> You know, it's hardware after all. As long as things work vendors
>>> happily ship the device. They don't really worry about theoretical issues ;).
>> 
>> We're talking about a 32-bit conventional PCI device from the previous
>> century.  IOMMUs are no longer theoretical, but not all drivers have
>> kept up.  It's maybe not the best choice as the de facto standard
>> device, imo.  Thanks,
>> 
>> Alex
> 
> More importantly lots devices most likely don't support arbitrary
> configurations and break if you try to create something that matches the
> spec but never or even rarely occurs on bare metal.
> 
> -- 
> MST
> 


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH v5 01/18] configure, meson: override C compiler for cmake
  2022-01-20 13:27   ` Paolo Bonzini
  2022-01-20 15:21     ` Jag Raman
@ 2022-02-17  6:10     ` Jag Raman
  1 sibling, 0 replies; 99+ messages in thread
From: Jag Raman @ 2022-02-17  6:10 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: eduardo, Elena Ufimtseva, berrange, bleal, John Johnson,
	john.levon, qemu-devel, armbru, quintela, mst, stefanha,
	thanos.makatos, marcandre.lureau, eblake, dgilbert, f4bug



> On Jan 20, 2022, at 8:27 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> 
> On 1/19/22 22:41, Jagannathan Raman wrote:
>> The compiler path that cmake gets from meson is corrupted. It results in
>> the following error:
>> | -- The C compiler identification is unknown
>> | CMake Error at CMakeLists.txt:35 (project):
>> | The CMAKE_C_COMPILER:
>> | /opt/rh/devtoolset-9/root/bin/cc;-m64;-mcx16
>> | is not a full path to an existing compiler tool.
>> Explicitly specify the C compiler for cmake to avoid this error
>> Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
>> Acked-by: Paolo Bonzini <pbonzini@redhat.com>
> 
> This should not be needed anymore, as the bug in Meson has been fixed.

Hi Paolo,

I’m able to see the bug with latest QEMU. The fix doesn’t appear to be
available with meson version 0.59.3, which is what QEMU is
presently using.

Thank you!
--
Jag

> 
> Paolo
> 
>>  configure | 2 ++
>>  1 file changed, 2 insertions(+)
>> diff --git a/configure b/configure
>> index e1a31fb332..6a865f8713 100755
>> --- a/configure
>> +++ b/configure
>> @@ -3747,6 +3747,8 @@ if test "$skip_meson" = no; then
>>    echo "cpp_args = [$(meson_quote $CXXFLAGS $EXTRA_CXXFLAGS)]" >> $cross
>>    echo "c_link_args = [$(meson_quote $CFLAGS $LDFLAGS $EXTRA_CFLAGS $EXTRA_LDFLAGS)]" >> $cross
>>    echo "cpp_link_args = [$(meson_quote $CXXFLAGS $LDFLAGS $EXTRA_CXXFLAGS $EXTRA_LDFLAGS)]" >> $cross
>> +  echo "[cmake]" >> $cross
>> +  echo "CMAKE_C_COMPILER = [$(meson_quote $cc $CPU_CFLAGS)]" >> $cross
>>    echo "[binaries]" >> $cross
>>    echo "c = [$(meson_quote $cc $CPU_CFLAGS)]" >> $cross
>>    test -n "$cxx" && echo "cpp = [$(meson_quote $cxx $CPU_CFLAGS)]" >> $cross
> 


^ permalink raw reply	[flat|nested] 99+ messages in thread

end of thread, other threads:[~2022-02-17  6:16 UTC | newest]

Thread overview: 99+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-01-19 21:41 [PATCH v5 00/18] vfio-user server in QEMU Jagannathan Raman
2022-01-19 21:41 ` [PATCH v5 01/18] configure, meson: override C compiler for cmake Jagannathan Raman
2022-01-20 13:27   ` Paolo Bonzini
2022-01-20 15:21     ` Jag Raman
2022-02-17  6:10     ` Jag Raman
2022-01-19 21:41 ` [PATCH v5 02/18] tests/avocado: Specify target VM argument to helper routines Jagannathan Raman
2022-01-25  9:40   ` Stefan Hajnoczi
2022-01-19 21:41 ` [PATCH v5 03/18] pci: isolated address space for PCI bus Jagannathan Raman
2022-01-20  0:12   ` Michael S. Tsirkin
2022-01-20 15:20     ` Jag Raman
2022-01-25 18:38       ` Dr. David Alan Gilbert
2022-01-26  5:27         ` Jag Raman
2022-01-26  9:45           ` Stefan Hajnoczi
2022-01-26 20:07             ` Dr. David Alan Gilbert
2022-01-26 21:13               ` Michael S. Tsirkin
2022-01-27  8:30                 ` Stefan Hajnoczi
2022-01-27 12:50                   ` Michael S. Tsirkin
2022-01-27 21:22                   ` Alex Williamson
2022-01-28  8:19                     ` Stefan Hajnoczi
2022-01-28  9:18                     ` Stefan Hajnoczi
2022-01-31 16:16                       ` Alex Williamson
2022-02-01  9:30                         ` Stefan Hajnoczi
2022-02-01 15:24                           ` Alex Williamson
2022-02-01 21:24                             ` Jag Raman
2022-02-01 22:47                               ` Alex Williamson
2022-02-02  1:13                                 ` Jag Raman
2022-02-02  5:34                                   ` Alex Williamson
2022-02-02  9:22                                     ` Stefan Hajnoczi
2022-02-10  0:08                                     ` Jag Raman
2022-02-10  8:02                                       ` Michael S. Tsirkin
2022-02-10 22:23                                         ` Jag Raman
2022-02-10 22:53                                           ` Michael S. Tsirkin
2022-02-10 23:46                                             ` Jag Raman
2022-02-10 23:17                                           ` Alex Williamson
2022-02-10 23:28                                             ` Michael S. Tsirkin
2022-02-10 23:49                                               ` Alex Williamson
2022-02-11  0:26                                                 ` Michael S. Tsirkin
2022-02-11  0:54                                                   ` Jag Raman
2022-02-11  0:10                                             ` Jag Raman
2022-02-02  9:30                                 ` Peter Maydell
2022-02-02 10:06                                   ` Michael S. Tsirkin
2022-02-02 15:49                                     ` Alex Williamson
2022-02-02 16:53                                       ` Michael S. Tsirkin
2022-02-02 17:12                                   ` Alex Williamson
2022-02-01 10:42                     ` Dr. David Alan Gilbert
2022-01-26 18:13           ` Dr. David Alan Gilbert
2022-01-27 17:43             ` Jag Raman
2022-01-25  9:56   ` Stefan Hajnoczi
2022-01-25 13:49     ` Jag Raman
2022-01-25 14:19       ` Stefan Hajnoczi
2022-01-19 21:41 ` [PATCH v5 04/18] pci: create and free isolated PCI buses Jagannathan Raman
2022-01-25 10:25   ` Stefan Hajnoczi
2022-01-25 14:10     ` Jag Raman
2022-01-19 21:41 ` [PATCH v5 05/18] qdev: unplug blocker for devices Jagannathan Raman
2022-01-25 10:27   ` Stefan Hajnoczi
2022-01-25 14:43     ` Jag Raman
2022-01-26  9:32       ` Stefan Hajnoczi
2022-01-26 15:13         ` Jag Raman
2022-01-19 21:41 ` [PATCH v5 06/18] vfio-user: add HotplugHandler for remote machine Jagannathan Raman
2022-01-25 10:32   ` Stefan Hajnoczi
2022-01-25 18:12     ` Jag Raman
2022-01-26  9:35       ` Stefan Hajnoczi
2022-01-26 15:20         ` Jag Raman
2022-01-26 15:43           ` Stefan Hajnoczi
2022-01-19 21:41 ` [PATCH v5 07/18] vfio-user: set qdev bus callbacks " Jagannathan Raman
2022-01-25 10:44   ` Stefan Hajnoczi
2022-01-25 21:12     ` Jag Raman
2022-01-26  9:37       ` Stefan Hajnoczi
2022-01-26 15:51         ` Jag Raman
2022-01-19 21:41 ` [PATCH v5 08/18] vfio-user: build library Jagannathan Raman
2022-01-19 21:41 ` [PATCH v5 09/18] vfio-user: define vfio-user-server object Jagannathan Raman
2022-01-25 14:40   ` Stefan Hajnoczi
2022-01-19 21:41 ` [PATCH v5 10/18] vfio-user: instantiate vfio-user context Jagannathan Raman
2022-01-25 14:44   ` Stefan Hajnoczi
2022-01-19 21:42 ` [PATCH v5 11/18] vfio-user: find and init PCI device Jagannathan Raman
2022-01-25 14:48   ` Stefan Hajnoczi
2022-01-26  3:14     ` Jag Raman
2022-01-19 21:42 ` [PATCH v5 12/18] vfio-user: run vfio-user context Jagannathan Raman
2022-01-25 15:10   ` Stefan Hajnoczi
2022-01-26  3:26     ` Jag Raman
2022-01-19 21:42 ` [PATCH v5 13/18] vfio-user: handle PCI config space accesses Jagannathan Raman
2022-01-25 15:13   ` Stefan Hajnoczi
2022-01-19 21:42 ` [PATCH v5 14/18] vfio-user: handle DMA mappings Jagannathan Raman
2022-01-19 21:42 ` [PATCH v5 15/18] vfio-user: handle PCI BAR accesses Jagannathan Raman
2022-01-19 21:42 ` [PATCH v5 16/18] vfio-user: handle device interrupts Jagannathan Raman
2022-01-25 15:25   ` Stefan Hajnoczi
2022-01-19 21:42 ` [PATCH v5 17/18] vfio-user: register handlers to facilitate migration Jagannathan Raman
2022-01-25 15:48   ` Stefan Hajnoczi
2022-01-27 17:04     ` Jag Raman
2022-01-28  8:29       ` Stefan Hajnoczi
2022-01-28 14:49         ` Thanos Makatos
2022-02-01  3:49         ` Jag Raman
2022-02-01  9:37           ` Stefan Hajnoczi
2022-01-19 21:42 ` [PATCH v5 18/18] vfio-user: avocado tests for vfio-user Jagannathan Raman
2022-01-26  4:25   ` Philippe Mathieu-Daudé via
2022-01-26 15:12     ` Jag Raman
2022-01-25 16:00 ` [PATCH v5 00/18] vfio-user server in QEMU Stefan Hajnoczi
2022-01-26  5:04   ` Jag Raman
2022-01-26  9:56     ` Stefan Hajnoczi

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.