* [PATCH 0/4] vl: Prioritize device realizations
@ 2021-08-18 19:42 Peter Xu
  2021-08-18 19:42 ` [PATCH 1/4] qdev-monitor: Trace qdev creation Peter Xu
                   ` (4 more replies)
  0 siblings, 5 replies; 52+ messages in thread
From: Peter Xu @ 2021-08-18 19:42 UTC (permalink / raw)
  To: qemu-devel
  Cc: Daniel P . Berrangé,
	Eduardo Habkost, Michael S . Tsirkin, Jason Wang,
	Markus Armbruster, peterx, Dr . David Alan Gilbert, Eric Auger,
	Alex Williamson, Paolo Bonzini

This is a long-pending issue that we haven't fixed.  The issue is that QEMU
has an implicit device ordering requirement at realization time; if the
ordering is wrong, some devices may not work properly.

The initial requirement dates from when vfio-pci started to work with vIOMMUs.
To make sure vfio-pci gets the correct DMA address space, the vIOMMU device
needs to be created before vfio-pci; otherwise vfio-pci will stop working when
the guest enables the vIOMMU and the device at the same time.
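
For example (an illustration only; the exact options depend on the setup), the
vIOMMU currently must appear before the assigned device on the cmdline so that
vfio-pci picks up the vIOMMU-managed DMA address space:

  $ qemu-system-x86_64 -machine q35,kernel-irqchip=split \
                       -device intel-iommu,intremap=on \
                       -device vfio-pci,host=01:00.0 ...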

AFAIU Libvirt should have code that guarantees this ordering.  QEMU cmdline
users need to pay attention, or things will stop working at some point.

Recently there's a similar and growing requirement for vDPA.  It's not a hard
requirement so far, but vDPA has patches that try to work around this issue.

This patchset allows us to realize devices in a prioritized order, so that
e.g. platform devices (bus devices, IOMMUs, etc.) are created first, then the
rest of the normal devices.  It's done simply by sorting the QemuOptsList of
"device" entries before realization.  The priority currently comes from the
migration priorities, which may look a little odd, but migration solves
essentially the same ordering problem, and we can clean that part up in the
future.

Libvirt can of course keep its current ordering so that old QEMU binaries will
still work; however that ordering won't be needed after this patchset, so with
a new binary we should be able to specify '-device' options on the cmdline in
any order.

Logically this should also work for vDPA, so its workaround code could be
replaced with a more straightforward approach.

Please review, thanks.

Peter Xu (4):
  qdev-monitor: Trace qdev creation
  qemu-config: Allow in-place sorting of QemuOptsList
  qdev: Export qdev_get_device_class()
  vl: Prioritize realizations of devices

 include/monitor/qdev.h     |  2 ++
 include/qemu/config-file.h |  4 ++++
 softmmu/qdev-monitor.c     |  4 +++-
 softmmu/trace-events       |  3 +++
 softmmu/vl.c               | 35 +++++++++++++++++++++++++++
 util/qemu-config.c         | 48 ++++++++++++++++++++++++++++++++++++++
 6 files changed, 95 insertions(+), 1 deletion(-)

-- 
2.31.1




* [PATCH 1/4] qdev-monitor: Trace qdev creation
  2021-08-18 19:42 [PATCH 0/4] vl: Prioritize device realizations Peter Xu
@ 2021-08-18 19:42 ` Peter Xu
  2021-08-18 19:43 ` [PATCH 2/4] qemu-config: Allow in-place sorting of QemuOptsList Peter Xu
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 52+ messages in thread
From: Peter Xu @ 2021-08-18 19:42 UTC (permalink / raw)
  To: qemu-devel
  Cc: Daniel P . Berrangé,
	Eduardo Habkost, Michael S . Tsirkin, Jason Wang,
	Markus Armbruster, peterx, Dr . David Alan Gilbert, Eric Auger,
	Alex Williamson, Paolo Bonzini

Add a tracepoint for qdev creation.  This can be used to see which devices are
created under the hood, along with the order of their creation.
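
Example usage (a hypothetical invocation via the existing -trace option;
output shown without the trace backend's own line prefix):

  $ qemu-system-x86_64 -M q35 -trace qdev_device_add \
                       -device intel-iommu,id=iommu0 \
                       -device virtio-net-pci,id=net0
  qdev_device_add driver 'intel-iommu' id 'iommu0'
  qdev_device_add driver 'virtio-net-pci' id 'net0'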

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 softmmu/qdev-monitor.c | 2 ++
 softmmu/trace-events   | 3 +++
 2 files changed, 5 insertions(+)

diff --git a/softmmu/qdev-monitor.c b/softmmu/qdev-monitor.c
index 721dec2d82..8602164082 100644
--- a/softmmu/qdev-monitor.c
+++ b/softmmu/qdev-monitor.c
@@ -40,6 +40,7 @@
 #include "qemu/cutils.h"
 #include "hw/qdev-properties.h"
 #include "hw/clock.h"
+#include "trace.h"
 
 /*
  * Aliases were a bad idea from the start.  Let's keep them
@@ -674,6 +675,7 @@ DeviceState *qdev_device_add(QemuOpts *opts, Error **errp)
         dev->opts = NULL;
         goto err_del_dev;
     }
+    trace_qdev_device_add(driver, qemu_opts_id(opts));
     return dev;
 
 err_del_dev:
diff --git a/softmmu/trace-events b/softmmu/trace-events
index 7b278590a0..fad85e9d5c 100644
--- a/softmmu/trace-events
+++ b/softmmu/trace-events
@@ -29,3 +29,6 @@ runstate_set(int current_state, const char *current_state_str, int new_state, co
 system_wakeup_request(int reason) "reason=%d"
 qemu_system_shutdown_request(int reason) "reason=%d"
 qemu_system_powerdown_request(void) ""
+
+# qdev-monitor.c
+qdev_device_add(const char *driver, const char *id) "driver '%s' id '%s'"
-- 
2.31.1




* [PATCH 2/4] qemu-config: Allow in-place sorting of QemuOptsList
  2021-08-18 19:42 [PATCH 0/4] vl: Prioritize device realizations Peter Xu
  2021-08-18 19:42 ` [PATCH 1/4] qdev-monitor: Trace qdev creation Peter Xu
@ 2021-08-18 19:43 ` Peter Xu
  2021-08-18 19:43 ` [PATCH 3/4] qdev: Export qdev_get_device_class() Peter Xu
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 52+ messages in thread
From: Peter Xu @ 2021-08-18 19:43 UTC (permalink / raw)
  To: qemu-devel
  Cc: Daniel P . Berrangé,
	Eduardo Habkost, Michael S . Tsirkin, Jason Wang,
	Markus Armbruster, peterx, Dr . David Alan Gilbert, Eric Auger,
	Alex Williamson, Paolo Bonzini

Add a helper, qemu_sort_opts(), to allow in-place sorting of a QemuOptsList.
The caller provides a function of the type qemu_opts_pri_fn, which takes a
QemuOpts pointer and returns a number representing the priority of that
QemuOpts entry.
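
For example, a caller could push a specific driver to the front of the
"device" list like this (an illustrative sketch only; my_device_priority is a
made-up callback, not part of this patch):

  static int my_device_priority(QemuOpts *opts)
  {
      const char *driver = qemu_opt_get(opts, "driver");

      /* The sort is ascending, so negative priorities move to the front */
      return (driver && g_str_equal(driver, "intel-iommu")) ? -1 : 0;
  }

  qemu_sort_opts("device", my_device_priority);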

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/qemu/config-file.h |  4 ++++
 util/qemu-config.c         | 48 ++++++++++++++++++++++++++++++++++++++
 2 files changed, 52 insertions(+)

diff --git a/include/qemu/config-file.h b/include/qemu/config-file.h
index f605423321..ce50a72985 100644
--- a/include/qemu/config-file.h
+++ b/include/qemu/config-file.h
@@ -3,8 +3,12 @@
 
 typedef void QEMUConfigCB(const char *group, QDict *qdict, void *opaque, Error **errp);
 
+/* Returns the priority for a QemuOpts */
+typedef int (*qemu_opts_pri_fn)(QemuOpts *opt);
+
 void qemu_load_module_for_opts(const char *group);
 QemuOptsList *qemu_find_opts(const char *group);
+void qemu_sort_opts(const char *group, qemu_opts_pri_fn fn);
 QemuOptsList *qemu_find_opts_err(const char *group, Error **errp);
 QemuOpts *qemu_find_opts_singleton(const char *group);
 
diff --git a/util/qemu-config.c b/util/qemu-config.c
index 436ab63b16..e882dc948b 100644
--- a/util/qemu-config.c
+++ b/util/qemu-config.c
@@ -7,6 +7,7 @@
 #include "qapi/qmp/qlist.h"
 #include "qemu/error-report.h"
 #include "qemu/option.h"
+#include "qemu/option_int.h"
 #include "qemu/config-file.h"
 
 static QemuOptsList *vm_config_groups[48];
@@ -41,6 +42,53 @@ QemuOptsList *qemu_find_opts(const char *group)
     return ret;
 }
 
+struct QemuOptsSortEntry {
+    QemuOpts *opts;
+    int priority;
+} __attribute__ ((__aligned__(sizeof(void *))));
+typedef struct QemuOptsSortEntry QemuOptsSortEntry;
+
+static int qemu_opts_cmp_fn(const void *opts_1, const void *opts_2)
+{
+    QemuOptsSortEntry *entry1, *entry2;
+
+    entry1 = (QemuOptsSortEntry *)opts_1;
+    entry2 = (QemuOptsSortEntry *)opts_2;
+
+    return entry1->priority - entry2->priority;
+}
+
+void qemu_sort_opts(const char *group, qemu_opts_pri_fn fn)
+{
+    QemuOptsSortEntry *entries, *entry;
+    QemuOpts *opts, *next_opts;
+    int i = 0, count = 0;
+    QemuOptsList *list;
+
+    list = find_list(vm_config_groups, group, &error_abort);
+
+    QTAILQ_FOREACH(opts, &list->head, next) {
+        count++;
+    }
+
+    entries = g_new0(QemuOptsSortEntry, count);
+    QTAILQ_FOREACH_SAFE(opts, &list->head, next, next_opts) {
+        entry = &entries[i++];
+        entry->opts = opts;
+        entry->priority = fn(opts);
+        /* Temporarily remove them; will add them back later */
+        QTAILQ_REMOVE(&list->head, opts, next);
+    }
+
+    qsort(entries, count, sizeof(QemuOptsSortEntry), qemu_opts_cmp_fn);
+
+    for (i = 0; i < count; i++) {
+        QTAILQ_INSERT_TAIL(&list->head, entries[i].opts, next);
+    }
+
+    g_free(entries);
+}
+
 QemuOpts *qemu_find_opts_singleton(const char *group)
 {
     QemuOptsList *list;
-- 
2.31.1




* [PATCH 3/4] qdev: Export qdev_get_device_class()
  2021-08-18 19:42 [PATCH 0/4] vl: Prioritize device realizations Peter Xu
  2021-08-18 19:42 ` [PATCH 1/4] qdev-monitor: Trace qdev creation Peter Xu
  2021-08-18 19:43 ` [PATCH 2/4] qemu-config: Allow in-place sorting of QemuOptsList Peter Xu
@ 2021-08-18 19:43 ` Peter Xu
  2021-08-18 19:43 ` [PATCH 4/4] vl: Prioritize realizations of devices Peter Xu
  2021-10-20 13:44 ` [PATCH 0/4] vl: Prioritize device realizations David Hildenbrand
  4 siblings, 0 replies; 52+ messages in thread
From: Peter Xu @ 2021-08-18 19:43 UTC (permalink / raw)
  To: qemu-devel
  Cc: Daniel P . Berrangé,
	Eduardo Habkost, Michael S . Tsirkin, Jason Wang,
	Markus Armbruster, peterx, Dr . David Alan Gilbert, Eric Auger,
	Alex Williamson, Paolo Bonzini

It'll be used outside the current source file.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/monitor/qdev.h | 2 ++
 softmmu/qdev-monitor.c | 2 +-
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/include/monitor/qdev.h b/include/monitor/qdev.h
index eaa947d73a..a783ad35b9 100644
--- a/include/monitor/qdev.h
+++ b/include/monitor/qdev.h
@@ -2,12 +2,14 @@
 #define MONITOR_QDEV_H
 
 /*** monitor commands ***/
+#include "hw/qdev-core.h"
 
 void hmp_info_qtree(Monitor *mon, const QDict *qdict);
 void hmp_info_qdm(Monitor *mon, const QDict *qdict);
 void qmp_device_add(QDict *qdict, QObject **ret_data, Error **errp);
 
 int qdev_device_help(QemuOpts *opts);
+DeviceClass *qdev_get_device_class(const char **driver, Error **errp);
 DeviceState *qdev_device_add(QemuOpts *opts, Error **errp);
 void qdev_set_id(DeviceState *dev, const char *id);
 
diff --git a/softmmu/qdev-monitor.c b/softmmu/qdev-monitor.c
index 8602164082..610745467c 100644
--- a/softmmu/qdev-monitor.c
+++ b/softmmu/qdev-monitor.c
@@ -220,7 +220,7 @@ static const char *find_typename_by_alias(const char *alias)
     return NULL;
 }
 
-static DeviceClass *qdev_get_device_class(const char **driver, Error **errp)
+DeviceClass *qdev_get_device_class(const char **driver, Error **errp)
 {
     ObjectClass *oc;
     DeviceClass *dc;
-- 
2.31.1




* [PATCH 4/4] vl: Prioritize realizations of devices
  2021-08-18 19:42 [PATCH 0/4] vl: Prioritize device realizations Peter Xu
                   ` (2 preceding siblings ...)
  2021-08-18 19:43 ` [PATCH 3/4] qdev: Export qdev_get_device_class() Peter Xu
@ 2021-08-18 19:43 ` Peter Xu
  2021-08-23 18:49   ` Eduardo Habkost
  2021-10-20 13:44 ` [PATCH 0/4] vl: Prioritize device realizations David Hildenbrand
  4 siblings, 1 reply; 52+ messages in thread
From: Peter Xu @ 2021-08-18 19:43 UTC (permalink / raw)
  To: qemu-devel
  Cc: Daniel P . Berrangé,
	Eduardo Habkost, Michael S . Tsirkin, Jason Wang,
	Markus Armbruster, peterx, Dr . David Alan Gilbert, Eric Auger,
	Alex Williamson, Paolo Bonzini

QEMU creates -device objects in the order specified on the user's cmdline.
However that ordering may not be ideal.  For example, some platform
devices (vIOMMUs) may need to be created earlier than most other
devices (e.g., vfio-pci, virtio).

This patch sorts the QemuOptsList of '-device' entries before kicking off
the device realizations.  This will allow the device realization code to
use APIs like pci_device_iommu_address_space() correctly, because those
functions rely on the platform devices being realized.

Now we rely on vmsd->priority, which is defined as a MigrationPriority, to
provide the ordering, as both VM init and migration completion need such an
ordering.  In the future we can move that priority information out of vmsd.
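
As a sketch of that dependency: a PCI device that resolves its DMA address
space at realize time, e.g. via

  AddressSpace *as = pci_device_iommu_address_space(pci_dev);

only gets the vIOMMU-managed address space if the vIOMMU is already realized;
otherwise it silently falls back to the system memory address space.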

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 softmmu/vl.c | 35 +++++++++++++++++++++++++++++++++++
 1 file changed, 35 insertions(+)

diff --git a/softmmu/vl.c b/softmmu/vl.c
index 5ca11e7469..3a30dfe27d 100644
--- a/softmmu/vl.c
+++ b/softmmu/vl.c
@@ -126,6 +126,7 @@
 #include "qapi/qmp/qerror.h"
 #include "sysemu/iothread.h"
 #include "qemu/guest-random.h"
+#include "migration/vmstate.h"
 
 #include "config-host.h"
 
@@ -2627,6 +2628,35 @@ static void qemu_init_board(void)
     }
 }
 
+/* Return the priority of the device option; zero is the default priority */
+static int qemu_opts_device_priority(QemuOpts *opts)
+{
+    const char *driver;
+    DeviceClass *dc;
+
+    driver = qemu_opt_get(opts, "driver");
+    if (!driver) {
+        return 0;
+    }
+
+    dc = qdev_get_device_class(&driver, NULL);
+    if (!dc) {
+        return 0;
+    }
+
+    if (!dc->vmsd) {
+        return 0;
+    }
+
+    /*
+     * Currently we rely on the vmsd priority because it solves the same
+     * ordering problem as device realization, just for migration.  In the
+     * future, we can move it out of vmsd, but that's not urgently needed.
+     * Return the negative of it so the list is sorted in descending order.
+     */
+    return -dc->vmsd->priority;
+}
+
 static void qemu_create_cli_devices(void)
 {
     soundhw_init();
@@ -2642,6 +2672,11 @@ static void qemu_create_cli_devices(void)
 
     /* init generic devices */
     rom_set_order_override(FW_CFG_ORDER_OVERRIDE_DEVICE);
+    /*
+     * Sort all the -device parameters; e.g., platform devices like vIOMMU
+     * should be initialized earlier.
+     */
+    qemu_sort_opts("device", qemu_opts_device_priority);
     qemu_opts_foreach(qemu_find_opts("device"),
                       device_init_func, NULL, &error_fatal);
     rom_reset_order_override();
-- 
2.31.1




* Re: [PATCH 4/4] vl: Prioritize realizations of devices
  2021-08-18 19:43 ` [PATCH 4/4] vl: Prioritize realizations of devices Peter Xu
@ 2021-08-23 18:49   ` Eduardo Habkost
  2021-08-23 19:18     ` Peter Xu
  0 siblings, 1 reply; 52+ messages in thread
From: Eduardo Habkost @ 2021-08-23 18:49 UTC (permalink / raw)
  To: Peter Xu
  Cc: Daniel P . Berrangé,
	Michael S . Tsirkin, Jason Wang, qemu-devel, Markus Armbruster,
	Eric Auger, Alex Williamson, Paolo Bonzini,
	Dr . David Alan Gilbert

On Wed, Aug 18, 2021 at 03:43:18PM -0400, Peter Xu wrote:
> QEMU creates -device objects in the order specified on the user's cmdline.
> However that ordering may not be ideal.  For example, some platform
> devices (vIOMMUs) may need to be created earlier than most other
> devices (e.g., vfio-pci, virtio).
> 
> This patch sorts the QemuOptsList of '-device' entries before kicking off
> the device realizations.  This will allow the device realization code to
> use APIs like pci_device_iommu_address_space() correctly, because those
> functions rely on the platform devices being realized.
> 
> Now we rely on vmsd->priority, which is defined as a MigrationPriority, to
> provide the ordering, as both VM init and migration completion need such an
> ordering.  In the future we can move that priority information out of vmsd.
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>

Can we be 100% sure that changing the ordering of every single
device being created won't affect guest ABI?  (I don't think we can)

How many device types in QEMU have non-default vmsd priority?

Can we at least ensure devices with the same priority won't be
reordered, just to be safe?  (qsort() doesn't guarantee that)

If very few device types have non-default vmsd priority and
devices with the same priority aren't reordered, the risk of
compatibility breakage would be much smaller.

-- 
Eduardo




* Re: [PATCH 4/4] vl: Prioritize realizations of devices
  2021-08-23 18:49   ` Eduardo Habkost
@ 2021-08-23 19:18     ` Peter Xu
  2021-08-23 21:07       ` Eduardo Habkost
                         ` (2 more replies)
  0 siblings, 3 replies; 52+ messages in thread
From: Peter Xu @ 2021-08-23 19:18 UTC (permalink / raw)
  To: Eduardo Habkost
  Cc: Daniel P . Berrangé,
	Michael S . Tsirkin, Jason Wang, qemu-devel, Markus Armbruster,
	Eric Auger, Alex Williamson, Paolo Bonzini,
	Dr . David Alan Gilbert

On Mon, Aug 23, 2021 at 02:49:12PM -0400, Eduardo Habkost wrote:
> On Wed, Aug 18, 2021 at 03:43:18PM -0400, Peter Xu wrote:
> > QEMU creates -device objects in the order specified on the user's cmdline.
> > However that ordering may not be ideal.  For example, some platform
> > devices (vIOMMUs) may need to be created earlier than most other
> > devices (e.g., vfio-pci, virtio).
> > 
> > This patch sorts the QemuOptsList of '-device' entries before kicking off
> > the device realizations.  This will allow the device realization code to
> > use APIs like pci_device_iommu_address_space() correctly, because those
> > functions rely on the platform devices being realized.
> > 
> > Now we rely on vmsd->priority, which is defined as a MigrationPriority, to
> > provide the ordering, as both VM init and migration completion need such an
> > ordering.  In the future we can move that priority information out of vmsd.
> > 
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> 
> Can we be 100% sure that changing the ordering of every single
> device being created won't affect guest ABI?  (I don't think we can)

That's a good question, however I doubt whether there's any real-world guest
ABI for that.  As a developer, I normally specify cmdline parameters in an
ad-hoc way, so I assume most parameters are not sensitive to ordering and I
can tune the ordering as I wish.  I'm not sure whether that's common for qemu
users; I would expect so, but I may have missed something that I'm not aware of.

To my knowledge the only "guest ABI" change is e.g. when we specify "vfio-pci"
before "intel-iommu": it'll be constantly broken before this patchset, while
after this series it'll work.  It's just that I don't think that "guest ABI"
needs to be kept, and that's exactly what I want to fix with the patchset.

> 
> How many device types in QEMU have non-default vmsd priority?

Not that many; here's the list of priorities and the devices using them:

       |--------------------+---------|
       | priority           | devices |
       |--------------------+---------|
       | MIG_PRI_IOMMU      |       3 |
       | MIG_PRI_PCI_BUS    |       7 |
       | MIG_PRI_VIRTIO_MEM |       1 |
       | MIG_PRI_GICV3_ITS  |       1 |
       | MIG_PRI_GICV3      |       1 |
       |--------------------+---------|

All the rest of the devices are using the default (0) priority.

> 
> Can we at least ensure devices with the same priority won't be
> reordered, just to be safe?  (qsort() doesn't guarantee that)
> 
> If very few device types have non-default vmsd priority and
> devices with the same priority aren't reordered, the risk of
> compatibility breakage would be much smaller.

I'm also wondering whether it might actually be a good thing to break some
guest ABI with this change, where possible.

Let's imagine something breaks after this is applied; the only reason would be
that qsort() changed the order of some same-priority devices so it's no longer
the order the user specified.  But then, doesn't that also mean there's yet
another ordering requirement that we didn't even notice?

I doubt whether that'll even happen (otherwise I think there would be reports
already, as the qemu man page states no requirement on parameter ordering).  In
any case, instead of "keeping same-priority devices in the same order as the
user specified", IMHO we should give the broken devices different priorities,
so the ordering is guaranteed by qemu internals rather than by how the user
specified it.

From that pov, maybe it would be best for this patchset to be accepted and
applied in the early stage of a release, so we can figure out what's missing
and fix it within the same release.  However, again, I still doubt whether any
user will break in a bad way.

Thanks,

-- 
Peter Xu




* Re: [PATCH 4/4] vl: Prioritize realizations of devices
  2021-08-23 19:18     ` Peter Xu
@ 2021-08-23 21:07       ` Eduardo Habkost
  2021-08-23 21:31         ` Peter Xu
  2021-08-23 22:05       ` Michael S. Tsirkin
  2021-08-24  2:51       ` Jason Wang
  2 siblings, 1 reply; 52+ messages in thread
From: Eduardo Habkost @ 2021-08-23 21:07 UTC (permalink / raw)
  To: Peter Xu
  Cc: Daniel P . Berrangé,
	Michael S . Tsirkin, Jason Wang, qemu-devel, Markus Armbruster,
	Eric Auger, Alex Williamson, Paolo Bonzini,
	Dr . David Alan Gilbert

On Mon, Aug 23, 2021 at 03:18:51PM -0400, Peter Xu wrote:
> On Mon, Aug 23, 2021 at 02:49:12PM -0400, Eduardo Habkost wrote:
> > On Wed, Aug 18, 2021 at 03:43:18PM -0400, Peter Xu wrote:
> > > QEMU creates -device objects in the order specified on the user's cmdline.
> > > However that ordering may not be ideal.  For example, some platform
> > > devices (vIOMMUs) may need to be created earlier than most other
> > > devices (e.g., vfio-pci, virtio).
> > > 
> > > This patch sorts the QemuOptsList of '-device' entries before kicking off
> > > the device realizations.  This will allow the device realization code to
> > > use APIs like pci_device_iommu_address_space() correctly, because those
> > > functions rely on the platform devices being realized.
> > > 
> > > Now we rely on vmsd->priority, which is defined as a MigrationPriority, to
> > > provide the ordering, as both VM init and migration completion need such an
> > > ordering.  In the future we can move that priority information out of vmsd.
> > > 
> > > Signed-off-by: Peter Xu <peterx@redhat.com>
> > 
> > Can we be 100% sure that changing the ordering of every single
> > device being created won't affect guest ABI?  (I don't think we can)
> 
> That's a good question, however I doubt whether there's any real-world guest
> ABI for that.  As a developer, I normally specify cmdline parameters in an
> ad-hoc way, so I assume most parameters are not sensitive to ordering and I
> can tune the ordering as I wish.  I'm not sure whether that's common for qemu
> users; I would expect so, but I may have missed something that I'm not aware of.

To give just one example:

$ (echo 'info pci';echo quit;) | qemu-system-x86_64 -device virtio-net-pci -device e1000e -monitor stdio | tail -n 20
  Bus  0, device   4, function 0:
    Ethernet controller: PCI device 1af4:1000
      PCI subsystem 1af4:0001
      IRQ 0, pin A
      BAR0: I/O at 0xffffffffffffffff [0x001e].
      BAR1: 32 bit memory at 0xffffffffffffffff [0x00000ffe].
      BAR4: 64 bit prefetchable memory at 0xffffffffffffffff [0x00003ffe].
      BAR6: 32 bit memory at 0xffffffffffffffff [0x0003fffe].
      id ""
  Bus  0, device   5, function 0:
    Ethernet controller: PCI device 8086:10d3
      PCI subsystem 8086:0000
      IRQ 0, pin A
      BAR0: 32 bit memory at 0xffffffffffffffff [0x0001fffe].
      BAR1: 32 bit memory at 0xffffffffffffffff [0x0001fffe].
      BAR2: I/O at 0xffffffffffffffff [0x001e].
      BAR3: 32 bit memory at 0xffffffffffffffff [0x00003ffe].
      BAR6: 32 bit memory at 0xffffffffffffffff [0x0003fffe].
      id ""
(qemu) quit
$ (echo 'info pci';echo quit;) | qemu-system-x86_64 -device e1000e -device virtio-net-pci -monitor stdio | tail -n 20
  Bus  0, device   4, function 0:
    Ethernet controller: PCI device 8086:10d3
      PCI subsystem 8086:0000
      IRQ 0, pin A
      BAR0: 32 bit memory at 0xffffffffffffffff [0x0001fffe].
      BAR1: 32 bit memory at 0xffffffffffffffff [0x0001fffe].
      BAR2: I/O at 0xffffffffffffffff [0x001e].
      BAR3: 32 bit memory at 0xffffffffffffffff [0x00003ffe].
      BAR6: 32 bit memory at 0xffffffffffffffff [0x0003fffe].
      id ""
  Bus  0, device   5, function 0:
    Ethernet controller: PCI device 1af4:1000
      PCI subsystem 1af4:0001
      IRQ 0, pin A
      BAR0: I/O at 0xffffffffffffffff [0x001e].
      BAR1: 32 bit memory at 0xffffffffffffffff [0x00000ffe].
      BAR4: 64 bit prefetchable memory at 0xffffffffffffffff [0x00003ffe].
      BAR6: 32 bit memory at 0xffffffffffffffff [0x0003fffe].
      id ""
(qemu) quit


If the order of the -device arguments changes, the devices are assigned to
different PCI slots.


> 
> To my knowledge the only "guest ABI" change is e.g. when we specify "vfio-pci"
> before "intel-iommu": it'll be constantly broken before this patchset, while
> after this series it'll work.  It's just that I don't think that "guest ABI"
> needs to be kept, and that's exactly what I want to fix with the patchset.

If the only ordering changes caused by this patch were intentional and affected
only configurations that are known to be broken (like vfio-pci vs intel-iommu),
I would agree.

However, if we are reordering every single -device option in an unspecified way
(like qsort() does when elements compare as equal), we are probably breaking
guest ABI and creating a completely different machine (like in the PCI example above).


> 
> > 
> > How many device types in QEMU have non-default vmsd priority?
> 
> Not that many; here's the list of priorities and the devices using them:
> 
>        |--------------------+---------|
>        | priority           | devices |
>        |--------------------+---------|
>        | MIG_PRI_IOMMU      |       3 |
>        | MIG_PRI_PCI_BUS    |       7 |
>        | MIG_PRI_VIRTIO_MEM |       1 |
>        | MIG_PRI_GICV3_ITS  |       1 |
>        | MIG_PRI_GICV3      |       1 |
>        |--------------------+---------|
> 
> All the rest of the devices are using the default (0) priority.
> 
> > 
> > Can we at least ensure devices with the same priority won't be
> > reordered, just to be safe?  (qsort() doesn't guarantee that)
> > 
> > If very few device types have non-default vmsd priority and
> > devices with the same priority aren't reordered, the risk of
> > compatibility breakage would be much smaller.
> 
> I'm also wondering whether it might actually be a good thing to break some
> guest ABI with this change, where possible.
> 
> Let's imagine something breaks after this is applied; the only reason would be
> that qsort() changed the order of some same-priority devices so it's no longer
> the order the user specified.  But then, doesn't that also mean there's yet
> another ordering requirement that we didn't even notice?

Does the PCI slot assignment example above demonstrate an ordering requirement?
I don't think it does.  It just demonstrates that different orderings are
expected to give different results, but both orderings are valid.


> 
> I doubt whether that'll even happen (otherwise I think there would be reports
> already, as the qemu man page states no requirement on parameter ordering).  In
> any case, instead of "keeping same-priority devices in the same order as the
> user specified", IMHO we should give the broken devices different priorities,
> so the ordering is guaranteed by qemu internals rather than by how the user
> specified it.
> 
> From that pov, maybe it would be best for this patchset to be accepted and
> applied in the early stage of a release, so we can figure out what's missing
> and fix it within the same release.  However, again, I still doubt whether any
> user will break in a bad way.
> 
> Thanks,
> 
> -- 
> Peter Xu
> 

-- 
Eduardo




* Re: [PATCH 4/4] vl: Prioritize realizations of devices
  2021-08-23 21:07       ` Eduardo Habkost
@ 2021-08-23 21:31         ` Peter Xu
  2021-08-23 21:54           ` Michael S. Tsirkin
  2021-08-23 21:56           ` Eduardo Habkost
  0 siblings, 2 replies; 52+ messages in thread
From: Peter Xu @ 2021-08-23 21:31 UTC (permalink / raw)
  To: Eduardo Habkost
  Cc: Daniel P . Berrangé,
	Michael S . Tsirkin, Jason Wang, qemu-devel, Markus Armbruster,
	Eric Auger, Alex Williamson, Paolo Bonzini,
	Dr . David Alan Gilbert

On Mon, Aug 23, 2021 at 05:07:03PM -0400, Eduardo Habkost wrote:
> To give just one example:
> 
> $ (echo 'info pci';echo quit;) | qemu-system-x86_64 -device virtio-net-pci -device e1000e -monitor stdio | tail -n 20
>   Bus  0, device   4, function 0:
>     Ethernet controller: PCI device 1af4:1000
>       PCI subsystem 1af4:0001
>       IRQ 0, pin A
>       BAR0: I/O at 0xffffffffffffffff [0x001e].
>       BAR1: 32 bit memory at 0xffffffffffffffff [0x00000ffe].
>       BAR4: 64 bit prefetchable memory at 0xffffffffffffffff [0x00003ffe].
>       BAR6: 32 bit memory at 0xffffffffffffffff [0x0003fffe].
>       id ""
>   Bus  0, device   5, function 0:
>     Ethernet controller: PCI device 8086:10d3
>       PCI subsystem 8086:0000
>       IRQ 0, pin A
>       BAR0: 32 bit memory at 0xffffffffffffffff [0x0001fffe].
>       BAR1: 32 bit memory at 0xffffffffffffffff [0x0001fffe].
>       BAR2: I/O at 0xffffffffffffffff [0x001e].
>       BAR3: 32 bit memory at 0xffffffffffffffff [0x00003ffe].
>       BAR6: 32 bit memory at 0xffffffffffffffff [0x0003fffe].
>       id ""
> (qemu) quit
> $ (echo 'info pci';echo quit;) | qemu-system-x86_64 -device e1000e -device virtio-net-pci -monitor stdio | tail -n 20
>   Bus  0, device   4, function 0:
>     Ethernet controller: PCI device 8086:10d3
>       PCI subsystem 8086:0000
>       IRQ 0, pin A
>       BAR0: 32 bit memory at 0xffffffffffffffff [0x0001fffe].
>       BAR1: 32 bit memory at 0xffffffffffffffff [0x0001fffe].
>       BAR2: I/O at 0xffffffffffffffff [0x001e].
>       BAR3: 32 bit memory at 0xffffffffffffffff [0x00003ffe].
>       BAR6: 32 bit memory at 0xffffffffffffffff [0x0003fffe].
>       id ""
>   Bus  0, device   5, function 0:
>     Ethernet controller: PCI device 1af4:1000
>       PCI subsystem 1af4:0001
>       IRQ 0, pin A
>       BAR0: I/O at 0xffffffffffffffff [0x001e].
>       BAR1: 32 bit memory at 0xffffffffffffffff [0x00000ffe].
>       BAR4: 64 bit prefetchable memory at 0xffffffffffffffff [0x00003ffe].
>       BAR6: 32 bit memory at 0xffffffffffffffff [0x0003fffe].
>       id ""
> (qemu) quit
> 
> 
> If the order of the -device arguments changes, the devices are assigned to
> different PCI slots.

Thanks for the example.

Initially I thought about this and didn't think it an issue (because serious
users will always specify addr=XXX for -device; I thought libvirt always does
that), but I do remember that a guest OS can identify its hardware config by
devfn number, so nmcli may indeed mess up its config across this change.

I can use a custom sort to replace qsort() to guarantee that.

Do you have other examples in mind that I may have overlooked, especially
ones I may not be able to fix with a custom sort that only moves priority>=1
devices?

Thanks,

-- 
Peter Xu




* Re: [PATCH 4/4] vl: Prioritize realizations of devices
  2021-08-23 21:31         ` Peter Xu
@ 2021-08-23 21:54           ` Michael S. Tsirkin
  2021-08-23 22:51             ` Peter Xu
  2021-08-23 21:56           ` Eduardo Habkost
  1 sibling, 1 reply; 52+ messages in thread
From: Michael S. Tsirkin @ 2021-08-23 21:54 UTC (permalink / raw)
  To: Peter Xu
  Cc: Daniel P . Berrangé,
	Eduardo Habkost, Jason Wang, Markus Armbruster, qemu-devel,
	Eric Auger, Alex Williamson, Paolo Bonzini,
	Dr . David Alan Gilbert

On Mon, Aug 23, 2021 at 05:31:46PM -0400, Peter Xu wrote:
> On Mon, Aug 23, 2021 at 05:07:03PM -0400, Eduardo Habkost wrote:
> > To give just one example:
> > 
> > $ (echo 'info pci';echo quit;) | qemu-system-x86_64 -device virtio-net-pci -device e1000e -monitor stdio | tail -n 20
> >   Bus  0, device   4, function 0:
> >     Ethernet controller: PCI device 1af4:1000
> >       PCI subsystem 1af4:0001
> >       IRQ 0, pin A
> >       BAR0: I/O at 0xffffffffffffffff [0x001e].
> >       BAR1: 32 bit memory at 0xffffffffffffffff [0x00000ffe].
> >       BAR4: 64 bit prefetchable memory at 0xffffffffffffffff [0x00003ffe].
> >       BAR6: 32 bit memory at 0xffffffffffffffff [0x0003fffe].
> >       id ""
> >   Bus  0, device   5, function 0:
> >     Ethernet controller: PCI device 8086:10d3
> >       PCI subsystem 8086:0000
> >       IRQ 0, pin A
> >       BAR0: 32 bit memory at 0xffffffffffffffff [0x0001fffe].
> >       BAR1: 32 bit memory at 0xffffffffffffffff [0x0001fffe].
> >       BAR2: I/O at 0xffffffffffffffff [0x001e].
> >       BAR3: 32 bit memory at 0xffffffffffffffff [0x00003ffe].
> >       BAR6: 32 bit memory at 0xffffffffffffffff [0x0003fffe].
> >       id ""
> > (qemu) quit
> > $ (echo 'info pci';echo quit;) | qemu-system-x86_64 -device e1000e -device virtio-net-pci -monitor stdio | tail -n 20
> >   Bus  0, device   4, function 0:
> >     Ethernet controller: PCI device 8086:10d3
> >       PCI subsystem 8086:0000
> >       IRQ 0, pin A
> >       BAR0: 32 bit memory at 0xffffffffffffffff [0x0001fffe].
> >       BAR1: 32 bit memory at 0xffffffffffffffff [0x0001fffe].
> >       BAR2: I/O at 0xffffffffffffffff [0x001e].
> >       BAR3: 32 bit memory at 0xffffffffffffffff [0x00003ffe].
> >       BAR6: 32 bit memory at 0xffffffffffffffff [0x0003fffe].
> >       id ""
> >   Bus  0, device   5, function 0:
> >     Ethernet controller: PCI device 1af4:1000
> >       PCI subsystem 1af4:0001
> >       IRQ 0, pin A
> >       BAR0: I/O at 0xffffffffffffffff [0x001e].
> >       BAR1: 32 bit memory at 0xffffffffffffffff [0x00000ffe].
> >       BAR4: 64 bit prefetchable memory at 0xffffffffffffffff [0x00003ffe].
> >       BAR6: 32 bit memory at 0xffffffffffffffff [0x0003fffe].
> >       id ""
> > (qemu) quit
> > 
> > 
> > If the order of the -device arguments changes, the devices are assigned to
> > different PCI slots.
> 
> Thanks for the example.
> 
> Initially I thought about this and didn't think it an issue (because serious
> users will always specify addr=XXX for -device; I thought libvirt always does
> that), but I do remember that a guest OS can identify its hardware config by
> devfn number, so nmcli may indeed mess up its config across this change.
> 
> I can use a custom sort to replace qsort() to guarantee that.


You don't have to do that. Simply use the device position on the command
line for comparisons when priority is the same.
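
A sketch of that idea: extend the QemuOptsSortEntry from patch 2 with an index
field recording the cmdline position (the extra field is an assumption, not
something in the posted series), and use it as the tie-breaker:

  struct QemuOptsSortEntry {
      QemuOpts *opts;
      int priority;
      int index;              /* position on the cmdline */
  };

  static int qemu_opts_cmp_fn(const void *a, const void *b)
  {
      const struct QemuOptsSortEntry *e1 = a, *e2 = b;

      if (e1->priority != e2->priority) {
          return e1->priority - e2->priority;
      }
      /* Same priority: preserve the user-specified cmdline order */
      return e1->index - e2->index;
  }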


> Do you have other examples in mind that I may have overlooked, especially
> ones I may not be able to fix with a custom sort that only moves priority>=1
> devices?
> 
> Thanks,

> -- 
> Peter Xu




* Re: [PATCH 4/4] vl: Prioritize realizations of devices
  2021-08-23 21:31         ` Peter Xu
  2021-08-23 21:54           ` Michael S. Tsirkin
@ 2021-08-23 21:56           ` Eduardo Habkost
  2021-08-23 23:05             ` Peter Xu
  1 sibling, 1 reply; 52+ messages in thread
From: Eduardo Habkost @ 2021-08-23 21:56 UTC (permalink / raw)
  To: Peter Xu
  Cc: Daniel P . Berrangé,
	Michael S . Tsirkin, Jason Wang, qemu-devel, Markus Armbruster,
	Eric Auger, Alex Williamson, Paolo Bonzini,
	Dr . David Alan Gilbert

On Mon, Aug 23, 2021 at 05:31:46PM -0400, Peter Xu wrote:
> On Mon, Aug 23, 2021 at 05:07:03PM -0400, Eduardo Habkost wrote:
> > To give just one example:
> > 
> > $ (echo 'info pci';echo quit;) | qemu-system-x86_64 -device virtio-net-pci -device e1000e -monitor stdio | tail -n 20
> >   Bus  0, device   4, function 0:
> >     Ethernet controller: PCI device 1af4:1000
> >       PCI subsystem 1af4:0001
> >       IRQ 0, pin A
> >       BAR0: I/O at 0xffffffffffffffff [0x001e].
> >       BAR1: 32 bit memory at 0xffffffffffffffff [0x00000ffe].
> >       BAR4: 64 bit prefetchable memory at 0xffffffffffffffff [0x00003ffe].
> >       BAR6: 32 bit memory at 0xffffffffffffffff [0x0003fffe].
> >       id ""
> >   Bus  0, device   5, function 0:
> >     Ethernet controller: PCI device 8086:10d3
> >       PCI subsystem 8086:0000
> >       IRQ 0, pin A
> >       BAR0: 32 bit memory at 0xffffffffffffffff [0x0001fffe].
> >       BAR1: 32 bit memory at 0xffffffffffffffff [0x0001fffe].
> >       BAR2: I/O at 0xffffffffffffffff [0x001e].
> >       BAR3: 32 bit memory at 0xffffffffffffffff [0x00003ffe].
> >       BAR6: 32 bit memory at 0xffffffffffffffff [0x0003fffe].
> >       id ""
> > (qemu) quit
> > $ (echo 'info pci';echo quit;) | qemu-system-x86_64 -device e1000e -device virtio-net-pci -monitor stdio | tail -n 20
> >   Bus  0, device   4, function 0:
> >     Ethernet controller: PCI device 8086:10d3
> >       PCI subsystem 8086:0000
> >       IRQ 0, pin A
> >       BAR0: 32 bit memory at 0xffffffffffffffff [0x0001fffe].
> >       BAR1: 32 bit memory at 0xffffffffffffffff [0x0001fffe].
> >       BAR2: I/O at 0xffffffffffffffff [0x001e].
> >       BAR3: 32 bit memory at 0xffffffffffffffff [0x00003ffe].
> >       BAR6: 32 bit memory at 0xffffffffffffffff [0x0003fffe].
> >       id ""
> >   Bus  0, device   5, function 0:
> >     Ethernet controller: PCI device 1af4:1000
> >       PCI subsystem 1af4:0001
> >       IRQ 0, pin A
> >       BAR0: I/O at 0xffffffffffffffff [0x001e].
> >       BAR1: 32 bit memory at 0xffffffffffffffff [0x00000ffe].
> >       BAR4: 64 bit prefetchable memory at 0xffffffffffffffff [0x00003ffe].
> >       BAR6: 32 bit memory at 0xffffffffffffffff [0x0003fffe].
> >       id ""
> > (qemu) quit
> > 
> > 
> > If the order of the -device arguments changes, the devices are assigned to
> > different PCI slots.
> 
> Thanks for the example.
> 
> Initially I thought about this and didn't think it an issue (because serious
> users will always specify addr=XXX for -device; I thought libvirt always does
> that), but I do remember that a guest OS can identify its hardware config by
> devfn number, so nmcli may indeed mess up its config across this change.
> 
> I can use a custom sort to replace qsort() to guarantee that.
> 
> Do you have other examples in mind that I may have overlooked, especially
> ones I may not be able to fix with a custom sort that only moves priority>=1
> devices?

I don't have any other example, but I assume address assignment
based on ordering is a common pattern in device code.

I would take a very close and careful look at the devices with
non-default vmsd priority.  If you can prove that the 13 device
types with non-default priority are all order-insensitive, a
custom sort function as you describe might be safe.

-- 
Eduardo




* Re: [PATCH 4/4] vl: Prioritize realizations of devices
  2021-08-23 19:18     ` Peter Xu
  2021-08-23 21:07       ` Eduardo Habkost
@ 2021-08-23 22:05       ` Michael S. Tsirkin
  2021-08-23 22:36         ` Peter Xu
  2021-08-24 16:24         ` David Hildenbrand
  2021-08-24  2:51       ` Jason Wang
  2 siblings, 2 replies; 52+ messages in thread
From: Michael S. Tsirkin @ 2021-08-23 22:05 UTC (permalink / raw)
  To: Peter Xu
  Cc: Daniel P . Berrangé,
	Eduardo Habkost, David Hildenbrand, Jason Wang,
	Markus Armbruster, qemu-devel, Eric Auger, Alex Williamson,
	Paolo Bonzini, Dr . David Alan Gilbert

On Mon, Aug 23, 2021 at 03:18:51PM -0400, Peter Xu wrote:
> On Mon, Aug 23, 2021 at 02:49:12PM -0400, Eduardo Habkost wrote:
> > On Wed, Aug 18, 2021 at 03:43:18PM -0400, Peter Xu wrote:
> > > QEMU creates -device objects in the order specified on the user's cmdline.
> > > However that ordering may not be ideal.  For example, some platform
> > > devices (vIOMMUs) may need to be created earlier than most other
> > > devices (e.g., vfio-pci, virtio).
> > > 
> > > This patch sorts the QemuOptsList of '-device' entries before kicking off
> > > the device realizations.  This will allow the device realization code to
> > > use APIs like pci_device_iommu_address_space() correctly, because those
> > > functions rely on the platform devices being realized.
> > > 
> > > Now we rely on vmsd->priority, which is defined as a MigrationPriority, to
> > > provide the ordering, as both VM init and migration completion need such an
> > > ordering.  In the future we can move that priority information out of vmsd.
> > > 
> > > Signed-off-by: Peter Xu <peterx@redhat.com>
> > 
> > Can we be 100% sure that changing the ordering of every single
> > device being created won't affect guest ABI?  (I don't think we can)
> 
> That's a good question, however I doubt whether there's any real-world guest
> ABI for that.  As a developer, I normally specify cmdline parameters in an
> ad-hoc way, so I assume most parameters are not sensitive to ordering and I
> can tune the ordering as I wish.  I'm not sure whether that's common for qemu
> users; I would expect so, but I may have missed something that I'm not aware of.
> 
> To my knowledge the only "guest ABI" change is e.g. when we specify "vfio-pci"
> before "intel-iommu": it'll be constantly broken before this patchset, while
> after this series it'll work.  It's just that I don't think that "guest ABI"
> needs to be kept, and that's exactly what I want to fix with the patchset.
> 
> > 
> > How many device types in QEMU have non-default vmsd priority?
> 
> Not that many; here's the list of priorities and the devices using them:
> 
>        |--------------------+---------|
>        | priority           | devices |
>        |--------------------+---------|
>        | MIG_PRI_IOMMU      |       3 |
>        | MIG_PRI_PCI_BUS    |       7 |
>        | MIG_PRI_VIRTIO_MEM |       1 |
>        | MIG_PRI_GICV3_ITS  |       1 |
>        | MIG_PRI_GICV3      |       1 |
>        |--------------------+---------|

iommu is probably ok. I think virtio mem is ok too,
in that it is normally created by virtio-mem-pci ...



> All the rest of the devices are using the default (0) priority.
> 
> > 
> > Can we at least ensure devices with the same priority won't be
> > reordered, just to be safe?  (qsort() doesn't guarantee that)
> > 
> > If very few device types have non-default vmsd priority and
> > devices with the same priority aren't reordered, the risk of
> > compatibility breakage would be much smaller.
> 
> I'm also wondering whether it might actually be a good thing to break some
> guest ABI with this change, where possible.
> 
> Let's imagine something breaks after this is applied; the only reason would be
> that qsort() changed the order of some same-priority devices so it's no longer
> the order the user specified.  But then, doesn't that also mean there's yet
> another ordering requirement that we didn't even notice?
> 
> I doubt whether that'll even happen (otherwise I think there would be reports
> already, as the qemu man page states no requirement on parameter ordering).  In
> any case, instead of "keeping same-priority devices in the same order as the
> user specified", IMHO we should give the broken devices different priorities,
> so the ordering is guaranteed by qemu internals rather than by how the user
> specified it.

Well, giving the user control of the guest ABI is a reasonable thing to do;
it is the realize order that users do not really care about.

I guess we could move pci slot allocation out of realize
so it does not depend on realize order?


> From that pov, maybe it would be best for this patchset to be accepted and
> applied in the early stage of a release, so we can figure out what's missing
> and fix it within the same release.  However, again, I still doubt whether any
> user will break in a bad way.
> 
> Thanks,
> 
> -- 
> Peter Xu




* Re: [PATCH 4/4] vl: Prioritize realizations of devices
  2021-08-23 22:05       ` Michael S. Tsirkin
@ 2021-08-23 22:36         ` Peter Xu
  2021-08-24  2:52           ` Jason Wang
  2021-08-24 16:24         ` David Hildenbrand
  1 sibling, 1 reply; 52+ messages in thread
From: Peter Xu @ 2021-08-23 22:36 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Daniel P . Berrangé,
	Eduardo Habkost, David Hildenbrand, Jason Wang,
	Markus Armbruster, qemu-devel, Eric Auger, Alex Williamson,
	Paolo Bonzini, Dr . David Alan Gilbert

On Mon, Aug 23, 2021 at 06:05:07PM -0400, Michael S. Tsirkin wrote:
> On Mon, Aug 23, 2021 at 03:18:51PM -0400, Peter Xu wrote:
> > On Mon, Aug 23, 2021 at 02:49:12PM -0400, Eduardo Habkost wrote:
> > > On Wed, Aug 18, 2021 at 03:43:18PM -0400, Peter Xu wrote:
> > > > QEMU creates -device objects in the order specified on the user's cmdline.
> > > > However that ordering may not be ideal.  For example, some platform
> > > > devices (vIOMMUs) may need to be created earlier than most other
> > > > devices (e.g., vfio-pci, virtio).
> > > > 
> > > > This patch sorts the QemuOptsList of '-device' entries before kicking off
> > > > the device realizations.  This will allow the device realization code to
> > > > use APIs like pci_device_iommu_address_space() correctly, because those
> > > > functions rely on the platform devices being realized.
> > > > 
> > > > Now we rely on vmsd->priority, which is defined as a MigrationPriority, to
> > > > provide the ordering, as both VM init and migration completion need such an
> > > > ordering.  In the future we can move that priority information out of vmsd.
> > > > 
> > > > Signed-off-by: Peter Xu <peterx@redhat.com>
> > > 
> > > Can we be 100% sure that changing the ordering of every single
> > > device being created won't affect guest ABI?  (I don't think we can)
> > 
> > That's a good question, however I doubt whether there's any real-world guest
> > ABI for that.  As a developer, I normally specify cmdline parameters in an
> > ad-hoc way, so I assume most parameters are not sensitive to ordering and I
> > can tune the ordering as I wish.  I'm not sure whether that's common for qemu
> > users; I would expect so, but I may have missed something that I'm not aware of.
> > 
> > To my knowledge the only "guest ABI" change is e.g. when we specify "vfio-pci"
> > before "intel-iommu": it'll be constantly broken before this patchset, while
> > after this series it'll work.  It's just that I don't think that "guest ABI"
> > needs to be kept, and that's exactly what I want to fix with the patchset.
> > 
> > > 
> > > How many device types in QEMU have non-default vmsd priority?
> > 
> > Not that many; here's the list of priorities and the devices using them:
> > 
> >        |--------------------+---------|
> >        | priority           | devices |
> >        |--------------------+---------|
> >        | MIG_PRI_IOMMU      |       3 |
> >        | MIG_PRI_PCI_BUS    |       7 |
> >        | MIG_PRI_VIRTIO_MEM |       1 |
> >        | MIG_PRI_GICV3_ITS  |       1 |
> >        | MIG_PRI_GICV3      |       1 |
> >        |--------------------+---------|
> 
> iommu is probably ok. I think virtio mem is ok too,
> in that it is normally created by virtio-mem-pci ...

Hmm, this makes me wonder whether virtio-mem-pci could get a different devfn
allocated after being reordered.

But frankly I still doubt whether we should guarantee that guest ABI when the
user doesn't specify addr=XXX in the pci device parameters - I feel like it's
a burden that we don't need to carry.

(Btw, trying to keep the order is one thing; declaring it guest ABI would be
 another thing to me)

> 
> 
> 
> > All the rest of the devices are using the default (0) priority.
> > 
> > > 
> > > Can we at least ensure devices with the same priority won't be
> > > reordered, just to be safe?  (qsort() doesn't guarantee that)
> > > 
> > > If very few device types have non-default vmsd priority and
> > > devices with the same priority aren't reordered, the risk of
> > > compatibility breakage would be much smaller.
> > 
> > I'm also wondering whether it might actually be a good thing to break some
> > guest ABI with this change, where possible.
> > 
> > Let's imagine something breaks after this is applied; the only reason would be
> > that qsort() changed the order of some same-priority devices so it's no longer
> > the order the user specified.  But then, doesn't that also mean there's yet
> > another ordering requirement that we didn't even notice?
> > 
> > I doubt whether that'll even happen (otherwise I think there would be reports
> > already, as the qemu man page states no requirement on parameter ordering).  In
> > any case, instead of "keeping same-priority devices in the same order as the
> > user specified", IMHO we should give the broken devices different priorities,
> > so the ordering is guaranteed by qemu internals rather than by how the user
> > specified it.
> 
> Well, giving the user control of the guest ABI is a reasonable thing to do;
> it is the realize order that users do not really care about.

Makes sense.

> 
> I guess we could move pci slot allocation out of realize
> so it does not depend on realize order?

Yes that sounds like another approach, but it seems to require more changes.

Thanks,

-- 
Peter Xu




* Re: [PATCH 4/4] vl: Prioritize realizations of devices
  2021-08-23 21:54           ` Michael S. Tsirkin
@ 2021-08-23 22:51             ` Peter Xu
  0 siblings, 0 replies; 52+ messages in thread
From: Peter Xu @ 2021-08-23 22:51 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Daniel P . Berrangé,
	Eduardo Habkost, Jason Wang, Markus Armbruster, qemu-devel,
	Eric Auger, Alex Williamson, Paolo Bonzini,
	Dr . David Alan Gilbert

On Mon, Aug 23, 2021 at 05:54:44PM -0400, Michael S. Tsirkin wrote:
> > I can use a custom sort to replace qsort() to guarantee that.
> You don't have to do that. Simply use the device position on the command
> line for comparisons when priority is the same.

Indeed. :) Thanks,

-- 
Peter Xu




* Re: [PATCH 4/4] vl: Prioritize realizations of devices
  2021-08-23 21:56           ` Eduardo Habkost
@ 2021-08-23 23:05             ` Peter Xu
  2021-08-25  9:39               ` Markus Armbruster
  0 siblings, 1 reply; 52+ messages in thread
From: Peter Xu @ 2021-08-23 23:05 UTC (permalink / raw)
  To: Eduardo Habkost
  Cc: Daniel P . Berrangé,
	Michael S . Tsirkin, Jason Wang, qemu-devel, Markus Armbruster,
	Eric Auger, Alex Williamson, Paolo Bonzini,
	Dr . David Alan Gilbert

On Mon, Aug 23, 2021 at 05:56:23PM -0400, Eduardo Habkost wrote:
> I don't have any other example, but I assume address assignment
> based on ordering is a common pattern in device code.
> 
> I would take a very close and careful look at the devices with
> non-default vmsd priority.  If you can prove that the 13 device
> types with non-default priority are all order-insensitive, a
> custom sort function as you describe might be safe.

Besides virtio-mem-pci, there'll also be a similar devfn issue with all the
MIG_PRI_PCI_BUS devices, as their slots are allocated just like those of other
pci devices.  Say, the two cmdlines below will generate different pci
topologies too:

  $ qemu-system-x86_64 -device pcie-root-port,chassis=0 \
                       -device pcie-root-port,chassis=1 \
                       -device virtio-net-pci

And:

  $ qemu-system-x86_64 -device pcie-root-port,chassis=0 \
                       -device virtio-net-pci \
                       -device pcie-root-port,chassis=1

This cannot be solved merely by preserving the relative order of priority==0
devices.

On second thought, I think I was initially wrong to see migration priority and
device realization as the same problem.

For example, for live migration we require PCI_BUS to be migrated earlier than
MIG_PRI_IOMMU because the IOMMU relies on the bus numbers to find its address
spaces.  However that's definitely not a requirement for device realization;
say, realizing the vIOMMU after the pci buses is fine (bus numbers are
assigned during BIOS).

I probably mixed the two ideas up (though they really look alike!).  Sorry
about that.

Since the only ordering constraint so far is IOMMU vs all the rest of the
devices, I'll introduce a new priority mechanism and only make sure vIOMMUs
are realized earlier.  That'll also avoid other implications for pci devfn
allocation.
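
As a rough sketch of that direction (realize_priority is a hypothetical new
DeviceClass field, not something that exists yet):

  static int qemu_opts_device_priority(QemuOpts *opts)
  {
      const char *driver = qemu_opt_get(opts, "driver");
      DeviceClass *dc = driver ? qdev_get_device_class(&driver, NULL) : NULL;

      /* Only vIOMMUs would set realize_priority; all other devices keep 0 */
      return dc ? -dc->realize_priority : 0;
  }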

Will rework a new version tomorrow.  Thanks a lot for all the comments,

-- 
Peter Xu




* Re: [PATCH 4/4] vl: Prioritize realizations of devices
  2021-08-23 19:18     ` Peter Xu
  2021-08-23 21:07       ` Eduardo Habkost
  2021-08-23 22:05       ` Michael S. Tsirkin
@ 2021-08-24  2:51       ` Jason Wang
  2 siblings, 0 replies; 52+ messages in thread
From: Jason Wang @ 2021-08-24  2:51 UTC (permalink / raw)
  To: Peter Xu
  Cc: Daniel P . Berrangé,
	Eduardo Habkost, Michael S . Tsirkin, Markus Armbruster,
	qemu-devel, Eric Auger, Alex Williamson, Paolo Bonzini,
	Dr . David Alan Gilbert

On Tue, Aug 24, 2021 at 3:18 AM Peter Xu <peterx@redhat.com> wrote:
>
> On Mon, Aug 23, 2021 at 02:49:12PM -0400, Eduardo Habkost wrote:
> > On Wed, Aug 18, 2021 at 03:43:18PM -0400, Peter Xu wrote:
> > > QEMU creates -device objects in the order specified on the user's cmdline.
> > > However that ordering may not be ideal.  For example, some platform
> > > devices (vIOMMUs) may need to be created earlier than most other
> > > devices (e.g., vfio-pci, virtio).
> > >
> > > This patch sorts the QemuOptsList of '-device' entries before kicking off
> > > the device realizations.  This will allow the device realization code to
> > > use APIs like pci_device_iommu_address_space() correctly, because those
> > > functions rely on the platform devices being realized.
> > >
> > > Now we rely on vmsd->priority, which is defined as a MigrationPriority, to
> > > provide the ordering, as both VM init and migration completion need such an
> > > ordering.  In the future we can move that priority information out of vmsd.
> > >
> > > Signed-off-by: Peter Xu <peterx@redhat.com>
> >
> > Can we be 100% sure that changing the ordering of every single
> > device being created won't affect guest ABI?  (I don't think we can)
>
> That's a good question, however I doubt whether there's any real-world guest
> ABI for that.  As a developer, I normally specify cmdline parameters in an
> ad-hoc way, so I assume most parameters are not sensitive to ordering and I
> can tune the ordering as I wish.  I'm not sure whether that's common for qemu
> users; I would expect so, but I may have missed something that I'm not aware of.
>
> To my knowledge the only "guest ABI" change is e.g. when we specify "vfio-pci"
> before "intel-iommu": it'll be constantly broken before this patchset, while
> after this series it'll work.  It's just that I don't think that "guest ABI"
> needs to be kept, and that's exactly what I want to fix with the patchset.

Yes, and I wonder whether, if we limit this to new machine types, we even
need to care about the ABI stuff.

Thanks

>
> >
> > How many device types in QEMU have non-default vmsd priority?
>
> Not that many; here's the list of priorities and the devices using them:
>
>        |--------------------+---------|
>        | priority           | devices |
>        |--------------------+---------|
>        | MIG_PRI_IOMMU      |       3 |
>        | MIG_PRI_PCI_BUS    |       7 |
>        | MIG_PRI_VIRTIO_MEM |       1 |
>        | MIG_PRI_GICV3_ITS  |       1 |
>        | MIG_PRI_GICV3      |       1 |
>        |--------------------+---------|
>
> All the rest of the devices are using the default (0) priority.
>
> >
> > Can we at least ensure devices with the same priority won't be
> > reordered, just to be safe?  (qsort() doesn't guarantee that)
> >
> > If very few device types have non-default vmsd priority and
> > devices with the same priority aren't reordered, the risk of
> > compatibility breakage would be much smaller.
>
> I'm also wondering whether it might actually be a good thing to break some
> guest ABI with this change, where possible.
>
> Let's imagine something breaks after this is applied; the only reason would be
> that qsort() changed the order of some same-priority devices so it's no longer
> the order the user specified.  But then, doesn't that also mean there's yet
> another ordering requirement that we didn't even notice?
>
> I doubt whether that'll even happen (otherwise I think there would be reports
> already, as the qemu man page states no requirement on parameter ordering).  In
> any case, instead of "keeping same-priority devices in the same order as the
> user specified", IMHO we should give the broken devices different priorities,
> so the ordering is guaranteed by qemu internals rather than by how the user
> specified it.
>
> From that pov, maybe it would be best for this patchset to be accepted and
> applied in the early stage of a release, so we can figure out what's missing
> and fix it within the same release.  However, again, I still doubt whether any
> user will break in a bad way.
>
> Thanks,
>
> --
> Peter Xu
>



^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 4/4] vl: Prioritize realizations of devices
  2021-08-23 22:36         ` Peter Xu
@ 2021-08-24  2:52           ` Jason Wang
  2021-08-24 15:50             ` Peter Xu
  0 siblings, 1 reply; 52+ messages in thread
From: Jason Wang @ 2021-08-24  2:52 UTC (permalink / raw)
  To: Peter Xu
  Cc: Daniel P . Berrangé,
	Eduardo Habkost, Michael S. Tsirkin, David Hildenbrand,
	qemu-devel, Markus Armbruster, Eric Auger, Alex Williamson,
	Paolo Bonzini, Dr . David Alan Gilbert

On Tue, Aug 24, 2021 at 6:37 AM Peter Xu <peterx@redhat.com> wrote:
>
> On Mon, Aug 23, 2021 at 06:05:07PM -0400, Michael S. Tsirkin wrote:
> > On Mon, Aug 23, 2021 at 03:18:51PM -0400, Peter Xu wrote:
> > > On Mon, Aug 23, 2021 at 02:49:12PM -0400, Eduardo Habkost wrote:
> > > > On Wed, Aug 18, 2021 at 03:43:18PM -0400, Peter Xu wrote:
> > > > > QEMU creates -device objects in order as specified by the user's cmdline.
> > > > > However that ordering may not be the ideal order.  For example, some platform
> > > > > devices (vIOMMUs) may want to be created earlier than most other
> > > > > devices (e.g., vfio-pci, virtio).
> > > > >
> > > > > This patch orders the QemuOptsList of '-device's so they'll be sorted
> > > > > before kicking off the device realizations.  This will allow the device
> > > > > realization code to use APIs like pci_device_iommu_address_space()
> > > > > correctly, because those functions rely on the platform devices being realized.
> > > > >
> > > > > Now we rely on vmsd->priority, which is defined as MigrationPriority, to provide
> > > > > the ordering, as both VM init and migration completion need such an
> > > > > ordering.  In the future we can move that priority information out of vmsd.
> > > > >
> > > > > Signed-off-by: Peter Xu <peterx@redhat.com>
> > > >
> > > > Can we be 100% sure that changing the ordering of every single
> > > > device being created won't affect guest ABI?  (I don't think we can)
> > >
> > > That's a good question; however, I doubt whether there's any real-world guest
> > > ABI dependency on that.  As a developer, I normally specify cmdline parameters in
> > > an ad-hoc way, assuming most parameters are not sensitive to ordering so that I
> > > can tune the ordering as I wish.  I'm not sure whether that's common for qemu
> > > users; I would expect so, but I may have missed something that I'm not aware of.
> > >
> > > To my knowledge the only "guest ABI" change is e.g. when we specify "vfio-pci"
> > > before "intel-iommu": it was consistently broken before this patchset,
> > > while after this series it works.  It's just that I don't think that
> > > "guest ABI" needs to be kept, and that's exactly what I want to fix with
> > > the patchset..
> > >
> > > >
> > > > How many device types in QEMU have non-default vmsd priority?
> > >
> > > Not that many; here's the list of priorities and the devices using them:
> > >
> > >        |--------------------+---------|
> > >        | priority           | devices |
> > >        |--------------------+---------|
> > >        | MIG_PRI_IOMMU      |       3 |
> > >        | MIG_PRI_PCI_BUS    |       7 |
> > >        | MIG_PRI_VIRTIO_MEM |       1 |
> > >        | MIG_PRI_GICV3_ITS  |       1 |
> > >        | MIG_PRI_GICV3      |       1 |
> > >        |--------------------+---------|
> >
> > iommu is probably ok. I think virtio mem is ok too,
> > in that it is normally created by virtio-mem-pci ...
>
> Hmm, this makes me wonder whether virtio-mem-pci could get a different devfn
> allocated after being moved..
>
> But frankly I still doubt whether we should guarantee that guest ABI when the
> user doesn't specify addr=XXX in pci device parameters - it feels like a burden
> that we don't need to carry.
>
> (Btw, trying to keep the order is one thing; declaring it guest ABI would be
>  another thing to me)
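>
> (For context: when addr= is absent, the devfn comes from a first-free-slot
>  scan at realize time - roughly this, simplified from memory from
>  do_pci_register_device() in hw/pci/pci.c:
>
>     if (devfn < 0) {
>         for (devfn = bus->devfn_min; devfn < ARRAY_SIZE(bus->devices);
>              devfn += PCI_FUNC_MAX) {
>             if (pci_bus_devfn_available(bus, devfn) &&
>                 !pci_bus_devfn_reserved(bus, devfn)) {
>                 goto found;             /* first free slot wins */
>             }
>         }
>         error_setg(errp, "PCI: no slot/function available for %s,"
>                    " all in use or reserved", name);
>
>  which is why realizing the same devices in a different order can hand out
>  different slots.)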
>
> >
> >
> >
> > > All the other devices use the default (0) priority.
> > >
> > > >
> > > > Can we at least ensure devices with the same priority won't be
> > > > reordered, just to be safe?  (qsort() doesn't guarantee that)
> > > >
> > > > If very few device types have non-default vmsd priority and
> > > > devices with the same priority aren't reordered, the risk of
> > > > compatibility breakage would be much smaller.
> > >
> > > I'm also wondering whether breaking some guest ABI due to this change might
> > > even be a good thing, if it happens at all.
> > >
> > > Let's imagine something breaks after this is applied; then the only reason
> > > should be that qsort() changed the order of some same-priority devices so it's
> > > no longer the order the user specified.  But then, doesn't that also mean
> > > there's yet another ordering requirement that we didn't even notice?
> > >
> > > I doubt whether that'll even happen (otherwise I'd expect reports already, as
> > > the qemu man page states no requirement on parameter ordering).  In any case,
> > > instead of "keeping the same-priority devices in the same order as the user
> > > has specified", IMHO we should give the broken devices different priorities
> > > so the ordering is guaranteed by qemu internals, rather than by how the user
> > > specified it.
> >
> > Well, giving the user control of the guest ABI is a reasonable thing to do;
> > it is the realize order that users do not really care about.
>
> Makes sense.
>
> >
> > I guess we could move pci slot allocation out of realize
> > so it does not depend on realize order?
>
> Yes that sounds like another approach, but it seems to require more changes.

It looks to me like this doesn't solve the issue of using virtio-mmio with vhost?

Thanks

>
> Thanks,
>
> --
> Peter Xu
>



^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 4/4] vl: Prioritize realizations of devices
  2021-08-24  2:52           ` Jason Wang
@ 2021-08-24 15:50             ` Peter Xu
  2021-08-25  4:23               ` Jason Wang
  0 siblings, 1 reply; 52+ messages in thread
From: Peter Xu @ 2021-08-24 15:50 UTC (permalink / raw)
  To: Jason Wang
  Cc: Daniel P . Berrangé,
	Eduardo Habkost, Michael S. Tsirkin, David Hildenbrand,
	qemu-devel, Markus Armbruster, Eric Auger, Alex Williamson,
	Paolo Bonzini, Dr . David Alan Gilbert

On Tue, Aug 24, 2021 at 10:52:24AM +0800, Jason Wang wrote:
> > It looks to me like this doesn't solve the issue of using virtio-mmio with vhost?

There's no IOMMU support for any of the MMIO devices, right?  Or am I wrong?  Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 4/4] vl: Prioritize realizations of devices
  2021-08-23 22:05       ` Michael S. Tsirkin
  2021-08-23 22:36         ` Peter Xu
@ 2021-08-24 16:24         ` David Hildenbrand
  2021-08-24 19:52           ` Peter Xu
  1 sibling, 1 reply; 52+ messages in thread
From: David Hildenbrand @ 2021-08-24 16:24 UTC (permalink / raw)
  To: Michael S. Tsirkin, Peter Xu
  Cc: Daniel P. Berrangé,
	Eduardo Habkost, Jason Wang, Markus Armbruster, qemu-devel,
	Eric Auger, Alex Williamson, Paolo Bonzini,
	Dr . David Alan Gilbert

On 24.08.21 00:05, Michael S. Tsirkin wrote:
> On Mon, Aug 23, 2021 at 03:18:51PM -0400, Peter Xu wrote:
>> On Mon, Aug 23, 2021 at 02:49:12PM -0400, Eduardo Habkost wrote:
>>> On Wed, Aug 18, 2021 at 03:43:18PM -0400, Peter Xu wrote:
>>>> QEMU creates -device objects in order as specified by the user's cmdline.
>>>> However that ordering may not be the ideal order.  For example, some platform
>>>> devices (vIOMMUs) may want to be created earlier than most other
>>>> devices (e.g., vfio-pci, virtio).
>>>>
>>>> This patch orders the QemuOptsList of '-device's so they'll be sorted
>>>> before kicking off the device realizations.  This will allow the device
>>>> realization code to use APIs like pci_device_iommu_address_space()
>>>> correctly, because those functions rely on the platform devices being realized.
>>>>
>>>> Now we rely on vmsd->priority, which is defined as MigrationPriority, to provide
>>>> the ordering, as both VM init and migration completion need such an
>>>> ordering.  In the future we can move that priority information out of vmsd.
>>>>
>>>> Signed-off-by: Peter Xu <peterx@redhat.com>
>>>
>>> Can we be 100% sure that changing the ordering of every single
>>> device being created won't affect guest ABI?  (I don't think we can)
>>
>> That's a good question; however, I doubt whether there's any real-world guest
>> ABI dependency on that.  As a developer, I normally specify cmdline parameters in
>> an ad-hoc way, assuming most parameters are not sensitive to ordering so that I
>> can tune the ordering as I wish.  I'm not sure whether that's common for qemu
>> users; I would expect so, but I may have missed something that I'm not aware of.
>>
>> To my knowledge the only "guest ABI" change is e.g. when we specify "vfio-pci"
>> before "intel-iommu": it was consistently broken before this patchset,
>> while after this series it works.  It's just that I don't think that
>> "guest ABI" needs to be kept, and that's exactly what I want to fix with
>> the patchset..
>>
>>>
>>> How many device types in QEMU have non-default vmsd priority?
>>
>> Not that many; here's the list of priorities and the devices using them:
>>
>>         |--------------------+---------|
>>         | priority           | devices |
>>         |--------------------+---------|
>>         | MIG_PRI_IOMMU      |       3 |
>>         | MIG_PRI_PCI_BUS    |       7 |
>>         | MIG_PRI_VIRTIO_MEM |       1 |
>>         | MIG_PRI_GICV3_ITS  |       1 |
>>         | MIG_PRI_GICV3      |       1 |
>>         |--------------------+---------|
> 
> iommu is probably ok. I think virtio mem is ok too,
> in that it is normally created by virtio-mem-pci ...

IIRC:

intel-iommu has to be created on the QEMU cmdline before creating 
virtio-mem-pci.

-device intel-iommu,caching-mode=on,intremap=on,device-iotlb=on \
...
-device 
virtio-mem-pci,disable-legacy=on,disable-modern=off,iommu_platform=on,ats=on,id=vm0,...

Creating virtio-mem-pci will implicitly create virtio-mem. virtio-mem 
device state has to be migrated before migrating intel-iommu state.

I do wonder if migration priorities are really what we want to reuse 
here. I guess it works right, but just by pure luck (because we ignore 
the implicit dependency regarding priorities)

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 4/4] vl: Prioritize realizations of devices
  2021-08-24 16:24         ` David Hildenbrand
@ 2021-08-24 19:52           ` Peter Xu
  2021-08-25  8:08             ` David Hildenbrand
  0 siblings, 1 reply; 52+ messages in thread
From: Peter Xu @ 2021-08-24 19:52 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Daniel P. Berrangé,
	Eduardo Habkost, Michael S. Tsirkin, Jason Wang, qemu-devel,
	Markus Armbruster, Eric Auger, Alex Williamson, Paolo Bonzini,
	Dr . David Alan Gilbert

On Tue, Aug 24, 2021 at 06:24:27PM +0200, David Hildenbrand wrote:
> > > Not that many; here's the list of priorities and the devices using them:
> > > 
> > >         |--------------------+---------|
> > >         | priority           | devices |
> > >         |--------------------+---------|
> > >         | MIG_PRI_IOMMU      |       3 |
> > >         | MIG_PRI_PCI_BUS    |       7 |
> > >         | MIG_PRI_VIRTIO_MEM |       1 |
> > >         | MIG_PRI_GICV3_ITS  |       1 |
> > >         | MIG_PRI_GICV3      |       1 |
> > >         |--------------------+---------|
> > 
> > iommu is probably ok. I think virtio mem is ok too,
> > in that it is normally created by virtio-mem-pci ...
> 
> IIRC:
> 
> intel-iommu has to be created on the QEMU cmdline before creating
> virtio-mem-pci.
> 
> -device intel-iommu,caching-mode=on,intremap=on,device-iotlb=on \
> ...
> -device virtio-mem-pci,disable-legacy=on,disable-modern=off,iommu_platform=on,ats=on,id=vm0,...
> 
> Creating virtio-mem-pci will implicitly create virtio-mem. virtio-mem device
> state has to be migrated before migrating intel-iommu state.

Since we're at it.. frankly I didn't fully digest why virtio-mem needs to be
migrated first when reading commit 0fd7616e0f1171b.  It gives me the feeling
that virtio-mem's ordering dependency is with vfio-pci rather than with the
vIOMMU itself, but I could be wrong.

E.g., the IOMMU unit shouldn't be walking page tables when no device is using
it; then nothing would fail until a device walks them in its region_add() hooks
when memory replay happens.
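
The path I have in mind is roughly this - simplified from
vfio_listener_region_add() in hw/vfio/common.c, with the giommu
bookkeeping elided:

  if (memory_region_is_iommu(section->mr)) {
      ...
      memory_region_register_iommu_notifier(section->mr, &giommu->n, ...);
      /* Only here does the guest IOMMU page table actually get walked: */
      memory_region_iommu_replay(giommu->iommu, &giommu->n);
      return;
  }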

> 
> I do wonder if migration priorities are really what we want to reuse here. I
> guess it works right, but just by pure luck (because we ignore the implicit
> dependency regarding priorities)

Yep, that's probably not what we want.  I'll make it a new list of priorities
as I've stated in the other thread.

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 4/4] vl: Prioritize realizations of devices
  2021-08-24 15:50             ` Peter Xu
@ 2021-08-25  4:23               ` Jason Wang
  2021-09-06  9:22                 ` Eric Auger
  0 siblings, 1 reply; 52+ messages in thread
From: Jason Wang @ 2021-08-25  4:23 UTC (permalink / raw)
  To: Peter Xu
  Cc: Daniel P . Berrangé,
	Eduardo Habkost, Michael S. Tsirkin, David Hildenbrand,
	qemu-devel, Markus Armbruster, Eric Auger, Alex Williamson,
	Paolo Bonzini, Dr . David Alan Gilbert

On Tue, Aug 24, 2021 at 11:50 PM Peter Xu <peterx@redhat.com> wrote:
>
> On Tue, Aug 24, 2021 at 10:52:24AM +0800, Jason Wang wrote:
> > It looks to me like this doesn't solve the issue of using virtio-mmio with vhost?
>
> There's no IOMMU support for any of the MMIO devices, right?  Or am I wrong?  Thanks,

I guess virtio IOMMU has that support, but I might be wrong.

Thanks

>
> --
> Peter Xu
>



^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 4/4] vl: Prioritize realizations of devices
  2021-08-24 19:52           ` Peter Xu
@ 2021-08-25  8:08             ` David Hildenbrand
  0 siblings, 0 replies; 52+ messages in thread
From: David Hildenbrand @ 2021-08-25  8:08 UTC (permalink / raw)
  To: Peter Xu
  Cc: Daniel P. Berrangé,
	Eduardo Habkost, Michael S. Tsirkin, Jason Wang, qemu-devel,
	Markus Armbruster, Eric Auger, Alex Williamson, Paolo Bonzini,
	Dr . David Alan Gilbert

On 24.08.21 21:52, Peter Xu wrote:
> On Tue, Aug 24, 2021 at 06:24:27PM +0200, David Hildenbrand wrote:
>>>> Not that many; here's the list of priorities and the devices using them:
>>>>
>>>>          |--------------------+---------|
>>>>          | priority           | devices |
>>>>          |--------------------+---------|
>>>>          | MIG_PRI_IOMMU      |       3 |
>>>>          | MIG_PRI_PCI_BUS    |       7 |
>>>>          | MIG_PRI_VIRTIO_MEM |       1 |
>>>>          | MIG_PRI_GICV3_ITS  |       1 |
>>>>          | MIG_PRI_GICV3      |       1 |
>>>>          |--------------------+---------|
>>>
>>> iommu is probably ok. I think virtio mem is ok too,
>>> in that it is normally created by virtio-mem-pci ...
>>
>> IIRC:
>>
>> intel-iommu has to be created on the QEMU cmdline before creating
>> virtio-mem-pci.
>>
>> -device intel-iommu,caching-mode=on,intremap=on,device-iotlb=on \
>> ...
>> -device virtio-mem-pci,disable-legacy=on,disable-modern=off,iommu_platform=on,ats=on,id=vm0,...
>>
>> Creating virtio-mem-pci will implicitly create virtio-mem. virtio-mem device
>> state has to be migrated before migrating intel-iommu state.
> 
> Since we're at it.. frankly I didn't fully digest why virtio-mem needs to be
> migrated first when reading commit 0fd7616e0f1171b.  It gives me the feeling
> that virtio-mem's ordering dependency is with vfio-pci rather than with the
> vIOMMU itself, but I could be wrong.

"We have to take care of incoming migration: at the point the
  IOMMUs get restored and start creating mappings in vfio,
  RamDiscardManager implementations might not be back up and running yet:
  let's add runstate priorities to enforce the order when restoring.

s/runstate/vmstate/ :|

I recall that we can see IOMMU_NOTIFIER_MAP events when restoring an 
IOMMU device. vfio_get_xlat_addr() would be called and could reject 
mappings targeting virtio-mem memory in case the virtio-mem device has 
not restored its bitmap from the migration stream such that 
ram_discard_manager_is_populated() would be reliable. Consequently, we 
have to restore the virtio-mem device state (not the virtio-mem-pci 
device state!) before restoring an IOMMU.
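
From memory, the check in vfio_get_xlat_addr() is roughly:

   if (memory_region_has_ram_discard_manager(mr)) {
       RamDiscardManager *rdm = memory_region_get_ram_discard_manager(mr);
       MemoryRegionSection tmp = { .mr = mr, ... };

       /*
        * Before the virtio-mem device has restored its bitmap, this
        * answer is stale, so a MAP event targeting effectively-populated
        * memory could still be rejected here:
        */
       if (!ram_discard_manager_is_populated(rdm, &tmp)) {
           error_report("iommu map to discarded memory (e.g., unplugged"
                        " via virtio-mem) must be disallowed");
           return false;
       }
   }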



> 
> E.g., the IOMMU unit shouldn't be walking page tables when no device is using
> it; then nothing would fail until a device walks them in its region_add() hooks
> when memory replay happens.

I recall it happening when switching to the iommu region, e.g. in
vtd_post_load()->vtd_switch_address_space(). But I forget the exact call
path.

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 4/4] vl: Prioritize realizations of devices
  2021-08-23 23:05             ` Peter Xu
@ 2021-08-25  9:39               ` Markus Armbruster
  2021-08-25 12:28                 ` Markus Armbruster
  0 siblings, 1 reply; 52+ messages in thread
From: Markus Armbruster @ 2021-08-25  9:39 UTC (permalink / raw)
  To: Peter Xu
  Cc: Daniel P . Berrangé,
	Eduardo Habkost, Michael S . Tsirkin, Jason Wang, qemu-devel,
	Dr . David Alan Gilbert, Eric Auger, Alex Williamson,
	Paolo Bonzini

Peter Xu <peterx@redhat.com> writes:

> On Mon, Aug 23, 2021 at 05:56:23PM -0400, Eduardo Habkost wrote:
>> I don't have any other example, but I assume address assignment
>> based on ordering is a common pattern in device code.
>> 
>> I would take a very close and careful look at the devices with
>> non-default vmsd priority.  If you can prove that the 13 device
>> types with non-default priority are all order-insensitive, a
>> custom sort function as you describe might be safe.
>
> Besides virtio-mem-pci, there'll also be a similar devfn issue with all the
> MIG_PRI_PCI_BUS devices, as they'll be allocated just like other pci devices.
> Say, the two cmdlines below will generate different pci topologies too:
>
>   $ qemu-system-x86_64 -device pcie-root-port,chassis=0 \
>                        -device pcie-root-port,chassis=1 \
>                        -device virtio-net-pci
>
> And:
>
>   $ qemu-system-x86_64 -device pcie-root-port,chassis=0 \
>                        -device virtio-net-pci \
>                        -device pcie-root-port,chassis=1
>
> This cannot be solved by keeping the priority==0 ordering.
>
> On second thought, I think I was initially wrong to see migration
> priority and device realization as the same problem.
>
> For example, for live migration we have a requirement that PCI_BUS be
> migrated earlier than MIG_PRI_IOMMU, because the IOMMU relies on the bus
> numbers to find address spaces.  However, that's definitely not a requirement
> for device realization; say, realizing the vIOMMU after the pci buses is fine
> (bus numbers are assigned by the BIOS).
>
> I've probably mixed up the two ideas (though they really look alike!).  Sorry
> about that.
>
> Since the only ordering constraint so far is IOMMU vs all the rest of the
> devices, I'll introduce a new priority mechanism and only make sure vIOMMUs
> are realized earlier.  That'll also avoid other implications for pci devfn
> allocations.
>
> Will rework a new version tomorrow.  Thanks a lot for all the comments,

Is it really a good idea to magically reorder device realization just to
make a non-working command line work?  Why can't we just fail the
non-working command line in a way that tells users how to get a working
one?  We have way too much ordering magic already...

If we decide we want more magic, then I'd argue for *dependencies*
instead of priorities.  Dependencies are specific and local: $this needs
to go after $that because $reasons.  Priorities are unspecific and
global.



^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 4/4] vl: Prioritize realizations of devices
  2021-08-25  9:39               ` Markus Armbruster
@ 2021-08-25 12:28                 ` Markus Armbruster
  2021-08-25 21:50                   ` Peter Xu
  0 siblings, 1 reply; 52+ messages in thread
From: Markus Armbruster @ 2021-08-25 12:28 UTC (permalink / raw)
  To: Peter Xu
  Cc: Daniel P . Berrangé,
	Eduardo Habkost, Michael S . Tsirkin, Jason Wang, qemu-devel,
	Dr . David Alan Gilbert, Eric Auger, Alex Williamson,
	Paolo Bonzini

Markus Armbruster <armbru@redhat.com> writes:

> Peter Xu <peterx@redhat.com> writes:
>
>> On Mon, Aug 23, 2021 at 05:56:23PM -0400, Eduardo Habkost wrote:
>>> I don't have any other example, but I assume address assignment
>>> based on ordering is a common pattern in device code.
>>> 
>>> I would take a very close and careful look at the devices with
>>> non-default vmsd priority.  If you can prove that the 13 device
>>> types with non-default priority are all order-insensitive, a
>>> custom sort function as you describe might be safe.
>>
>> Besides virtio-mem-pci, there'll also be a similar devfn issue with all the
>> MIG_PRI_PCI_BUS devices, as they'll be allocated just like other pci devices.
>> Say, the two cmdlines below will generate different pci topologies too:
>>
>>   $ qemu-system-x86_64 -device pcie-root-port,chassis=0 \
>>                        -device pcie-root-port,chassis=1 \
>>                        -device virtio-net-pci
>>
>> And:
>>
>>   $ qemu-system-x86_64 -device pcie-root-port,chassis=0 \
>>                        -device virtio-net-pci \
>>                        -device pcie-root-port,chassis=1
>>
>> This cannot be solved by keeping the priority==0 ordering.
>>
>> On second thought, I think I was initially wrong to see migration
>> priority and device realization as the same problem.
>>
>> For example, for live migration we have a requirement that PCI_BUS be
>> migrated earlier than MIG_PRI_IOMMU, because the IOMMU relies on the bus
>> numbers to find address spaces.  However, that's definitely not a requirement
>> for device realization; say, realizing the vIOMMU after the pci buses is fine
>> (bus numbers are assigned by the BIOS).
>>
>> I've probably mixed up the two ideas (though they really look alike!).  Sorry
>> about that.
>>
>> Since the only ordering constraint so far is IOMMU vs all the rest of the
>> devices, I'll introduce a new priority mechanism and only make sure vIOMMUs
>> are realized earlier.  That'll also avoid other implications for pci devfn
>> allocations.
>>
>> Will rework a new version tomorrow.  Thanks a lot for all the comments,
>
> Is it really a good idea to magically reorder device realization just to
> make a non-working command line work?  Why can't we just fail the
> non-working command line in a way that tells users how to get a working
> one?  We have way too much ordering magic already...
>
> If we decide we want more magic, then I'd argue for *dependencies*
> instead of priorities.  Dependencies are specific and local: $this needs
> to go after $that because $reasons.  Priorities are unspecific and
> global.

Having thought about this a bit more...

Constraints on realize order are nothing new.  For instance, when a
device plugs into a bus, it needs to be realized after the device
providing the bus.

We ensure this by having the device refer to the bus, e.g. bus=pci.0.
The reference may be implicit, but it's there.  It must resolve for
device creation to succeed, and if it resolves, the device providing the
bus will be realized in time.

I believe what's getting us into trouble with IOMMU is not having such a
reference.  Or in other words, keeping the dependence between the IOMMU
and the devices relying on it *implicit*, and thus hidden from the
existing realize-ordering machinery.

Instead of inventing another such machinery, let's try to use the one we
already have.



^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 4/4] vl: Prioritize realizations of devices
  2021-08-25 12:28                 ` Markus Armbruster
@ 2021-08-25 21:50                   ` Peter Xu
  2021-08-26  3:50                     ` Peter Xu
  2021-08-26  4:57                     ` Markus Armbruster
  0 siblings, 2 replies; 52+ messages in thread
From: Peter Xu @ 2021-08-25 21:50 UTC (permalink / raw)
  To: Markus Armbruster
  Cc: Daniel P . Berrangé,
	Eduardo Habkost, Michael S . Tsirkin, Jason Wang, qemu-devel,
	Dr . David Alan Gilbert, Eric Auger, Alex Williamson,
	Paolo Bonzini

On Wed, Aug 25, 2021 at 02:28:55PM +0200, Markus Armbruster wrote:
> Markus Armbruster <armbru@redhat.com> writes:
> 
> > Peter Xu <peterx@redhat.com> writes:
> >
> >> On Mon, Aug 23, 2021 at 05:56:23PM -0400, Eduardo Habkost wrote:
> >>> I don't have any other example, but I assume address assignment
> >>> based on ordering is a common pattern in device code.
> >>> 
> >>> I would take a very close and careful look at the devices with
> >>> non-default vmsd priority.  If you can prove that the 13 device
> >>> types with non-default priority are all order-insensitive, a
> >>> custom sort function as you describe might be safe.
> >>
> >> Besides virtio-mem-pci, there'll also be a similar devfn issue with all the
> >> MIG_PRI_PCI_BUS devices, as they'll be allocated just like other pci devices.
> >> Say, the two cmdlines below will generate different pci topologies too:
> >>
> >>   $ qemu-system-x86_64 -device pcie-root-port,chassis=0 \
> >>                        -device pcie-root-port,chassis=1 \
> >>                        -device virtio-net-pci
> >>
> >> And:
> >>
> >>   $ qemu-system-x86_64 -device pcie-root-port,chassis=0 \
> >>                        -device virtio-net-pci \
> >>                        -device pcie-root-port,chassis=1
> >>
> >> This cannot be solved by keeping the priority==0 ordering.
> >>
> >> On second thought, I think I was initially wrong to see migration
> >> priority and device realization as the same problem.
> >>
> >> For example, for live migration we have a requirement that PCI_BUS be
> >> migrated earlier than MIG_PRI_IOMMU, because the IOMMU relies on the bus
> >> numbers to find address spaces.  However, that's definitely not a requirement
> >> for device realization; say, realizing the vIOMMU after the pci buses is fine
> >> (bus numbers are assigned by the BIOS).
> >>
> >> I've probably mixed up the two ideas (though they really look alike!).  Sorry
> >> about that.
> >>
> >> Since the only ordering constraint so far is IOMMU vs all the rest of the
> >> devices, I'll introduce a new priority mechanism and only make sure vIOMMUs
> >> are realized earlier.  That'll also avoid other implications for pci devfn
> >> allocations.
> >>
> >> Will rework a new version tomorrow.  Thanks a lot for all the comments,
> >
> > Is it really a good idea to magically reorder device realization just to
> > make a non-working command line work?  Why can't we just fail the
> > non-working command line in a way that tells users how to get a working
> > one?  We have way too much ordering magic already...
> >
> > If we decide we want more magic, then I'd argue for *dependencies*
> > instead of priorities.  Dependencies are specific and local: $this needs
> > to go after $that because $reasons.  Priorities are unspecific and
> > global.
> 
> Having thought about this a bit more...
> 
> Constraints on realize order are nothing new.  For instance, when a
> device plugs into a bus, it needs to be realized after the device
> providing the bus.
> 
> We ensure this by having the device refer to the bus, e.g. bus=pci.0.
> The reference may be implicit, but it's there.  It must resolve for
> device creation to succeed, and if it resolves, the device providing the
> bus will be realized in time.
> 
> I believe what's getting us into trouble with IOMMU is not having such a
> reference.  Or in other words, keeping the dependence between the IOMMU
> and the devices relying on it *implicit*, and thus hidden from the
> existing realize-ordering machinery.
> 
> Instead of inventing another such machinery, let's try to use the one we
> already have.

Hmm... I just found that we don't have such machinery, do we?

This does not really work:

$ ./qemu-system-x86_64 -M q35 -device virtio-net-pci,bus=pcie.1 \
                       -device pcie-root-port,id=pcie.1,bus=pcie.0
qemu-system-x86_64: -device virtio-net-pci,bus=pcie.1: Bus 'pcie.1' not found

While this will:

$ ./qemu-system-x86_64 -M q35 -device pcie-root-port,id=pcie.1,bus=pcie.0 \
                       -device virtio-net-pci,bus=pcie.1

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 4/4] vl: Prioritize realizations of devices
  2021-08-25 21:50                   ` Peter Xu
@ 2021-08-26  3:50                     ` Peter Xu
  2021-08-26  8:01                       ` Markus Armbruster
  2021-08-26  4:57                     ` Markus Armbruster
  1 sibling, 1 reply; 52+ messages in thread
From: Peter Xu @ 2021-08-26  3:50 UTC (permalink / raw)
  To: Markus Armbruster
  Cc: Daniel P . Berrangé,
	Eduardo Habkost, Michael S . Tsirkin, Jason Wang, qemu-devel,
	Dr . David Alan Gilbert, Eric Auger, Alex Williamson,
	Paolo Bonzini

On Wed, Aug 25, 2021 at 05:50:23PM -0400, Peter Xu wrote:
> On Wed, Aug 25, 2021 at 02:28:55PM +0200, Markus Armbruster wrote:
> > Markus Armbruster <armbru@redhat.com> writes:
> > 
> > > Peter Xu <peterx@redhat.com> writes:
> > >
> > >> On Mon, Aug 23, 2021 at 05:56:23PM -0400, Eduardo Habkost wrote:
> > >>> I don't have any other example, but I assume address assignment
> > >>> based on ordering is a common pattern in device code.
> > >>> 
> > >>> I would take a very close and careful look at the devices with
> > >>> non-default vmsd priority.  If you can prove that the 13 device
> > >>> types with non-default priority are all order-insensitive, a
> > >>> custom sort function as you describe might be safe.
> > >>
> > >> Besides virtio-mem-pci, there'll also be a similar devfn issue with all the
> > >> MIG_PRI_PCI_BUS devices, as they'll be allocated just like other pci devices.
> > >> Say, the two cmdlines below will generate different pci topologies too:
> > >>
> > >>   $ qemu-system-x86_64 -device pcie-root-port,chassis=0 \
> > >>                        -device pcie-root-port,chassis=1 \
> > >>                        -device virtio-net-pci
> > >>
> > >> And:
> > >>
> > >>   $ qemu-system-x86_64 -device pcie-root-port,chassis=0 \
> > >>                        -device virtio-net-pci \
> > >>                        -device pcie-root-port,chassis=1
> > >>
> > >> This cannot be solved by keeping the priority==0 ordering.
> > >>
> > >> On second thought, I think I was initially wrong to see migration
> > >> priority and device realization as the same problem.
> > >>
> > >> For example, for live migration we have a requirement that PCI_BUS be
> > >> migrated earlier than MIG_PRI_IOMMU, because the IOMMU relies on the bus
> > >> numbers to find address spaces.  However, that's definitely not a requirement
> > >> for device realization; say, realizing the vIOMMU after the pci buses is fine
> > >> (bus numbers are assigned by the BIOS).
> > >>
> > >> I've probably mixed up the two ideas (though they really look alike!).  Sorry
> > >> about that.
> > >>
> > >> Since the only ordering constraint so far is IOMMU vs all the rest of the
> > >> devices, I'll introduce a new priority mechanism and only make sure vIOMMUs
> > >> are realized earlier.  That'll also avoid other implications for pci devfn
> > >> allocations.
> > >>
> > >> Will rework a new version tomorrow.  Thanks a lot for all the comments,
> > >
> > > Is it really a good idea to magically reorder device realization just to
> > > make a non-working command line work?  Why can't we just fail the
> > > non-working command line in a way that tells users how to get a working
> > > one?  We have way too much ordering magic already...
> > >
> > > If we decide we want more magic, then I'd argue for *dependencies*
> > > instead of priorities.  Dependencies are specific and local: $this needs
> > > to go after $that because $reasons.  Priorities are unspecific and
> > > global.
> > 
> > Having thought about this a bit more...
> > 
> > Constraints on realize order are nothing new.  For instance, when a
> > device plugs into a bus, it needs to be realized after the device
> > providing the bus.
> > 
> > We ensure this by having the device refer to the bus, e.g. bus=pci.0.
> > The reference may be implicit, but it's there.  It must resolve for
> > device creation to succeed, and if it resolves, the device providing the
> > bus will be realized in time.
> > 
> > I believe what's getting us into trouble with IOMMU is not having such a
> > reference.  Or in other words, keeping the dependence between the IOMMU
> > and the devices relying on it *implicit*, and thus hidden from the
> > existing realize-ordering machinery.

[1]

> > 
> > Instead of inventing another such machinery, let's try to use the one we
> > already have.
> 
> Hmm... I just found that we don't have such machinery, do we?
> 
> This does not really work:
> 
> $ ./qemu-system-x86_64 -M q35 -device virtio-net-pci,bus=pcie.1 \
>                        -device pcie-root-port,id=pcie.1,bus=pcie.0
> qemu-system-x86_64: -device virtio-net-pci,bus=pcie.1: Bus 'pcie.1' not found
> 
> While this will:
> 
> $ ./qemu-system-x86_64 -M q35 -device pcie-root-port,id=pcie.1,bus=pcie.0 \
>                        -device virtio-net-pci,bus=pcie.1

I think I fully agree with [1]: the iommu is special in that it's not only
implicit, but also has no default value.  Pci buses have default
values (the root pci bus, e.g. pcie.0 on q35), so they're actually easier.

When parsing the "-device" entry for a pci device, we're 100% sure what the
pci bus for the device is: either it's specified, or it's just the default.  We
don't need to look at the rest of the "-device"s to know.  We just try to look
up the pci bus; if it's there we continue, otherwise we abort.
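
That lookup is roughly just this today (simplified from qdev_device_add()
in softmmu/qdev-monitor.c; it's what produces the "Bus 'pcie.1' not
found" error in my example above):

    if (path != NULL) {
        bus = qbus_find(path, errp);    /* error: "Bus '%s' not found" */
        if (!bus) {
            return NULL;                /* abort; no reordering, no retry */
        }
        ...
    }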

But the IOMMU is different: a device can run without a vIOMMU (in which case
there's no dependency), but when a vIOMMU is specified, the dependency is there.

That's why I am not sure whether the old machinery and "early failure" solution
would work trivially for IOMMUs.

We can add an "-iommu" parameter, then we simply parse it before "-device".
However I forgot why Marcel (IIRC) made it a "-device" parameter, also "-iommu"
is not ideal in that the IOMMU unit is indeed implemented as a device for the
archs I'm aware of, so it's kind of some extra glue that seems to just work
around the ordering problem we have right now.  Meanwhile that solution won't
help the ordering issue of pci bus/device.

That's why I still think the idea of having a global priority for device
realization (or describing the dependencies between devices) makes sense.  We
could actually fix both the IOMMU and the pci bus cases, allowing a pci bus to
be specified later than a pci device that belongs to it, alongside fixing the
IOMMU.

IMHO a global priority list (maybe a realize_priority field in the device
class) is just a simpler version of device dependencies.  Say, at least we
don't need to worry about dependency cycles.  So far the ordering
requirements are still simple, so defining them globally should fix most of the
issues already, in a straightforward way and with fewer LOCs.  If it gets
complicated one day, we can always switch the global priority list to device
dependencies easily, because that's not guest ABI.
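
To illustrate (just a sketch - the realize_priority field and the helper
below are made up, nothing like this exists yet):

    /* hypothetical field in DeviceClass: higher realizes earlier */
    int realize_priority;

    /* the vIOMMU class_init would set: dc->realize_priority = 1; */

    /* sort the -device list before realization; a stable sort
     * (e.g. g_list_sort(), which IIRC is stable, unlike qsort())
     * keeps the user's relative order within the same priority: */
    static gint dev_realize_prio_cmp(gconstpointer a, gconstpointer b)
    {
        /* realize_prio_of() is made up as well */
        return realize_prio_of(b) - realize_prio_of(a);
    }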

Any further thoughts will be greatly welcomed.

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 4/4] vl: Prioritize realizations of devices
  2021-08-25 21:50                   ` Peter Xu
  2021-08-26  3:50                     ` Peter Xu
@ 2021-08-26  4:57                     ` Markus Armbruster
  1 sibling, 0 replies; 52+ messages in thread
From: Markus Armbruster @ 2021-08-26  4:57 UTC (permalink / raw)
  To: Peter Xu
  Cc: Daniel P . Berrangé,
	Eduardo Habkost, Michael S . Tsirkin, Jason Wang, qemu-devel,
	Markus Armbruster, Eric Auger, Alex Williamson, Paolo Bonzini,
	Dr . David Alan Gilbert

Peter Xu <peterx@redhat.com> writes:

> On Wed, Aug 25, 2021 at 02:28:55PM +0200, Markus Armbruster wrote:
>> Markus Armbruster <armbru@redhat.com> writes:
>> 
>> > Peter Xu <peterx@redhat.com> writes:
>> >
>> >> On Mon, Aug 23, 2021 at 05:56:23PM -0400, Eduardo Habkost wrote:
>> >>> I don't have any other example, but I assume address assignment
>> >>> based on ordering is a common pattern in device code.
>> >>> 
>> >>> I would take a very close and careful look at the devices with
>> >>> non-default vmsd priority.  If you can prove that the 13 device
>> >>> types with non-default priority are all order-insensitive, a
>> >>> custom sort function as you describe might be safe.
>> >>
>> >> Besides virtio-mem-pci, there'll also be a similar devfn issue with all the
>> >> MIG_PRI_PCI_BUS devices, as they'll be allocated just like other pci devices.
>> >> Say, the two cmdlines below will generate different pci topologies too:
>> >>
>> >>   $ qemu-system-x86_64 -device pcie-root-port,chassis=0 \
>> >>                        -device pcie-root-port,chassis=1 \
>> >>                        -device virtio-net-pci
>> >>
>> >> And:
>> >>
>> >>   $ qemu-system-x86_64 -device pcie-root-port,chassis=0 \
>> >>                        -device virtio-net-pci \
>> >>                        -device pcie-root-port,chassis=1
>> >>
>> >> This cannot be solved by keeping the priority==0 ordering.
>> >>
>> >> On second thought, I think I was initially wrong to see migration
>> >> priority and device realization as the same problem.
>> >>
>> >> For example, for live migration we have a requirement that PCI_BUS be
>> >> migrated earlier than MIG_PRI_IOMMU, because the IOMMU relies on the bus
>> >> numbers to find address spaces.  However, that's definitely not a requirement
>> >> for device realization; say, realizing the vIOMMU after the pci buses is fine
>> >> (bus numbers are assigned by the BIOS).
>> >>
>> >> I've probably mixed up the two ideas (though they really look alike!).  Sorry
>> >> about that.
>> >>
>> >> Since the only ordering constraint so far is IOMMU vs all the rest of the
>> >> devices, I'll introduce a new priority mechanism and only make sure vIOMMUs
>> >> are realized earlier.  That'll also avoid other implications for pci devfn
>> >> allocations.
>> >>
>> >> Will rework a new version tomorrow.  Thanks a lot for all the comments,
>> >
>> > Is it really a good idea to magically reorder device realization just to
>> > make a non-working command line work?  Why can't we just fail the
>> > non-working command line in a way that tells users how to get a working
>> > one?  We have way too much ordering magic already...
>> >
>> > If we decide we want more magic, then I'd argue for *dependencies*
>> > instead of priorities.  Dependencies are specific and local: $this needs
>> > to go after $that because $reasons.  Priorities are unspecific and
>> > global.
>> 
>> Having thought about this a bit more...
>> 
>> Constraints on realize order are nothing new.  For instance, when a
>> device plugs into a bus, it needs to be realized after the device
>> providing the bus.
>> 
>> We ensure this by having the device refer to the bus, e.g. bus=pci.0.
>> The reference may be implicit, but it's there.  It must resolve for
>> device creation to succeed, and if it resolves, the device providing the
>> bus will be realized in time.
>> 
>> I believe what's getting us into trouble with IOMMU is not having such a
>> reference.  Or in other words, keeping the dependence between the IOMMU
>> and the devices relying on it *implicit*, and thus hidden from the
>> existing realize-ordering machinery.
>> 
>> Instead of inventing another such machinery, let's try to use the one we
>> already have.
>
> Hmm... I just found that we don't have such machinery, do we?
>
> This does not really work:
>
> $ ./qemu-system-x86_64 -M q35 -device virtio-net-pci,bus=pcie.1 \
>                        -device pcie-root-port,id=pcie.1,bus=pcie.0
> qemu-system-x86_64: -device virtio-net-pci,bus=pcie.1: Bus 'pcie.1' not found
>
> While this will:
>
> $ ./qemu-system-x86_64 -M q35 -device pcie-root-port,id=pcie.1,bus=pcie.0 \
>                        -device virtio-net-pci,bus=pcie.1

This is exactly what I described.  bus=pcie.0 is the explicit reference.
It must resolve for device creation to succeed, and if it resolves, the
device providing the bus will be realized in time.  It resolves in the
second example, but not the first.

Look ma, no magic!  Instead, stupid & predictable.



^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 4/4] vl: Prioritize realizations of devices
  2021-08-26  3:50                     ` Peter Xu
@ 2021-08-26  8:01                       ` Markus Armbruster
  2021-08-26 11:36                         ` Igor Mammedov
  0 siblings, 1 reply; 52+ messages in thread
From: Markus Armbruster @ 2021-08-26  8:01 UTC (permalink / raw)
  To: Peter Xu
  Cc: Daniel P . Berrangé,
	Eduardo Habkost, Michael S . Tsirkin, Jason Wang, qemu-devel,
	Dr . David Alan Gilbert, Eric Auger, Alex Williamson,
	Paolo Bonzini

Peter Xu <peterx@redhat.com> writes:

> On Wed, Aug 25, 2021 at 05:50:23PM -0400, Peter Xu wrote:
>> On Wed, Aug 25, 2021 at 02:28:55PM +0200, Markus Armbruster wrote:
>> > Having thought about this a bit more...
>> > 
>> > Constraints on realize order are nothing new.  For instance, when a
>> > device plugs into a bus, it needs to be realized after the device
>> > providing the bus.
>> > 
>> > We ensure this by having the device refer to the bus, e.g. bus=pci.0.
>> > The reference may be implicit, but it's there.  It must resolve for
>> > device creation to succeed, and if it resolves, the device providing the
>> > bus will be realized in time.
>> > 
>> > I believe what's getting us into trouble with IOMMU is not having such a
>> > reference.  Or in other words, keeping the dependence between the IOMMU
>> > and the devices relying on it *implicit*, and thus hidden from the
>> > existing realize-ordering machinery.
>
> [1]
>
>> > 
>> > Instead of inventing another such machinery, let's try to use the one we
>> > already have.
>> 
>> Hmm... I just found that we don't have such machinery, do we?
>> 
>> This does not really work:
>> 
>> $ ./qemu-system-x86_64 -M q35 -device virtio-net-pci,bus=pcie.1 \
>>                        -device pcie-root-port,id=pcie.1,bus=pcie.0
>> qemu-system-x86_64: -device virtio-net-pci,bus=pcie.1: Bus 'pcie.1' not found
>> 
>> While this will:
>> 
>> $ ./qemu-system-x86_64 -M q35 -device pcie-root-port,id=pcie.1,bus=pcie.0 \
>>                        -device virtio-net-pci,bus=pcie.1
>
> I think I fully agree with [1]: the iommu is special in that it's not only
> implicit, but also has no default value.  Pci buses have default
> values (the root pci bus, e.g. pcie.0 on q35), so they're actually easier.

To be precise: when device property "bus" is absent, and the device
plugs into a bus, QEMU picks a suitable bus.  If it can't, device
creation fails.

It can't when no bus of the required type exists, or the existing buses
are all full.

Sometimes, the board provides a bus of the required type.  If not, you
have to plug one, and you have to do it before you refer to it, whether
the reference is explicit (bus=...) or implicit (bus absent).

Example 1: no bus of the required type

    $ qemu-system-x86_64 -S -display none -device usb-mouse
    qemu-system-x86_64: -device usb-mouse: No 'usb-bus' bus found for device 'usb-mouse'

Example 2: no bus of the required type, yet

    $ qemu-system-x86_64 -S -display none -device usb-mouse -device qemu-xhci
    qemu-system-x86_64: -device usb-mouse: No 'usb-bus' bus found for device 'usb-mouse'

Example 3: reordered to the bus is there in time

    $ qemu-system-x86_64 -S -display none -device qemu-xhci -device usb-mouse

Example 4: got a bus, but it's full

    $ qemu-system-x86_64 -S -display none -monitor stdio -device e1000 -device e1000 -device e1000 -device e1000 -device e1000 -device e1000 -device e1000 -device e1000 -device e1000 -device e1000 -device e1000 -device e1000 -device e1000 -device e1000 -device e1000 -device e1000 -device e1000 -device e1000 -device e1000 -device e1000 -device e1000 -device e1000 -device e1000 -device e1000 -device e1000 -device e1000 -device e1000 -device e1000 -device e1000,id=noslot
    QEMU 6.1.0 monitor - type 'help' for more information
    (qemu) qemu-system-x86_64: -device e1000,id=noslot: PCI: no slot/function available for e1000, all in use or reserved

> When parsing the "-device" entry for a pci device, we'll 100% sure what's the
> pci bus for the device, either specified or it's just the default.  We don't
> look at the rest of "-device"s to be sure of it.  We just try to look up the
> pci bus, if it's there we continue, otherwise we abort.

Yes.  It's a mandatory dependency, optionally explicit.

> But the IOMMU is different: a device can run without a vIOMMU (in which case
> there's no dependency), but when a vIOMMU is specified, the dependency is there.
>
> That's why I am not sure whether the old machinery and "early failure" solution
> would work trivially for IOMMUs.

Yes.  It's an optional, implicit dependency.

> We can add an "-iommu" parameter, then we simply parse it before "-device".
> However I forgot why Marcel (IIRC) made it a "-device" parameter, also "-iommu"
> is not ideal in that the IOMMU unit is indeed implemented as a device for the
> archs I'm aware of, so it's kind of some extra glue that seems to just work
> around the ordering problem we have right now.  Meanwhile that solution won't
> help the ordering issue of pci bus/device.

The "device providing bus before device plugging into bus" dependency is
not actually a problem: -device and device_add ensure by design it's
satisfied.

As discussed before, there are two workable ways to process the command
line: strictly left to right (leave ordering to the user), and "do the
right thing" (order of options doesn't matter).

Of course, we do neither.  We kind of try to do the right thing, by
adding special cases whenever we get bitten.  Order doesn't matter,
except when it does, and things work, except when they don't.
Reordering your command line may or may not get it to work.

Fails the basic interface taste test: would explaining it in plain
English be impractical and/or embarrassing?

How to best get out of this self-dug hole isn't obvious.  Switching to
strictly left to right will break some command lines.  Making order
truly not matter looks hard, because the dependencies are complex and
not well understood.  But this isn't the problem at hand here.  It's
whether to add more magic, or do without.

I'm wary about piling more magic onto a heap of magic that's already
wobbling under its weight.

So, can we do without?

The issue, as I understand it, is that certain devices need to know at
realize time whether the machine has an IOMMU.  When you add an IOMMU
with -device, you morph the machine from one that doesn't have one into
one that does.  Therefore, it needs to be added before any of the
devices that need to know whether the machine has one.
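
The pattern at the heart of it looks like this (simplified; "mydev"
stands for any such device, and vfio-pci's real code is more involved):

    static void mydev_realize(PCIDevice *pdev, Error **errp)
    {
        /*
         * This samples the machine's IOMMU state *now*: if the vIOMMU
         * is realized only later, we get &address_space_memory back
         * here, and the device never learns about the IOMMU.
         */
        AddressSpace *as = pci_device_iommu_address_space(pdev);
        ...
    }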

> That's why I still think the idea of having a global priority for device
> realization (or describing the dependencies between devices) makes sense.

The proposed solution is to magically reorder -device so that IOMMU get
created before devices that need to know.

This solution falls apart when we use QMP device_add instead of -device,
because we can't reorder QMP commands.

Right now we can't use device_add for everything, but that's fixable.
Maybe it's been fixed already.

To keep the proposed solution working, we'd need even more magic.

>                                                                   We could
> actually fix both the IOMMU and the pci bus cases, allowing a pci bus to be
> specified later than a pci device that belongs to it, alongside fixing the IOMMU.

To "fix" the bus issue, we'd need to delay resolution of property bus
until realize time.  This might break .instance_init() methods.  Of
which we have several hundred.

> IMHO a global priority list (maybe a realize_priority field in the device
> class) is just a simpler version of device dependencies.  Say, at least we
> don't need to worry about dependency cycles.  So far the ordering
> requirements are still simple, so defining them globally should fix most of the
> issues already, in a straightforward way and with fewer LOCs.  If it gets
> complicated one day, we can always switch the global priority list to device
> dependencies easily, because that's not guest ABI.
>
> Any further thoughts will be greatly welcomed.

I'd like to propose a more modest solution: detect the problem and fail.

A simple state machine can track "has IOMMU" state.  It has three states
"no so far", "yes", and "no", and two events "add IOMMU" and "add device
that needs to know".  State diagram:

                          no so far
                   +--- (start state) ---+
                   |                     |
         add IOMMU |                     | add device that
                   |                     |  needs to know
                   v                     v
             +--> yes                    no <--+
             |     |   add device that   |     |
             +-----+    needs to know    +-----+

"Add IOMMU" in state "no" is an error.

"Add IOMMU" in state "yes" stays in state "yes" if multiple IOMMU make
sense.  Else it's an error.

The state machine could be owned by the machine type.
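
In code, the sketch could be as small as this (all names made up; say
the state lives in the machine):

    typedef enum {
        IOMMU_NO_SO_FAR,        /* start state */
        IOMMU_YES,
        IOMMU_NO,
    } IommuState;

    static bool machine_add_iommu(MachineState *ms, Error **errp)
    {
        if (ms->iommu_state == IOMMU_NO) {
            error_setg(errp, "IOMMU must be created before the devices"
                       " that need to know about it");
            return false;
        }
        ms->iommu_state = IOMMU_YES;    /* assuming multiple IOMMUs are ok */
        return true;
    }

    static void machine_add_iommu_aware_device(MachineState *ms)
    {
        if (ms->iommu_state == IOMMU_NO_SO_FAR) {
            ms->iommu_state = IOMMU_NO; /* the answer is frozen from now on */
        }
    }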



^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 4/4] vl: Prioritize realizations of devices
  2021-08-26  8:01                       ` Markus Armbruster
@ 2021-08-26 11:36                         ` Igor Mammedov
  2021-08-26 13:43                           ` Peter Xu
  0 siblings, 1 reply; 52+ messages in thread
From: Igor Mammedov @ 2021-08-26 11:36 UTC (permalink / raw)
  To: Markus Armbruster
  Cc: Daniel P . Berrangé,
	Eduardo Habkost, Michael S . Tsirkin, Jason Wang, qemu-devel,
	Peter Xu, Dr . David Alan Gilbert, Eric Auger, Alex Williamson,
	Paolo Bonzini

On Thu, 26 Aug 2021 10:01:01 +0200
Markus Armbruster <armbru@redhat.com> wrote:

> Peter Xu <peterx@redhat.com> writes:
> 
> > On Wed, Aug 25, 2021 at 05:50:23PM -0400, Peter Xu wrote:  
> >> On Wed, Aug 25, 2021 at 02:28:55PM +0200, Markus Armbruster wrote:  
> >> > Having thought about this a bit more...
...
> > Any further thoughts will be greatly welcomed.  
> 
> I'd like to propose a more modest solution: detect the problem and fail.
That, or proper dependency tracking (which stands a chance of working with QMP,
but as was said, it's complex)

> A simple state machine can track "has IOMMU" state.  It has three states
> "no so far", "yes", and "no", and two events "add IOMMU" and "add device
> that needs to know".  State diagram:
> 
>                           no so far
>                    +--- (start state) ---+
>                    |                     |
>          add IOMMU |                     | add device that
>                    |                     |  needs to know
>                    v                     v
>              +--> yes                    no <--+
>              |     |   add device that   |     |
>              +-----+    needs to know    +-----+
> 
> "Add IOMMU" in state "no" is an error.

The question is how we distinguish a "device that needs to know"
from a device that doesn't need to know; and then the recently
added 'bypass IOMMU' feature adds more fun to this.
 
> "Add IOMMU" in state "yes" stays in state "yes" if multiple IOMMU make
> sense.  Else it's an error.
> 
> The state machine could be owned by the machine type.
> 
> 



^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 4/4] vl: Prioritize realizations of devices
  2021-08-26 11:36                         ` Igor Mammedov
@ 2021-08-26 13:43                           ` Peter Xu
  2021-08-30 19:02                             ` Peter Xu
  2021-09-02  7:46                             ` Igor Mammedov
  0 siblings, 2 replies; 52+ messages in thread
From: Peter Xu @ 2021-08-26 13:43 UTC (permalink / raw)
  To: Igor Mammedov
  Cc: Daniel P . Berrangé,
	Eduardo Habkost, Michael S . Tsirkin, Jason Wang, qemu-devel,
	Markus Armbruster, Eric Auger, Alex Williamson, Paolo Bonzini,
	Dr . David Alan Gilbert

On Thu, Aug 26, 2021 at 01:36:29PM +0200, Igor Mammedov wrote:
> On Thu, 26 Aug 2021 10:01:01 +0200
> Markus Armbruster <armbru@redhat.com> wrote:
> 
> > Peter Xu <peterx@redhat.com> writes:
> > 
> > > On Wed, Aug 25, 2021 at 05:50:23PM -0400, Peter Xu wrote:  
> > >> On Wed, Aug 25, 2021 at 02:28:55PM +0200, Markus Armbruster wrote:  
> > >> > Having thought about this a bit more...
> ...
> > > Any further thoughts will be greatly welcomed.  
> > 
> > I'd like to propose a more modest solution: detect the problem and fail.
> That, or proper dependency tracking (which stands a chance of working with QMP,
> but as was said, it's complex)
> 
> > A simple state machine can track "has IOMMU" state.  It has three states
> > "no so far", "yes", and "no", and two events "add IOMMU" and "add device
> > that needs to know".  State diagram:
> > 
> >                           no so far
> >                    +--- (start state) ---+
> >                    |                     |
> >          add IOMMU |                     | add device that
> >                    |                     |  needs to know
> >                    v                     v
> >              +--> yes                    no <--+
> >              |     |   add device that   |     |
> >              +-----+    needs to know    +-----+
> > 
> > "Add IOMMU" in state "no" is an error.
> 
> The question is how we distinguish a "device that needs to know"
> from a device that doesn't need to know; and then the recently
> added 'bypass IOMMU' feature adds more fun to this.

Maybe we can start from "no device needs to know", then add more devices as
the list expands?

vfio-pci should be a natural fit because we're sure it won't break any valid
old configuration.  However, we may need to be careful when adding more devices,
e.g. where some old configuration might still work on old binaries; but I'm not
sure.

Off-topic: I'm wondering whether bypass_iommu is just a work-around for not
using iommu=pt on the guest cmdline?  Is there any performance comparison of
using bypass_iommu against having the iommu but with iommu=pt?  At least the
intel iommu supports pt explicitly, i.e. it shouldn't even need a shadow iommu
pgtable in the guest, only a single bit in the device context entry saying
"this device passes through the iommu", so I would expect the performance to be
similar to explicitly bypassing the iommu on the qemu cmdline.

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 4/4] vl: Prioritize realizations of devices
  2021-08-26 13:43                           ` Peter Xu
@ 2021-08-30 19:02                             ` Peter Xu
  2021-08-31 11:35                               ` Markus Armbruster
  2021-09-02  8:26                               ` Igor Mammedov
  2021-09-02  7:46                             ` Igor Mammedov
  1 sibling, 2 replies; 52+ messages in thread
From: Peter Xu @ 2021-08-30 19:02 UTC (permalink / raw)
  To: Igor Mammedov
  Cc: Daniel P . Berrangé,
	Eduardo Habkost, Michael S . Tsirkin, Jason Wang, qemu-devel,
	Markus Armbruster, Eric Auger, Alex Williamson, Paolo Bonzini,
	Dr . David Alan Gilbert

On Thu, Aug 26, 2021 at 09:43:59AM -0400, Peter Xu wrote:
> > > A simple state machine can track "has IOMMU" state.  It has three states
> > > "no so far", "yes", and "no", and two events "add IOMMU" and "add device
> > > that needs to know".  State diagram:
> > > 
> > >                           no so far
> > >                    +--- (start state) ---+
> > >                    |                     |
> > >          add IOMMU |                     | add device that
> > >                    |                     |  needs to know
> > >                    v                     v
> > >              +--> yes                    no <--+
> > >              |     |   add device that   |     |
> > >              +-----+    needs to know    +-----+
> > > 
> > > "Add IOMMU" in state "no" is an error.
> > 
> > The question is how we distinguish a "device that needs to know"
> > from a device that doesn't need to know; and then the recently
> > added 'bypass IOMMU' feature adds more fun to this.
> 
> Maybe we can start from "no device needs to know"? Then add more into it when
> the list expands.
> 
> vfio-pci should be a natural fit because we're sure it won't break any valid
> old configurations.  However we may need to be careful on adding more devices,
> e.g. when some old configuration might work on old binaries, but I'm not sure.

Btw, I think this state machine does bring some complexity even in
understanding it - it definitely works, but it's not obvious to anyone at
first glance, and it only solves the problem for the vIOMMU.  E.g., do we need
yet another state machine for some other ordering constraint?

From that POV, I don't like this solution any more than the simple "assign
priority for device realization" idea..

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 4/4] vl: Prioritize realizations of devices
  2021-08-30 19:02                             ` Peter Xu
@ 2021-08-31 11:35                               ` Markus Armbruster
  2021-09-02  8:26                               ` Igor Mammedov
  1 sibling, 0 replies; 52+ messages in thread
From: Markus Armbruster @ 2021-08-31 11:35 UTC (permalink / raw)
  To: Peter Xu
  Cc: Daniel P . Berrangé,
	Eduardo Habkost, Michael S . Tsirkin, Jason Wang, qemu-devel,
	Dr . David Alan Gilbert, Eric Auger, Alex Williamson,
	Paolo Bonzini, Igor Mammedov

Peter Xu <peterx@redhat.com> writes:

> On Thu, Aug 26, 2021 at 09:43:59AM -0400, Peter Xu wrote:
>> > > A simple state machine can track "has IOMMU" state.  It has three states
>> > > "no so far", "yes", and "no", and two events "add IOMMU" and "add device
>> > > that needs to know".  State diagram:
>> > > 
>> > >                           no so far
>> > >                    +--- (start state) ---+
>> > >                    |                     |
>> > >          add IOMMU |                     | add device that
>> > >                    |                     |  needs to know
>> > >                    v                     v
>> > >              +--> yes                    no <--+
>> > >              |     |   add device that   |     |
>> > >              +-----+    needs to know    +-----+
>> > > 
>> > > "Add IOMMU" in state "no" is an error.
>> > 
>> > question is how we distinguish "device that needs to know"
>> > from device that doesn't need to know, and then recently
>> > added feature 'bypass IOMMU' adds more fun to this.
>> 
>> Maybe we can start from "no device needs to know"? Then add more into it when
>> the list expands.
>> 
>> vfio-pci should be a natural fit because we're sure it won't break any valid
>> old configurations.  However we may need to be careful on adding more devices,
>> e.g. when some old configuration might work on old binaries, but I'm not sure.
>
> Btw, I think this state machine is indeed bringing some complexity on even
> understanding it - it is definitely working but it's not obvious to anyone at
> the first glance, and it's only solving problem for vIOMMU.  E.g., do we need
> yet another state machine for some other ordering constraints?
>
> From that POV, I don't like this solution more than the simple "assign priority
> for device realization" idea..

I wouldn't worry about other ordering constraints until we have them.
If you do, please tell!

I'm hoping you can't, because such implicit constraints are commonly
signs of oversimplified / screwed up machine modeling.



^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 4/4] vl: Prioritize realizations of devices
  2021-08-26 13:43                           ` Peter Xu
  2021-08-30 19:02                             ` Peter Xu
@ 2021-09-02  7:46                             ` Igor Mammedov
  1 sibling, 0 replies; 52+ messages in thread
From: Igor Mammedov @ 2021-09-02  7:46 UTC (permalink / raw)
  To: Peter Xu
  Cc: Daniel P . Berrangé,
	Eduardo Habkost, Michael S . Tsirkin, Jason Wang, qemu-devel,
	Markus Armbruster, Eric Auger, Alex Williamson, Paolo Bonzini,
	Dr . David Alan Gilbert, Wang Xingang

On Thu, 26 Aug 2021 09:43:59 -0400
Peter Xu <peterx@redhat.com> wrote:

> On Thu, Aug 26, 2021 at 01:36:29PM +0200, Igor Mammedov wrote:
> > On Thu, 26 Aug 2021 10:01:01 +0200
> > Markus Armbruster <armbru@redhat.com> wrote:
> >   
> > > Peter Xu <peterx@redhat.com> writes:
> > >   
> > > > On Wed, Aug 25, 2021 at 05:50:23PM -0400, Peter Xu wrote:    
> > > >> On Wed, Aug 25, 2021 at 02:28:55PM +0200, Markus Armbruster wrote:    
> > > >> > Having thought about this a bit more...  
> > ...  
> > > > Any further thoughts will be greatly welcomed.    
> > > 
> > > I'd like to propose a more modest solution: detect the problem and fail.  
> > That, or proper dependency tracking (which stands a chance to work with
> > QMP, but like it was said, it's complex)
> >   
> > > A simple state machine can track "has IOMMU" state.  It has three states
> > > "no so far", "yes", and "no", and two events "add IOMMU" and "add device
> > > that needs to know".  State diagram:
> > > 
> > >                           no so far
> > >                    +--- (start state) ---+
> > >                    |                     |
> > >          add IOMMU |                     | add device that
> > >                    |                     |  needs to know
> > >                    v                     v
> > >              +--> yes                    no <--+
> > >              |     |   add device that   |     |
> > >              +-----+    needs to know    +-----+
> > > 
> > > "Add IOMMU" in state "no" is an error.  
> > 
> > question is how we distinguish "device that needs to know"
> > from device that doesn't need to know, and then recently
> > added feature 'bypass IOMMU' adds more fun to this.  
> 
> Maybe we can start from "no device needs to know"? Then add more into it when
> the list expands.
> 
> vfio-pci should be a natural fit because we're sure it won't break any valid
> old configurations.  However we may need to be careful on adding more devices,
> e.g. when some old configuration might work on old binaries, but I'm not sure.

 
> Off-topic: I'm wondering whether bypass_iommu is just a work-around of not
> using iommu=pt in the guest cmdlines?  Is there any performance comparison of
> using bypass_iommu against having iommu but with iommu=pt?  At least intel
> iommu has pt enabled explicitly, i.e. it shouldn't even need a shadow iommu
> pgtable in the guest but only a single bit in device context entry showing that
> "this device wants to pass-through iommu", so I would expect the perf can be
> similar to explicitly bypass iommu in the qemu cmdline.
They wanted to have a mix of iommu and non-iommu devices in the VM
(the last merged revision was: [PATCH v5 0/9] IOMMU: Add support for IOMMU Bypass Feature),
but the 'why' reasoning was lost somewhere; CCing the author.


> 
> Thanks,
> 



^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 4/4] vl: Prioritize realizations of devices
  2021-08-30 19:02                             ` Peter Xu
  2021-08-31 11:35                               ` Markus Armbruster
@ 2021-09-02  8:26                               ` Igor Mammedov
  2021-09-02 13:45                                 ` Peter Xu
  1 sibling, 1 reply; 52+ messages in thread
From: Igor Mammedov @ 2021-09-02  8:26 UTC (permalink / raw)
  To: Peter Xu
  Cc: Daniel P . Berrangé,
	Eduardo Habkost, Michael S . Tsirkin, Jason Wang, qemu-devel,
	Markus Armbruster, Eric Auger, Alex Williamson, Paolo Bonzini,
	Dr . David Alan Gilbert

On Mon, 30 Aug 2021 15:02:53 -0400
Peter Xu <peterx@redhat.com> wrote:

> On Thu, Aug 26, 2021 at 09:43:59AM -0400, Peter Xu wrote:
> > > > A simple state machine can track "has IOMMU" state.  It has three states
> > > > "no so far", "yes", and "no", and two events "add IOMMU" and "add device
> > > > that needs to know".  State diagram:
> > > > 
> > > >                           no so far
> > > >                    +--- (start state) ---+
> > > >                    |                     |
> > > >          add IOMMU |                     | add device that
> > > >                    |                     |  needs to know
> > > >                    v                     v
> > > >              +--> yes                    no <--+
> > > >              |     |   add device that   |     |
> > > >              +-----+    needs to know    +-----+
> > > > 
> > > > "Add IOMMU" in state "no" is an error.  
> > > 
> > > question is how we distinguish "device that needs to know"
> > > from device that doesn't need to know, and then recently
> > > added feature 'bypass IOMMU' adds more fun to this.  
> > 
> > Maybe we can start from "no device needs to know"? Then add more into it when
> > the list expands.
> > 
> > vfio-pci should be a natural fit because we're sure it won't break any valid
> > old configurations.  However we may need to be careful on adding more devices,
> > e.g. when some old configuration might work on old binaries, but I'm not sure.  
> 
> Btw, I think this state machine is indeed bringing some complexity on even
> understanding it - it is definitely working but it's not obvious to anyone at
> the first glance, and it's only solving problem for vIOMMU.  E.g., do we need
> yet another state machine for some other ordering constraints?
>
> From that POV, I don't like this solution more than the simple "assign priority
> for device realization" idea..
It seems simple, but implicit dependencies are fragile (a good, or
I'd rather say bad, example of implicit dependencies is the vl.c magic code,
where changing the order of initialization can easily break QEMU,
which happens almost every time it's refactored),
and as Markus already mentioned it won't work in the QMP case.

What could work for both cases is explicit dependencies,
however it would be hard to describe such a dependency in this case,
where a device can work both with and without an IOMMU depending
on the settings of the bus it's attached to.

Have you considered another approach, i.e. instead of reordering,
changing the vfio-pci device model to reconfigure its DMA address space
after realize time (e.g. at reset time)?
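
As a rough illustration of what that could look like - the rebinding helper
below is assumed, not existing vfio code, and this would be called from the
device's reset handler:

    /* Sketch: resolve the DMA address space again at reset time instead
     * of only once at realize, so the vIOMMU no longer has to be
     * realized first.  vfio_pci_update_dma_as() is a hypothetical
     * helper that would rebind the device's container to the new
     * address space. */
    static void vfio_pci_recheck_dma_as(VFIOPCIDevice *vdev)
    {
        AddressSpace *as = pci_device_iommu_address_space(&vdev->pdev);

        /* By reset time every device (including a vIOMMU, if present)
         * has been realized, so this lookup gives the final answer. */
        vfio_pci_update_dma_as(vdev, as);
    }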




^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 4/4] vl: Prioritize realizations of devices
  2021-09-02  8:26                               ` Igor Mammedov
@ 2021-09-02 13:45                                 ` Peter Xu
  2021-09-02 13:53                                   ` Daniel P. Berrangé
                                                     ` (2 more replies)
  0 siblings, 3 replies; 52+ messages in thread
From: Peter Xu @ 2021-09-02 13:45 UTC (permalink / raw)
  To: Igor Mammedov
  Cc: Daniel P . Berrangé,
	Eduardo Habkost, Michael S . Tsirkin, Jason Wang, qemu-devel,
	Markus Armbruster, Eric Auger, Alex Williamson, Paolo Bonzini,
	Dr . David Alan Gilbert

On Thu, Sep 02, 2021 at 10:26:16AM +0200, Igor Mammedov wrote:
> On Mon, 30 Aug 2021 15:02:53 -0400
> Peter Xu <peterx@redhat.com> wrote:
> 
> > On Thu, Aug 26, 2021 at 09:43:59AM -0400, Peter Xu wrote:
> > > > > A simple state machine can track "has IOMMU" state.  It has three states
> > > > > "no so far", "yes", and "no", and two events "add IOMMU" and "add device
> > > > > that needs to know".  State diagram:
> > > > > 
> > > > >                           no so far
> > > > >                    +--- (start state) ---+
> > > > >                    |                     |
> > > > >          add IOMMU |                     | add device that
> > > > >                    |                     |  needs to know
> > > > >                    v                     v
> > > > >              +--> yes                    no <--+
> > > > >              |     |   add device that   |     |
> > > > >              +-----+    needs to know    +-----+
> > > > > 
> > > > > "Add IOMMU" in state "no" is an error.  
> > > > 
> > > > question is how we distinguish "device that needs to know"
> > > > from device that doesn't need to know, and then recently
> > > > added feature 'bypass IOMMU' adds more fun to this.  
> > > 
> > > Maybe we can start from "no device needs to know"? Then add more into it when
> > > the list expands.
> > > 
> > > vfio-pci should be a natural fit because we're sure it won't break any valid
> > > old configurations.  However we may need to be careful on adding more devices,
> > > e.g. when some old configuration might work on old binaries, but I'm not sure.  
> > 
> > Btw, I think this state machine is indeed bringing some complexity on even
> > understanding it - it is definitely working but it's not obvious to anyone at
> > the first glance, and it's only solving problem for vIOMMU.  E.g., do we need
> > yet another state machine for some other ordering constraints?
> >
> > From that POV, I don't like this solution more than the simple "assign priority
> > for device realization" idea..
> It seems simple but implicit dependencies are fragile (good or
> I'd rather say bad example implicit dependencies is vl.c magic code,
> where changing order of initialization can easily break QEMU,
> which happens almost every time it's refactored),

True.  I was never able to tell whether that comes from the natural
complexity of this piece of software, or whether there are indeed smarter moves.

> and as Markus already mentioned it won't work in QMP case.

I didn't ask before, but I do have a question about whether that's a real problem.

When the QMP interface is ready, we should have a final "preconfig done"
command, right?  I thought it was "x-exit-preconfig" now, but I'm not sure, so
I'm guessing.  If so, can we do the reordering there and postpone all device
creation, maybe?  Then the ordering properties can be applied there too.

The other thing is that I think only libvirt will use the QMP case even when
it's ready, and libvirt already needs to know the ordering of things like the
vIOMMU, vfio and pci buses - even if a new qemu supported the ordering of all
these, libvirt would still need to support old qemu binaries (and the code is
already there anyway), so it's fairly natural for it to initiate the QMP
commands using the same ordering.

IMHO that means ordering for QMP is not as important as for the cmdline, if
it's always initiated by other software.

> 
> What could work for both cases is explicit dependencies,
> however it would be hard to describe such dependency in this case,
> where device can work both with and without IOMMU depending
> on the bus settings it's attached to.
> 
> Have you considered another approach, i.e. instead of reordering,
> change vfio-pci device model to reconfigure DMA address space
> after realize time (ex: at reset time)?

Yes, I agree this seems to be the cleanest approach.  It may just need more
changes, and they'll be in vfio; I'm not aware of the rest (Jason should know
better about virtio/vdpa).

The other thing is that some dependencies cannot be resolved by this, e.g. the
pci bus issue where the pci bus is specified after the device that it owns.
But indeed we already fail early on that now, so it seems ok.

Side note: my gut feeling is still that at some point we shouldn't do the
early failure in the case of pci bus and device - the cmdline obviously
contains enough information on the dependency "this pci device needs that pci
bus", and the early failure does seem like a work-around to me just because
they were specified in the wrong order.

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 4/4] vl: Prioritize realizations of devices
  2021-09-02 13:45                                 ` Peter Xu
@ 2021-09-02 13:53                                   ` Daniel P. Berrangé
  2021-09-02 14:21                                     ` Peter Xu
  2021-09-02 15:26                                   ` Markus Armbruster
  2021-09-03 13:00                                   ` Igor Mammedov
  2 siblings, 1 reply; 52+ messages in thread
From: Daniel P. Berrangé @ 2021-09-02 13:53 UTC (permalink / raw)
  To: Peter Xu
  Cc: Eduardo Habkost, Michael S . Tsirkin, Jason Wang, qemu-devel,
	Markus Armbruster, Eric Auger, Alex Williamson, Paolo Bonzini,
	Igor Mammedov, Dr . David Alan Gilbert

On Thu, Sep 02, 2021 at 09:45:43AM -0400, Peter Xu wrote:
> On Thu, Sep 02, 2021 at 10:26:16AM +0200, Igor Mammedov wrote:
> > On Mon, 30 Aug 2021 15:02:53 -0400
> > Peter Xu <peterx@redhat.com> wrote:
> > 
> > > On Thu, Aug 26, 2021 at 09:43:59AM -0400, Peter Xu wrote:
> > > > > > A simple state machine can track "has IOMMU" state.  It has three states
> > > > > > "no so far", "yes", and "no", and two events "add IOMMU" and "add device
> > > > > > that needs to know".  State diagram:
> > > > > > 
> > > > > >                           no so far
> > > > > >                    +--- (start state) ---+
> > > > > >                    |                     |
> > > > > >          add IOMMU |                     | add device that
> > > > > >                    |                     |  needs to know
> > > > > >                    v                     v
> > > > > >              +--> yes                    no <--+
> > > > > >              |     |   add device that   |     |
> > > > > >              +-----+    needs to know    +-----+
> > > > > > 
> > > > > > "Add IOMMU" in state "no" is an error.  
> > > > > 
> > > > > question is how we distinguish "device that needs to know"
> > > > > from device that doesn't need to know, and then recently
> > > > > added feature 'bypass IOMMU' adds more fun to this.  
> > > > 
> > > > Maybe we can start from "no device needs to know"? Then add more into it when
> > > > the list expands.
> > > > 
> > > > vfio-pci should be a natural fit because we're sure it won't break any valid
> > > > old configurations.  However we may need to be careful on adding more devices,
> > > > e.g. when some old configuration might work on old binaries, but I'm not sure.  
> > > 
> > > Btw, I think this state machine is indeed bringing some complexity on even
> > > understanding it - it is definitely working but it's not obvious to anyone at
> > > the first glance, and it's only solving problem for vIOMMU.  E.g., do we need
> > > yet another state machine for some other ordering constraints?
> > >
> > > From that POV, I don't like this solution more than the simple "assign priority
> > > for device realization" idea..
> > It seems simple but implicit dependencies are fragile (good or
> > I'd rather say bad example implicit dependencies is vl.c magic code,
> > where changing order of initialization can easily break QEMU,
> > which happens almost every time it's refactored),
> 
> True.  I was never able of telling whether that comes from natural complexity
> of the piece of software or there's indeed some smarter moves.
> 
> > and as Markus already mentioned it won't work in QMP case.
> 
> I didn't ask before, but I do have a question on whether that's a real problem.
> 
> When QMP interface is ready, we should have a last phase of "preconfig done"
> command, right?  I thought it was "x-exit-preconfig" now, but I'm not sure so
> am guessing.  If so, can we do the reorder there and postpone all device
> creations, maybe?  Then the ordering properties can be applied there too.
> 
> The other thing is I think only libvirt will use the QMP case even if it'll be
> ready, and libvirt will need to know the ordering already like vIOMMU and vfio
> and pci buses - even if a new qemu supported the ordering of all these, libvirt
> still need to support old qemu binaries (and the code is already there anyway)
> and it's fairly natural they initiate the QMP commands using the same ordering.
> 
> IMHO it means ordering for QMP is not as important as cmdline if that's always
> initiated by another software.

The correct ordering of devices/backends is generally pretty obvious
for libvirt to determine. Most of the problems we've had related to
ordering are on the QEMU side, because the ARGV given to QEMU made
correct sense if parsed left-to-right, but QEMU didn't actually process
them in that order. We've repeatedly patched QEMU to hack around its
inability to honour the CLI order.

Being completely self-ordering on the QEMU side using a topological
sort would be neat from a conceptual purity POV, but that is quite a
challenge to implement and I'm not convinced it is worth it, compared
to other problems we want to spend time on.
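
(To be clear, the sort itself is the easy part - a throwaway sketch below,
over a made-up dependency table; the challenge is defining and maintaining
the dependency edges:)

    #include <stdbool.h>

    /* Kahn's algorithm over a toy dependency matrix: dep[i][j] means
     * "device i must be realized before device j".  Illustration only,
     * QEMU keeps no such table. */
    #define N_DEVS 4

    static void realize_in_topo_order(const bool dep[N_DEVS][N_DEVS])
    {
        int indegree[N_DEVS] = {0};
        bool done[N_DEVS] = {false};

        /* indegree[j] = number of devices that must precede j */
        for (int i = 0; i < N_DEVS; i++) {
            for (int j = 0; j < N_DEVS; j++) {
                indegree[j] += dep[i][j];
            }
        }
        for (int emitted = 0; emitted < N_DEVS; emitted++) {
            int pick = -1;
            for (int i = 0; i < N_DEVS; i++) {
                if (!done[i] && indegree[i] == 0) {
                    pick = i;
                    break;
                }
            }
            if (pick < 0) {
                /* cycle: no valid realization order exists */
                return;
            }
            done[pick] = true;
            /* realize device 'pick' here */
            for (int j = 0; j < N_DEVS; j++) {
                indegree[j] -= dep[pick][j];
            }
        }
    }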

If we don't want to keep hacking up the current QEMU system binaries'
parsing, then we need to just have new binaries with sane ordering,
or we need to willingly break compat in some staged manner.

Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 4/4] vl: Prioritize realizations of devices
  2021-09-02 13:53                                   ` Daniel P. Berrangé
@ 2021-09-02 14:21                                     ` Peter Xu
  2021-09-02 14:57                                       ` Markus Armbruster
  2021-09-02 15:06                                       ` Daniel P. Berrangé
  0 siblings, 2 replies; 52+ messages in thread
From: Peter Xu @ 2021-09-02 14:21 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: Eduardo Habkost, Michael S . Tsirkin, Jason Wang, qemu-devel,
	Markus Armbruster, Eric Auger, Alex Williamson, Paolo Bonzini,
	Igor Mammedov, Dr . David Alan Gilbert

Hi, Dan,

On Thu, Sep 02, 2021 at 02:53:00PM +0100, Daniel P. Berrangé wrote:
> The correct ordering of devices/backends is generally pretty obvious
> for libvirt to determine. Most of the problems we've had related to
> ordering are on the QEMU side, because the ARGV given to QEMU made
> correct sense if parsed left-to-right, but QEMU didn't actually process
> them in that order. We've patched QEMU to hack around its inability to
> honour the CLI order repeatedly.

Is there a pointer to the problem?

> 
> Being completely self-ordering on the QEMU side using a topological
> sort would be neat from a conceptual purity POV, but that is quite a
> challenge to implement and I'm not convinced it is worth it, compared
> to other problems we want to spend time on.

I just noticed there can also be dependencies between the buses; that indeed
cannot be fixed by ordering of classes, either as proposed in this series or
by introducing a new priority.

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 4/4] vl: Prioritize realizations of devices
  2021-09-02 14:21                                     ` Peter Xu
@ 2021-09-02 14:57                                       ` Markus Armbruster
  2021-09-03 15:48                                         ` Peter Xu
  2021-09-02 15:06                                       ` Daniel P. Berrangé
  1 sibling, 1 reply; 52+ messages in thread
From: Markus Armbruster @ 2021-09-02 14:57 UTC (permalink / raw)
  To: Peter Xu
  Cc: Daniel P. Berrangé,
	Eduardo Habkost, Michael S . Tsirkin, Jason Wang, qemu-devel,
	Dr . David Alan Gilbert, Eric Auger, Alex Williamson,
	Paolo Bonzini, Igor Mammedov

Peter Xu <peterx@redhat.com> writes:

> Hi, Dan,
>
> On Thu, Sep 02, 2021 at 02:53:00PM +0100, Daniel P. Berrangé wrote:
>> The correct ordering of devices/backends is generally pretty obvious
>> for libvirt to determine. Most of the problems we've had related to
>> ordering are on the QEMU side, because the ARGV given to QEMU made
>> correct sense if parsed left-to-right, but QEMU didn't actually process
>> them in that order. We've patched QEMU to hack around its inability to
>> honour the CLI order repeatedly.
>
> Is there a pointer to the problem?

Just an example:

9ea18ed25a "vl: Fix -drive / -blockdev persistent reservation management
cda4aa9a5a "vl: Create block backends before setting machine properties"

>> Being completely self-ordering on the QEMU side using a topological
>> sort would be neat from a conceptual purity POV, but that is quite a
>> challenge to implement and I'm not convinced it is worth it, compared
>> to other problems we want to spend time on.
>
> I just noticed there can also be dependency between the buses; that cannot be
> fixed by ordering of classes indeed as either proposed in this series, or
> introduce a new priority.

--verbose?



^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 4/4] vl: Prioritize realizations of devices
  2021-09-02 14:21                                     ` Peter Xu
  2021-09-02 14:57                                       ` Markus Armbruster
@ 2021-09-02 15:06                                       ` Daniel P. Berrangé
  1 sibling, 0 replies; 52+ messages in thread
From: Daniel P. Berrangé @ 2021-09-02 15:06 UTC (permalink / raw)
  To: Peter Xu
  Cc: Eduardo Habkost, Michael S . Tsirkin, Jason Wang, qemu-devel,
	Markus Armbruster, Eric Auger, Alex Williamson, Paolo Bonzini,
	Igor Mammedov, Dr . David Alan Gilbert

On Thu, Sep 02, 2021 at 10:21:19AM -0400, Peter Xu wrote:
> Hi, Dan,
> 
> On Thu, Sep 02, 2021 at 02:53:00PM +0100, Daniel P. Berrangé wrote:
> > The correct ordering of devices/backends is generally pretty obvious
> > for libvirt to determine. Most of the problems we've had related to
> > ordering are on the QEMU side, because the ARGV given to QEMU made
> > correct sense if parsed left-to-right, but QEMU didn't actually process
> > them in that order. We've patched QEMU to hack around its inability to
> > honour the CLI order repeatedly.
> 
> Is there a pointer to the problem?

The biggest problem was with -object, where some objects need to be
created before chardevs and some created after chardevs.
See the object_create_early() method in softmmu/vl.c for the gross
hardcoded hacks.


Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 4/4] vl: Prioritize realizations of devices
  2021-09-02 13:45                                 ` Peter Xu
  2021-09-02 13:53                                   ` Daniel P. Berrangé
@ 2021-09-02 15:26                                   ` Markus Armbruster
  2021-09-03 13:00                                   ` Igor Mammedov
  2 siblings, 0 replies; 52+ messages in thread
From: Markus Armbruster @ 2021-09-02 15:26 UTC (permalink / raw)
  To: Peter Xu
  Cc: Daniel P . Berrangé,
	Eduardo Habkost, Michael S . Tsirkin, Jason Wang, qemu-devel,
	Dr . David Alan Gilbert, Eric Auger, Alex Williamson,
	Paolo Bonzini, Igor Mammedov

Peter Xu <peterx@redhat.com> writes:

> On Thu, Sep 02, 2021 at 10:26:16AM +0200, Igor Mammedov wrote:
>> On Mon, 30 Aug 2021 15:02:53 -0400
>> Peter Xu <peterx@redhat.com> wrote:
>> 
>> > On Thu, Aug 26, 2021 at 09:43:59AM -0400, Peter Xu wrote:
>> > > > > A simple state machine can track "has IOMMU" state.  It has three states
>> > > > > "no so far", "yes", and "no", and two events "add IOMMU" and "add device
>> > > > > that needs to know".  State diagram:
>> > > > > 
>> > > > >                           no so far
>> > > > >                    +--- (start state) ---+
>> > > > >                    |                     |
>> > > > >          add IOMMU |                     | add device that
>> > > > >                    |                     |  needs to know
>> > > > >                    v                     v
>> > > > >              +--> yes                    no <--+
>> > > > >              |     |   add device that   |     |
>> > > > >              +-----+    needs to know    +-----+
>> > > > > 
>> > > > > "Add IOMMU" in state "no" is an error.  
>> > > > 
>> > > > question is how we distinguish "device that needs to know"
>> > > > from device that doesn't need to know, and then recently
>> > > > added feature 'bypass IOMMU' adds more fun to this.  
>> > > 
>> > > Maybe we can start from "no device needs to know"? Then add more into it when
>> > > the list expands.
>> > > 
>> > > vfio-pci should be a natural fit because we're sure it won't break any valid
>> > > old configurations.  However we may need to be careful on adding more devices,
>> > > e.g. when some old configuration might work on old binaries, but I'm not sure.  
>> > 
>> > Btw, I think this state machine is indeed bringing some complexity on even
>> > understanding it - it is definitely working but it's not obvious to anyone at
>> > the first glance, and it's only solving problem for vIOMMU.  E.g., do we need
>> > yet another state machine for some other ordering constraints?
>> >
>> > From that POV, I don't like this solution more than the simple "assign priority
>> > for device realization" idea..
>> It seems simple but implicit dependencies are fragile (good or
>> I'd rather say bad example implicit dependencies is vl.c magic code,
>> where changing order of initialization can easily break QEMU,
>> which happens almost every time it's refactored),
>
> True.  I was never able of telling whether that comes from natural complexity
> of the piece of software or there's indeed some smarter moves.
>
>> and as Markus already mentioned it won't work in QMP case.
>
> I didn't ask before, but I do have a question on whether that's a real problem.
>
> When QMP interface is ready, we should have a last phase of "preconfig done"
> command, right?  I thought it was "x-exit-preconfig" now, but I'm not sure so
> am guessing.  If so, can we do the reorder there and postpone all device
> creations, maybe?  Then the ordering properties can be applied there too.

This would regress error reporting.

A significant advantage of doing configuration in basic steps is the
relative ease of pinpointing what exactly fails.

If you change device_add to merely record the request for later, you
delay the non-trivial failures until later.

Compare:

    -> {"execute": "device_add", "arguments": {"driver": "virtio-blk", "drive": "foo"}}
    <- {"error": {"class": "GenericError", "desc": "Device needs media, but drive is empty"}}

to getting the same error in reply to x-exit-preconfig, with a dozen or
two device_add queued up.

The error comes from virtio_blk_device_realize(), which you propose to
delay until x-exit-preconfig.

In theory, we can rework all the errors in question to provide
sufficient context, and rework libvirt to pick the context apart, so it
can pinpoint what's wrong in the user's configuration.  In practice,
thanks, but no thanks.

> The other thing is I think only libvirt will use the QMP case even if it'll be
> ready, and libvirt will need to know the ordering already like vIOMMU and vfio
> and pci buses - even if a new qemu supported the ordering of all these, libvirt
> still need to support old qemu binaries (and the code is already there anyway)
> and it's fairly natural they initiate the QMP commands using the same ordering.

This is actually a pretty strong argument for not having QEMU reorder at
all.

> IMHO it means ordering for QMP is not as important as cmdline if that's always
> initiated by another software.
>
>> 
>> What could work for both cases is explicit dependencies,
>> however it would be hard to describe such dependency in this case,
>> where device can work both with and without IOMMU depending
>> on the bus settings it's attached to.
>> 
>> Have you considered another approach, i.e. instead of reordering,
>> change vfio-pci device model to reconfigure DMA address space
>> after realize time (ex: at reset time)?
>
> Yes, I agree this seems to be the cleanest appproach.  It may just need more
> changes and they'll be on vfio, and I'm not aware of the rest (Jason should
> know better on virtio/vdpa).
>
> The other thing is some dependency cannot be resolved by this, e.g., the pci
> bus issue where we put the pci bus to be after the device that it owns.  But
> indeed we're already early-fail that now so it seems ok.

Yes, that's a complete non-problem :)

> Side note: from my gut feeling I still think at some point we shouldn't do the
> early-failure for cases in the case of pci bus and device - the cmdline
> obviously contains enough information on the dependency on "this pci device
> needs that pci bus", and early-fail does seem to be an work-around to me just
> because they specified in the wrong order.

Again, there are two sane command line evaluation orders: (1) strictly
left to right, and (2) order doesn't matter.  QEMU of course does (3)
unpredictable for humans without referring back to the source code.

The difficulty with getting to (1) is mostly compatibility and putting
in the work.  Reporting nicely and reliably when things don't get created
in time is yet more work.  Not particularly challenging, just work.

Getting to a reliable version of (2) feels like a pipe dream to me.



^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 4/4] vl: Prioritize realizations of devices
  2021-09-02 13:45                                 ` Peter Xu
  2021-09-02 13:53                                   ` Daniel P. Berrangé
  2021-09-02 15:26                                   ` Markus Armbruster
@ 2021-09-03 13:00                                   ` Igor Mammedov
  2021-09-03 16:03                                     ` Peter Xu
  2 siblings, 1 reply; 52+ messages in thread
From: Igor Mammedov @ 2021-09-03 13:00 UTC (permalink / raw)
  To: Peter Xu
  Cc: Daniel P . Berrangé,
	Eduardo Habkost, Michael S . Tsirkin, Jason Wang, qemu-devel,
	Markus Armbruster, Eric Auger, Alex Williamson, Paolo Bonzini,
	Dr . David Alan Gilbert

On Thu, 2 Sep 2021 09:45:43 -0400
Peter Xu <peterx@redhat.com> wrote:

> On Thu, Sep 02, 2021 at 10:26:16AM +0200, Igor Mammedov wrote:
> > On Mon, 30 Aug 2021 15:02:53 -0400
> > Peter Xu <peterx@redhat.com> wrote:
> >   
> > > On Thu, Aug 26, 2021 at 09:43:59AM -0400, Peter Xu wrote:  
> > > > > > A simple state machine can track "has IOMMU" state.  It has three states
> > > > > > "no so far", "yes", and "no", and two events "add IOMMU" and "add device
> > > > > > that needs to know".  State diagram:
> > > > > > 
> > > > > >                           no so far
> > > > > >                    +--- (start state) ---+
> > > > > >                    |                     |
> > > > > >          add IOMMU |                     | add device that
> > > > > >                    |                     |  needs to know
> > > > > >                    v                     v
> > > > > >              +--> yes                    no <--+
> > > > > >              |     |   add device that   |     |
> > > > > >              +-----+    needs to know    +-----+
> > > > > > 
> > > > > > "Add IOMMU" in state "no" is an error.    
> > > > > 
> > > > > question is how we distinguish "device that needs to know"
> > > > > from device that doesn't need to know, and then recently
> > > > > added feature 'bypass IOMMU' adds more fun to this.    
> > > > 
> > > > Maybe we can start from "no device needs to know"? Then add more into it when
> > > > the list expands.
> > > > 
> > > > vfio-pci should be a natural fit because we're sure it won't break any valid
> > > > old configurations.  However we may need to be careful on adding more devices,
> > > > e.g. when some old configuration might work on old binaries, but I'm not sure.    
> > > 
> > > Btw, I think this state machine is indeed bringing some complexity on even
> > > understanding it - it is definitely working but it's not obvious to anyone at
> > > the first glance, and it's only solving problem for vIOMMU.  E.g., do we need
> > > yet another state machine for some other ordering constraints?
> > >
> > > From that POV, I don't like this solution more than the simple "assign priority
> > > for device realization" idea..  
> > It seems simple but implicit dependencies are fragile (good or
> > I'd rather say bad example implicit dependencies is vl.c magic code,
> > where changing order of initialization can easily break QEMU,
> > which happens almost every time it's refactored),  
> 
> True.  I was never able of telling whether that comes from natural complexity
> of the piece of software or there's indeed some smarter moves.
> 
> > and as Markus already mentioned it won't work in QMP case.  
> 
> I didn't ask before, but I do have a question on whether that's a real problem.
> 
> When QMP interface is ready, we should have a last phase of "preconfig done"
> command, right?  I thought it was "x-exit-preconfig" now, but I'm not sure so
> am guessing.  If so, can we do the reorder there and postpone all device
> creations, maybe?  Then the ordering properties can be applied there too.

I don't think "x-exit-preconfig" would help here, ideally devices would be
created before this stage, interactively via QMP.
So whoever is calling device_add, would need to know the proper ordering
or get error as result.

And even if we had explicit dependencies, they would be easier to use
with current error-out policy if something is out of expected order,
and tell user to fix CLI/QMP command order (and replace adhoc checks
use currently).


> The other thing is I think only libvirt will use the QMP case even if it'll be
> ready, and libvirt will need to know the ordering already like vIOMMU and vfio
> and pci buses - even if a new qemu supported the ordering of all these, libvirt
> still need to support old qemu binaries (and the code is already there anyway)
> and it's fairly natural they initiate the QMP commands using the same ordering.
> 
> IMHO it means ordering for QMP is not as important as cmdline if that's always
> initiated by another software.

PS:
If I'm not wrong, we are slowly working towards composing the machine
from QMP, with the CLI ideally becoming a shim on top of that. Hence QMP
being mentioned in this thread, and the desire to avoid more magic.


> > What could work for both cases is explicit dependencies,
> > however it would be hard to describe such dependency in this case,
> > where device can work both with and without IOMMU depending
> > on the bus settings it's attached to.
> > 
> > Have you considered another approach, i.e. instead of reordering,
> > change vfio-pci device model to reconfigure DMA address space
> > after realize time (ex: at reset time)?  
> 
> Yes, I agree this seems to be the cleanest appproach.  It may just need more
> changes and they'll be on vfio, and I'm not aware of the rest (Jason should
> know better on virtio/vdpa).

It's also a bit closer "to the real world", where the IOMMU dependency is
configured by firmware/OS. In addition it doesn't require any machine-specific
code: a device checks whether its parent bus is managed by an IOMMU and is
configured according to that.
 
> The other thing is some dependency cannot be resolved by this, e.g., the pci
> bus issue where we put the pci bus to be after the device that it owns.  But
> indeed we're already early-fail that now so it seems ok.
> 
> Side note: from my gut feeling I still think at some point we shouldn't do the
> early-failure for cases in the case of pci bus and device - the cmdline
> obviously contains enough information on the dependency on "this pci device
> needs that pci bus", and early-fail does seem to be an work-around to me just
> because they specified in the wrong order.
That's how the QEMU CLI works most of the time (i.e. left-to-right order of
initialization), where it errors out on the offending CLI option
and tells the user to fix it instead of trying to fix up the CLI order.
So I think the current pci_bus behavior does align with that.

PS:
Another, albeit machine-dependent, approach to resolving the IOMMU ordering
problem could be adding IOMMU handling to a machine-specific pre_plug hook.
It would be called at IOMMU realize time and check whether the existing buses
without bypass enabled (i.e. iommu-managed) have any children. If they
have devices attached, error out telling the user to reorder '-device iommu'
before the affected devices/bus.
It should cover the mixed IOMMU+bypass case and doesn't require fixing the
vfio-pci address space initialization nor defining any priorities
for PCI devices.

(but I think it's more of a hack compared to the earlier suggested
address space initialization at reset time, and it would need to be
done for every affected machine)

> Thanks,
> 



^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 4/4] vl: Prioritize realizations of devices
  2021-09-02 14:57                                       ` Markus Armbruster
@ 2021-09-03 15:48                                         ` Peter Xu
  0 siblings, 0 replies; 52+ messages in thread
From: Peter Xu @ 2021-09-03 15:48 UTC (permalink / raw)
  To: Markus Armbruster
  Cc: Daniel P. Berrangé,
	Eduardo Habkost, Michael S . Tsirkin, Jason Wang, qemu-devel,
	Dr . David Alan Gilbert, Eric Auger, Alex Williamson,
	Paolo Bonzini, Igor Mammedov

On Thu, Sep 02, 2021 at 04:57:04PM +0200, Markus Armbruster wrote:
> Peter Xu <peterx@redhat.com> writes:
> 
> > Hi, Dan,
> >
> > On Thu, Sep 02, 2021 at 02:53:00PM +0100, Daniel P. Berrangé wrote:
> >> The correct ordering of devices/backends is generally pretty obvious
> >> for libvirt to determine. Most of the problems we've had related to
> >> ordering are on the QEMU side, because the ARGV given to QEMU made
> >> correct sense if parsed left-to-right, but QEMU didn't actually process
> >> them in that order. We've patched QEMU to hack around its inability to
> >> honour the CLI order repeatedly.
> >
> > Is there a pointer to the problem?
> 
> Just an example:
> 
> 9ea18ed25a "vl: Fix -drive / -blockdev persistent reservation management
> cda4aa9a5a "vl: Create block backends before setting machine properties"

Thanks, same to Dan.

> 
> >> Being completely self-ordering on the QEMU side using a topological
> >> sort would be neat from a conceptual purity POV, but that is quite a
> >> challenge to implement and I'm not convinced it is worth it, compared
> >> to other problems we want to spend time on.
> >
> > I just noticed there can also be dependency between the buses; that cannot be
> > fixed by ordering of classes indeed as either proposed in this series, or
> > introduce a new priority.
> 
> --verbose?

Please ignore it - I just started to realize what kind of a rabbit hole I'm
digging, and I am already prepared to run away. :)

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 4/4] vl: Prioritize realizations of devices
  2021-09-03 13:00                                   ` Igor Mammedov
@ 2021-09-03 16:03                                     ` Peter Xu
  2021-09-06  8:49                                       ` Igor Mammedov
  0 siblings, 1 reply; 52+ messages in thread
From: Peter Xu @ 2021-09-03 16:03 UTC (permalink / raw)
  To: Igor Mammedov
  Cc: Daniel P . Berrangé,
	Eduardo Habkost, Michael S . Tsirkin, Jason Wang, qemu-devel,
	Markus Armbruster, Eric Auger, Alex Williamson, Paolo Bonzini,
	Dr . David Alan Gilbert

On Fri, Sep 03, 2021 at 03:00:05PM +0200, Igor Mammedov wrote:
> PS:
> Another, albeit machine depended approach to resolve IOMMU ordering problem
> can be adding to a specific machine  pre_plug hook, an IOMMU handling.
> Which is called during IOMMU realize time and check if existing buses
> without bypass enabled (iommu managed) have any children. And if they
> have devices attached, error out telling user to reorder '-device iommu'
> before affected devices/bus.
> It should cover mixed IOMMU+bypass case and doesn't require fixing
> vfio-pci address space initialization nor defining any priorities
> for PCI devices.

This sounds appealing among the approaches.

Does it need to be a pre_plug hook?  I thought we might just need a flag in
the pci device classes showing that they should come after vIOMMUs; then in
the vIOMMU realize functions we walk the pci bus to make sure no such device
exists?

We could have a base vIOMMU class, so that check could live in the realize()
of the common class.
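
Something along those lines, as a rough sketch - the class flag and both
helpers are invented for illustration:

    /* Hypothetical: PCIDeviceClass grows a 'needs_iommu_ordering' flag,
     * set by devices such as vfio-pci.  At vIOMMU realize time we walk
     * the bus and refuse configurations where such a device was
     * realized first. */
    static void check_dev(PCIBus *bus, PCIDevice *dev, void *opaque)
    {
        Error **errp = opaque;

        /* 'needs_iommu_ordering' is the assumed new class field */
        if (!*errp && PCI_DEVICE_GET_CLASS(dev)->needs_iommu_ordering) {
            error_setg(errp, "device %s must be specified after the vIOMMU",
                       object_get_typename(OBJECT(dev)));
        }
    }

    /* Called from the common vIOMMU realize(). */
    static void viommu_check_ordering(PCIBus *bus, Error **errp)
    {
        pci_for_each_device(bus, pci_bus_num(bus), check_dev, errp);
    }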

> 
> (but I think it's more a hack compared earlier suggested
> address space initialization at reset time, and it would need to be
> done for every affected machine)

Agreed.

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 4/4] vl: Prioritize realizations of devices
  2021-09-03 16:03                                     ` Peter Xu
@ 2021-09-06  8:49                                       ` Igor Mammedov
  0 siblings, 0 replies; 52+ messages in thread
From: Igor Mammedov @ 2021-09-06  8:49 UTC (permalink / raw)
  To: Peter Xu
  Cc: Daniel P . Berrangé,
	Eduardo Habkost, Michael S . Tsirkin, Jason Wang, qemu-devel,
	Markus Armbruster, Eric Auger, Alex Williamson, Paolo Bonzini,
	Dr . David Alan Gilbert

On Fri, 3 Sep 2021 12:03:06 -0400
Peter Xu <peterx@redhat.com> wrote:

> On Fri, Sep 03, 2021 at 03:00:05PM +0200, Igor Mammedov wrote:
> > PS:
> > Another, albeit machine depended approach to resolve IOMMU ordering problem
> > can be adding to a specific machine  pre_plug hook, an IOMMU handling.
> > Which is called during IOMMU realize time and check if existing buses
> > without bypass enabled (iommu managed) have any children. And if they
> > have devices attached, error out telling user to reorder '-device iommu'
> > before affected devices/bus.
> > It should cover mixed IOMMU+bypass case and doesn't require fixing
> > vfio-pci address space initialization nor defining any priorities
> > for PCI devices.  
> 
> This sounds appealing among the approaches.

That's the easy one, compared to moving the address space (re)initialization
to reset time (at least to me, since vfio realize looks intimidating at
first glance, but its maintainer(s) should know enough to implement the
change properly).

 
> Does it need to be a pre_plug hook?  I thought we might just need a flag in the
> pci device classes showing that it should be after vIOMMUs, then in vIOMMU
> realize functions we walk pci bus to make sure no such device exist?
> 
> We could have a base vIOMMU class, then that could be in the realize() of the
> common class.

We basically don't know whether a device needs the IOMMU or not; it can work
with/without it just fine. In this case I'd think about the IOMMU as a board
feature that morphs (some of the) PCI buses (address space, bus numbers, ...).
So I don't perceive any iommu flag as a device property at all.

As for realize vs pre_plug, the latter is part of abstract realize
(see: device_set_realized) and is already used by some PCI infrastructure:
  ex: pcie_cap_slot_pre_plug_cb/spapr_pci_pre_plug

Its purpose is to check pre-conditions and possibly pre-configure some
wiring on behalf of the device's parent hot-plug handler (bus owner/machine),
and fail cleanly if something is wrong, without leaving side effects.

See 0ed48fd32eb8 for the boilerplate required to set up a custom hot-plug
handler.  You might need only parts of it, but it's still something that has
to be done for each affected machine type, to implement the error checking
at the proper layer.

So I'd rather look into the 'reset' approach, and only resort to adding a
pre_plug/error check if that doesn't look possible.


PS:
 your d2321d31ff98b & c6cbc29d36f look to me like another candidate
 for pre_plug for pci devices, instead of adding a dedicated hook
 just for vfio-pci to the generic machine.
 
> > (but I think it's more a hack compared earlier suggested
> > address space initialization at reset time, and it would need to be
> > done for every affected machine)  
> 
> Agreed.
> 



^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 4/4] vl: Prioritize realizations of devices
  2021-08-25  4:23               ` Jason Wang
@ 2021-09-06  9:22                 ` Eric Auger
  0 siblings, 0 replies; 52+ messages in thread
From: Eric Auger @ 2021-09-06  9:22 UTC (permalink / raw)
  To: Jason Wang, Peter Xu
  Cc: Daniel P . Berrangé,
	Eduardo Habkost, Michael S. Tsirkin, David Hildenbrand,
	qemu-devel, Markus Armbruster, Alex Williamson, Paolo Bonzini,
	Dr . David Alan Gilbert

Hi,

On 8/25/21 6:23 AM, Jason Wang wrote:
> On Tue, Aug 24, 2021 at 11:50 PM Peter Xu <peterx@redhat.com> wrote:
>> On Tue, Aug 24, 2021 at 10:52:24AM +0800, Jason Wang wrote:
>>> It looks to me this doesn't solve the issue of using virtio-mmio with vhost?
>> No IOMMU supported for any of the MMIO devices, right?  Or am I wrong?  Thanks,
> I guess virtio IOMMU has that support, but I might be wrong.

No, that's not supported yet in QEMU. There is an in-flight contribution
attempting to support vSMMUv3 protection for sysbus devices, though:
[PATCH v6 0/4] hw/arm/smmuv3: Support non PCI/PCIe devices

Thanks

Eric
>
> Thanks
>
>> --
>> Peter Xu
>>



^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 0/4] vl: Prioritize device realizations
  2021-08-18 19:42 [PATCH 0/4] vl: Prioritize device realizations Peter Xu
                   ` (3 preceding siblings ...)
  2021-08-18 19:43 ` [PATCH 4/4] vl: Prioritize realizations of devices Peter Xu
@ 2021-10-20 13:44 ` David Hildenbrand
  2021-10-20 13:48   ` Daniel P. Berrangé
  2021-10-21  4:20   ` Peter Xu
  4 siblings, 2 replies; 52+ messages in thread
From: David Hildenbrand @ 2021-10-20 13:44 UTC (permalink / raw)
  To: Peter Xu, qemu-devel
  Cc: Daniel P . Berrangé,
	Eduardo Habkost, Michael S . Tsirkin, Jason Wang,
	Markus Armbruster, Dr . David Alan Gilbert, Eric Auger,
	Alex Williamson, Paolo Bonzini

On 18.08.21 21:42, Peter Xu wrote:
> This is a long pending issue that we haven't fixed.  The issue is in QEMU we
> have implicit device ordering requirement when realizing, otherwise some of the
> device may not work properly.
> 
> The initial requirement comes from when vfio-pci starts to work with vIOMMUs.
> To make sure vfio-pci will get the correct DMA address space, the vIOMMU device
> needs to be created before vfio-pci otherwise vfio-pci will stop working when
> the guest enables the vIOMMU and the device at the same time.
> 
> AFAIU Libvirt should have code that guarantees that.  For QEMU cmdline users,
> they need to pay attention or things will stop working at some point.
> 
> Recently there's a growing and similar requirement on vDPA.  It's not a hard
> requirement so far but vDPA has patches that try to workaround this issue.
> 
> This patchset allows us to realize the devices in the order that e.g. platform
> devices will be created first (bus device, IOMMU, etc.), then the rest of
> normal devices.  It's done simply by ordering the QemuOptsList of "device"
> entries before realization.  The priority so far comes from migration
> priorities which could be a little bit odd, but that's really about the same
> problem and we can clean that part up in the future.
> 
> Libvirt can still keep its ordering for sure so old QEMU will still work,
> however that won't be needed for new qemus after this patchset, so with the new
> binary we should be able to specify qemu cmdline as wish on '-device'.
> 
> Logically this should also work for vDPA and the workaround code can be done
> with more straightforward approaches.
> 
> Please review, thanks.

Hi Peter, looks like I have another use case:

vhost devices can heavily restrict the number of available memslots:
e.g., upstream KVM ~64k, vhost-user usually 32 (!). With virtio-mem
intending to make use of multiple memslots [1] and auto-detecting how
many to use based on the currently available memslots when plugging and
realizing the virtio-mem device, this implies that realizing vhost
devices (especially vhost-user devices) after virtio-mem devices can
similarly result in issues: when trying to realize the vhost device
with restricted memslots, QEMU will bail out.

So similarly, we'd want to realize any vhost-* before any virtio-mem device.

Do you have any updated version of this patchset? Thanks!

[1] https://lkml.kernel.org/r/20211013103330.26869-1-david@redhat.com


-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 0/4] vl: Prioritize device realizations
  2021-10-20 13:44 ` [PATCH 0/4] vl: Prioritize device realizations David Hildenbrand
@ 2021-10-20 13:48   ` Daniel P. Berrangé
  2021-10-20 13:58     ` David Hildenbrand
  2021-10-21  4:20   ` Peter Xu
  1 sibling, 1 reply; 52+ messages in thread
From: Daniel P. Berrangé @ 2021-10-20 13:48 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Eduardo Habkost, Michael S . Tsirkin, Jason Wang,
	Markus Armbruster, Peter Xu, qemu-devel, Eric Auger,
	Alex Williamson, Paolo Bonzini, Dr . David Alan Gilbert

On Wed, Oct 20, 2021 at 03:44:08PM +0200, David Hildenbrand wrote:
> On 18.08.21 21:42, Peter Xu wrote:
> > This is a long pending issue that we haven't fixed.  The issue is in QEMU we
> > have implicit device ordering requirement when realizing, otherwise some of the
> > device may not work properly.
> > 
> > The initial requirement comes from when vfio-pci starts to work with vIOMMUs.
> > To make sure vfio-pci will get the correct DMA address space, the vIOMMU device
> > needs to be created before vfio-pci otherwise vfio-pci will stop working when
> > the guest enables the vIOMMU and the device at the same time.
> > 
> > AFAIU Libvirt should have code that guarantees that.  For QEMU cmdline users,
> > they need to pay attention or things will stop working at some point.
> > 
> > Recently there's a growing and similar requirement on vDPA.  It's not a hard
> > requirement so far but vDPA has patches that try to workaround this issue.
> > 
> > This patchset allows us to realize the devices in the order that e.g. platform
> > devices will be created first (bus device, IOMMU, etc.), then the rest of
> > normal devices.  It's done simply by ordering the QemuOptsList of "device"
> > entries before realization.  The priority so far comes from migration
> > priorities which could be a little bit odd, but that's really about the same
> > problem and we can clean that part up in the future.
> > 
> > Libvirt can still keep its ordering for sure so old QEMU will still work,
> > however that won't be needed for new qemus after this patchset, so with the new
> > binary we should be able to specify qemu cmdline as wish on '-device'.
> > 
> > Logically this should also work for vDPA and the workaround code can be done
> > with more straightforward approaches.
> > 
> > Please review, thanks.
> 
> Hi Peter, looks like I have another use case:
> 
> vhost devices can heavily restrict the number of available memslots:
> e.g., upstream KVM ~64k, vhost-user usually 32 (!). With virtio-mem
> intending to make use of multiple memslots [1] and auto-detecting how
> many to use based on currently avilable memslots when plugging and
> realizing the virtio-mem device, this implies that realizing vhost
> devices (especially vhost-user device) after virtio-mem devices can
> similarly result in issues: when trying realization of the vhost device
> with restricted memslots, QEMU will bail out.
> 
> So similarly, we'd want to realize any vhost-* before any virtio-mem device.

Ordering virtio-mem vs vhost-* devices doesn't feel like a good
solution to this problem. E.g. if you start a guest with several
vhost-* devices and virtio-mem then auto-decides to use all/most
remaining memslots, we've now surely broken the ability to then
hotplug more vhost-* devices at runtime by not leaving memslots
for them.

I think virtio-mem configuration needs to be stable in its memslot
usage regardless of how many other types of devices are present,
and not auto-adjust how many it consumes.

Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 0/4] vl: Prioritize device realizations
  2021-10-20 13:48   ` Daniel P. Berrangé
@ 2021-10-20 13:58     ` David Hildenbrand
  0 siblings, 0 replies; 52+ messages in thread
From: David Hildenbrand @ 2021-10-20 13:58 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: Eduardo Habkost, Michael S . Tsirkin, Jason Wang,
	Markus Armbruster, Peter Xu, qemu-devel, Eric Auger,
	Alex Williamson, Paolo Bonzini, Dr . David Alan Gilbert

On 20.10.21 15:48, Daniel P. Berrangé wrote:
> On Wed, Oct 20, 2021 at 03:44:08PM +0200, David Hildenbrand wrote:
>> On 18.08.21 21:42, Peter Xu wrote:
>>> This is a long pending issue that we haven't fixed.  The issue is in QEMU we
>>> have implicit device ordering requirement when realizing, otherwise some of the
>>> device may not work properly.
>>>
>>> The initial requirement comes from when vfio-pci starts to work with vIOMMUs.
>>> To make sure vfio-pci will get the correct DMA address space, the vIOMMU device
>>> needs to be created before vfio-pci otherwise vfio-pci will stop working when
>>> the guest enables the vIOMMU and the device at the same time.
>>>
>>> AFAIU Libvirt should have code that guarantees that.  For QEMU cmdline users,
>>> they need to pay attention or things will stop working at some point.
>>>
>>> Recently there's a growing and similar requirement on vDPA.  It's not a hard
>>> requirement so far but vDPA has patches that try to workaround this issue.
>>>
>>> This patchset allows us to realize the devices in the order that e.g. platform
>>> devices will be created first (bus device, IOMMU, etc.), then the rest of
>>> normal devices.  It's done simply by ordering the QemuOptsList of "device"
>>> entries before realization.  The priority so far comes from migration
>>> priorities which could be a little bit odd, but that's really about the same
>>> problem and we can clean that part up in the future.
>>>
>>> Libvirt can still keep its ordering for sure so old QEMU will still work,
>>> however that won't be needed for new qemus after this patchset, so with the new
>>> binary we should be able to specify qemu cmdline as wish on '-device'.
>>>
>>> Logically this should also work for vDPA and the workaround code can be done
>>> with more straightforward approaches.
>>>
>>> Please review, thanks.
>>
>> Hi Peter, looks like I have another use case:
>>
>> vhost devices can heavily restrict the number of available memslots:
>> e.g., upstream KVM ~64k, vhost-user usually 32 (!). With virtio-mem
>> intending to make use of multiple memslots [1] and auto-detecting how
>> many to use based on currently available memslots when plugging and
>> realizing the virtio-mem device, this implies that realizing vhost
>> devices (especially vhost-user device) after virtio-mem devices can
>> similarly result in issues: when trying realization of the vhost device
>> with restricted memslots, QEMU will bail out.
>>
>> So similarly, we'd want to realize any vhost-* before any virtio-mem device.
> 
> Ordering virtio-mem vs vhost-* devices doesn't feel like a good
> solution to this problem. E.g. if you start a guest with several
> vhost-* devices and virtio-mem then auto-decides to use all/most
> remaining memslots, we've now surely broken the ability to
> hotplug more vhost-* devices at runtime by not leaving memslots
> for them.

You can hotplug vhost-* devices as you want; they don't
"consume" memslots, they can only restrict the total number of
memslots if they support fewer.

We have this situation today already:

Coldplug/hotplug > 32 DIMMs to a VM. Then hotplug a vhost-user device
that's based on libvhost-user or Rust's vhost-user-backend. Hotplug will
fail.

Nothing is really different with virtio-mem, except that you can configure
how many memslots it should actually use if you care about the above
situation.
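
To illustrate the "restrict, not consume" point, here is a minimal sketch
(invented names, not actual QEMU code): the effective limit is the minimum
across all backends, and plugging more memory only fails once that limit
is reached.

#include <stdbool.h>
#include <stddef.h>

/*
 * Sketch only: each vhost backend caps the total number of usable
 * memslots, so the effective limit is the minimum across all of
 * them; backends restrict the total rather than consuming slots.
 */
static unsigned int effective_memslot_limit(unsigned int kvm_limit,
                                            const unsigned int *vhost_limits,
                                            size_t nr_vhost)
{
    unsigned int limit = kvm_limit;     /* e.g. ~64k on upstream KVM */
    size_t i;

    for (i = 0; i < nr_vhost; i++) {
        if (vhost_limits[i] < limit) {
            limit = vhost_limits[i];    /* e.g. 32 for vhost-user */
        }
    }
    return limit;
}

/* Plugging one more DIMM/virtio-mem memslot only works below the limit: */
static bool can_plug_memslot(unsigned int used, unsigned int limit)
{
    return used < limit;
}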

> 
> I think virtio-mem configuration needs to be stable in its memslot
> usage regardless of how many other types of devices are present,
> and not auto-adjust how many it consumes.

There is a parameter to limit the number of memslots a virtio-mem device
can use ("max-memslots") to handle such corner-case environments as you
describe.

Set to 1 - exactly one ("old behavior").
Set to 0 - auto-detect.
Set to > 1 - auto-detect and cap at the given value.
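
As a rough sketch of those semantics (hypothetical names, not the code
from the virtio-mem series):

/*
 * Sketch of the "max-memslots" semantics above; not the actual
 * virtio-mem code.  auto_detect_memslots() is an assumed helper
 * returning a sensible count based on currently free memslots.
 */
unsigned int auto_detect_memslots(unsigned int free_memslots);

static unsigned int virtio_mem_decide_memslots(unsigned int max_memslots,
                                               unsigned int free_memslots)
{
    unsigned int wanted = auto_detect_memslots(free_memslots);

    if (max_memslots == 1) {
        return 1;                   /* exactly one ("old behavior") */
    }
    if (max_memslots == 0) {
        return wanted;              /* pure auto-detect */
    }
    /* auto-detect, capped at the user-provided value */
    return wanted < max_memslots ? wanted : max_memslots;
}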

99.999% of all users don't care about hotplugging memslot-limiting vhost
devices and will happily use "0". The remainder can be handled via
realization priority. Nothing to confuse ordinary users with, IMHO.

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 0/4] vl: Prioritize device realizations
  2021-10-20 13:44 ` [PATCH 0/4] vl: Prioritize device realizations David Hildenbrand
  2021-10-20 13:48   ` Daniel P. Berrangé
@ 2021-10-21  4:20   ` Peter Xu
  2021-10-21  7:17     ` David Hildenbrand
  1 sibling, 1 reply; 52+ messages in thread
From: Peter Xu @ 2021-10-21  4:20 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Daniel P . Berrangé,
	Eduardo Habkost, Michael S . Tsirkin, Jason Wang,
	Markus Armbruster, qemu-devel, Eric Auger, Alex Williamson,
	Paolo Bonzini, Dr . David Alan Gilbert

On Wed, Oct 20, 2021 at 03:44:08PM +0200, David Hildenbrand wrote:
> On 18.08.21 21:42, Peter Xu wrote:
> > This is a long pending issue that we haven't fixed.  The issue is in QEMU we
> > have implicit device ordering requirement when realizing, otherwise some of the
> > device may not work properly.
> > 
> > The initial requirement comes from when vfio-pci starts to work with vIOMMUs.
> > To make sure vfio-pci will get the correct DMA address space, the vIOMMU device
> > needs to be created before vfio-pci otherwise vfio-pci will stop working when
> > the guest enables the vIOMMU and the device at the same time.
> > 
> > AFAIU Libvirt should have code that guarantees that.  For QEMU cmdline users,
> > they need to pay attention or things will stop working at some point.
> > 
> > Recently there's a growing and similar requirement on vDPA.  It's not a hard
> > requirement so far but vDPA has patches that try to workaround this issue.
> > 
> > This patchset allows us to realize the devices in the order that e.g. platform
> > devices will be created first (bus device, IOMMU, etc.), then the rest of
> > normal devices.  It's done simply by ordering the QemuOptsList of "device"
> > entries before realization.  The priority so far comes from migration
> > priorities which could be a little bit odd, but that's really about the same
> > problem and we can clean that part up in the future.
> > 
> > Libvirt can still keep its ordering for sure so old QEMU will still work,
> > however that won't be needed for new qemus after this patchset, so with the new
> > binary we should be able to specify qemu cmdline as wish on '-device'.
> > 
> > Logically this should also work for vDPA and the workaround code can be done
> > with more straightforward approaches.
> > 
> > Please review, thanks.
> 
> Hi Peter, looks like I have another use case:

Hi, David,

> 
> vhost devices can heavily restrict the number of available memslots:
> e.g., upstream KVM ~64k, vhost-user usually 32 (!). With virtio-mem
> intending to make use of multiple memslots [1] and auto-detecting how
> many to use based on currently available memslots when plugging and
> realizing the virtio-mem device, this implies that realizing vhost
> devices (especially vhost-user device) after virtio-mem devices can
> similarly result in issues: when trying realization of the vhost device
> with restricted memslots, QEMU will bail out.
> 
> So similarly, we'd want to realize any vhost-* before any virtio-mem device.
> 
> Do you have any updated version of this patchset? Thanks!

Yes I should follow this up, thanks for asking.

Though after Markus and Igor pointed out to me that there is much more to
order than device and object types, I don't have a good way to fix the
ordering issue for good across all the problems; the current solution
obviously only applies to device-class ordering.
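
For reference, the device-class ordering boils down to something like the
following sketch (the series sorts the "device" QemuOptsList in place,
using the class's migration priority as the key; helper names here are
approximate, not the exact code):

static int device_opts_cmp(const void *a, const void *b)
{
    QemuOpts *opt_a = *(QemuOpts * const *)a;
    QemuOpts *opt_b = *(QemuOpts * const *)b;
    /* device_class_priority() is a stand-in for looking up the class
     * of the "driver" option and returning its realize priority */
    int prio_a = device_class_priority(qemu_opt_get(opt_a, "driver"));
    int prio_b = device_class_priority(qemu_opt_get(opt_b, "driver"));

    return prio_b - prio_a;     /* higher priority realizes first */
}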

Examples that Markus provided:

https://lore.kernel.org/qemu-devel/87ilzj81q7.fsf@dusky.pond.sub.org/

Also, there can be inter-dependency issues within a single device class,
e.g., for PCI buses: if bus pcie.2 has parent PCI bus pcie.1, then we must
specify the "-device" for pcie.1 before the "-device" for pcie.2, otherwise
QEMU will fail to boot too.  An illustrative command line is below.
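
For instance (illustrative options only):

  # works: the parent bus is specified before the device plugging into it
  qemu-system-x86_64 -M q35 ... \
      -device pcie-root-port,id=pcie.1,chassis=1 \
      -device pcie-pci-bridge,id=pcie.2,bus=pcie.1 \
      ...

  # swapping the two -device options makes QEMU fail at startup, because
  # bus "pcie.1" does not exist yet when pcie.2 is created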

Either of the above examples means that ordering based on device class can
only solve part of the problem, not all of it.

And I can buy into what people worry about regarding yet another way to fix
ordering, since the root issue is still unsettled, even if the current
solution seems to work for vIOMMU/vfio, and I have a feeling it could work
for the virtio-mem issue you're tackling too.

My plan is to move on with what Igor suggested: use the pre_plug hook of the
vIOMMU to make sure no special devices like vfio are realized before it.  I
think it'll be as silly as a PCI bus scan on all the PCIe host bridges
looking for vfio-pci; it could even be put directly into realize(), since I
don't see an obvious difference between failing in pre_plug() and failing in
realize() so far.  Then I'll just drop this series, so the new version may
not really help with virtio-mem anymore; though I'm not sure virtio-mem can
do a similar thing.  A sketch of such a scan is below.
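
A sketch of what such a scan could look like (assuming QEMU's
pci_for_each_device() iterator and the "vfio-pci" type name; error plumbing
trimmed, not from the actual patchset):

/*
 * Sketch: refuse to realize the vIOMMU if any vfio-pci device has
 * already been realized under this host bridge.
 */
static void find_vfio_pci(PCIBus *bus, PCIDevice *dev, void *opaque)
{
    bool *found = opaque;

    if (object_dynamic_cast(OBJECT(dev), "vfio-pci")) {
        *found = true;
    }
}

static void viommu_check_no_vfio(PCIBus *root, Error **errp)
{
    bool found = false;
    int bus_num;

    /* walk every possible bus number under this host bridge */
    for (bus_num = 0; bus_num < 256; bus_num++) {
        pci_for_each_device(root, bus_num, find_vfio_pci, &found);
    }
    if (found) {
        error_setg(errp, "vfio-pci must be specified after the vIOMMU");
    }
}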

One step back, OTOH, I do second what Daniel commented in the other thread
about leaving that problem to the user; it's sad to know that we already
have the pmem restriction so hotplugging some devices already starts to
fail, but maybe failing is fine as long as nothing crashes? :)

I also do think it's nice to at least allow the user to specify the exact
virtio-mem slot number, with no smart tricks played, when the user wants
that.  I think it's still okay to do automatic detection, but to me that's
already "policy" rather than "mechanism", so IMHO it had better be optional.
I now have a feeling that maybe QEMU should be good enough at providing
these mechanisms first, and then we leave the rest of the problems to
libvirt; maybe that's a better place to do all these sanity checks and the
smart decisions on slot numbers.  For QEMU, failing at the right point
without interrupting the guest seems to be good enough so far.

I think "early failing" seems to not be a problem for virtio-mem already since
if there's a conflict on the slot number then e.g. vhost-user will already fail
early, not sure whether it means it's good enough.  For vIOMMU I may need to
work on the other bus scanning patchset to make sure when vfio is specified
before vIOMMU then we should fail qemu early, and that's still missing.

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 0/4] vl: Prioritize device realizations
  2021-10-21  4:20   ` Peter Xu
@ 2021-10-21  7:17     ` David Hildenbrand
  2021-10-21  8:00       ` Peter Xu
  0 siblings, 1 reply; 52+ messages in thread
From: David Hildenbrand @ 2021-10-21  7:17 UTC (permalink / raw)
  To: Peter Xu
  Cc: Daniel P. Berrangé,
	Eduardo Habkost, Michael S . Tsirkin, Jason Wang,
	Markus Armbruster, qemu-devel, Eric Auger, Alex Williamson,
	Paolo Bonzini, Dr . David Alan Gilbert

> Yes I should follow this up, thanks for asking.
> 
> Though after Markus and Igor pointed out to me that there is much more to
> order than device and object types, I don't have a good way to fix the
> ordering issue for good across all the problems; the current solution
> obviously only applies to device-class ordering.
> 
> Examples that Markus provided:
> 
> https://lore.kernel.org/qemu-devel/87ilzj81q7.fsf@dusky.pond.sub.org/
> 
> Also, there can be inter-dependency issues within a single device class,
> e.g., for PCI buses: if bus pcie.2 has parent PCI bus pcie.1, then we must
> specify the "-device" for pcie.1 before the "-device" for pcie.2, otherwise
> QEMU will fail to boot too.
> 
Interesting case :)

> Either of the above examples means that ordering based on device class can
> only solve part of the problem, not all of it.
> 
> And I can buy into what people worry about regarding yet another way to fix
> ordering, since the root issue is still unsettled, even if the current
> solution seems to work for vIOMMU/vfio, and I have a feeling it could work
> for the virtio-mem issue you're tackling too.

The question is whether we need to get it all 100% right from the start. To
me, for example, the single-device-class issue is a whole different beast.

I know, whenever someone proposes a way to tackle part of a challenging
problem, everybody discovers their hopes and dreams and suddenly you
have to go all the way to solve the complete problem. The end result is
that there is no improvement at all instead of incremental improvement.

> 
> My plan is to move on with what Igor suggested: use the pre_plug hook of the
> vIOMMU to make sure no special devices like vfio are realized before it.  I
> think it'll be as silly as a PCI bus scan on all the PCIe host bridges
> looking for vfio-pci; it could even be put directly into realize(), since I
> don't see an obvious difference between failing in pre_plug() and failing in
> realize() so far.  Then I'll just drop this series, so the new version may
> not really help with virtio-mem anymore; though I'm not sure virtio-mem can
> do a similar thing.

In the case of virtio-mem, we'll already fail properly when realizing the
vhost-* device and figuring out that the memslot limit the device
imposes isn't going to work. So what you would want to have for the vIOMMU
is already in place (fail instead of silently continuing).

> 
> One step back, OTOH, I do second what Daniel commented in the other thread
> about leaving that problem to the user; it's sad to know that we already
> have the pmem restriction so hotplugging some devices already starts to
> fail, but maybe failing is fine as long as nothing crashes? :)

I very much agree.

There are two cases:

1. QEMU failing to start because vhost-* was specified after virtio-mem.
We get an indication that points at the number of memslots. And as the
user explicitly enabled it via e.g. "max-memslots=0", I think that's fair
enough.

2. Hotplug of vhost-* devices failing. We similarly get an indication
that points at the number of memslots, and again I think that's fair
enough. The guest will continue working just fine.

The result of that discussion is that the default should always be
"max-memslots=1" and that users have to opt in manually.

> 
> I also do think it's nice to at least allow the user to specify the exact
> virtio-mem slot number, with no smart tricks played, when the user wants
> that.  I think it's still okay to do automatic detection, but to me that's
> already "policy" rather than "mechanism", so IMHO it had better be optional.
> I now have a feeling that maybe QEMU should be good enough at providing
> these mechanisms first, and then we leave the rest of the problems to
> libvirt; maybe that's a better place to do all these sanity checks and the
> smart decisions on slot numbers.  For QEMU, failing at the right point
> without interrupting the guest seems to be good enough so far.

I'm not planning on letting the user set the actual number of memslots
to use, only an upper limit. But to me, it's fundamentally the same: the
user has to enable this behavior explicitly.

> 
> I think "failing early" already seems not to be a problem for virtio-mem,
> since if there's a conflict on the slot number then e.g. vhost-user will
> already fail early; I'm not sure whether that means it's good enough.  For
> vIOMMU I may need to work on the other bus-scanning patchset to make sure
> that QEMU fails early when vfio is specified before the vIOMMU, and that
> part is still missing.

Right, thanks!


-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 0/4] vl: Prioritize device realizations
  2021-10-21  7:17     ` David Hildenbrand
@ 2021-10-21  8:00       ` Peter Xu
  2021-10-21 16:54         ` David Hildenbrand
  0 siblings, 1 reply; 52+ messages in thread
From: Peter Xu @ 2021-10-21  8:00 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Daniel P. Berrangé,
	Eduardo Habkost, Michael S . Tsirkin, Jason Wang,
	Markus Armbruster, qemu-devel, Eric Auger, Alex Williamson,
	Paolo Bonzini, Dr . David Alan Gilbert

On Thu, Oct 21, 2021 at 09:17:57AM +0200, David Hildenbrand wrote:
> I know, whenever someone proposes a way to tackle part of a challenging
> problem, everybody discovers their hopes and dreams and suddenly you
> have to go all the way to solve the complete problem. The end result is
> that there is no improvement at all instead of incremental improvement.

Yeah, that's the trade-off; either we don't move forward, or we potentially
bring (more) chaos and make the code less maintainable.  Until I'm sure I
won't do the latter, and can convince the others, I need to hold off a
bit. :-)

> I'm not planning on letting the user set the actual number of memslots
> to use, only an upper limit. But to me, it's fundamentally the same: the
> user has to enable this behavior explicitly.

I'm not familiar enough with virtio-mem's side; it's just that if it stops
working when the ideal value (even in a very rare corner case) is less than
the specified maximum, then that trick stops people from specifying the
ideal value.  But if bigger is simply better, then indeed I don't see much
to worry about.

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 0/4] vl: Prioritize device realizations
  2021-10-21  8:00       ` Peter Xu
@ 2021-10-21 16:54         ` David Hildenbrand
  0 siblings, 0 replies; 52+ messages in thread
From: David Hildenbrand @ 2021-10-21 16:54 UTC (permalink / raw)
  To: Peter Xu
  Cc: Daniel P. Berrangé,
	Eduardo Habkost, Michael S . Tsirkin, Jason Wang,
	Markus Armbruster, qemu-devel, Eric Auger, Alex Williamson,
	Paolo Bonzini, Dr . David Alan Gilbert

On 21.10.21 10:00, Peter Xu wrote:
> On Thu, Oct 21, 2021 at 09:17:57AM +0200, David Hildenbrand wrote:
>> I know, whenever someone proposes a way to tackle part of a challenging
>> problem, everybody discovers their hopes and dreams and suddenly you
>> have to go all the way to solve the complete problem. The end result is
>> that there is no improvement at all instead of incremental improvement.
> 
> Yeah, that's the trade-off; either we don't move forward, or we potentially
> bring (more) chaos and make the code less maintainable.  Until I'm sure I
> won't do the latter, and can convince the others, I need to hold off a
> bit. :-)

Sure :)

>> I'm not planning on letting the user set the actual number of memslots
>> to use, only an upper limit. But to me, it's fundamentally the same: the
>> user has to enable this behavior explicitly.
> 
> I'm not familiar enough with virtio-mem's side; it's just that if it stops
> working when the ideal value (even in a very rare corner case) is less than
> the specified maximum, then that trick stops people from specifying the
> ideal value.  But if bigger is simply better, then indeed I don't see much
> to worry about.

Usually it's "the bigger the better", but there are a lot of exceptions,
and error handling on weird user input is a little hairy ... but I'm
playing with it right now, essentially having

"memslots=0" -> auto-detect as good as possible
"memslots=1" -> default
"memslots>1" -> use the user input, bail out if some conditions aren't met.
In particular, fail plugging if there are not sufficient free memslots around.
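
On the command line, usage could end up looking like this (hypothetical
spelling, subject to change):

  # auto-detect, use as many free memslots as sensible
  -device virtio-mem-pci,id=vmem0,memdev=mem0,memslots=0

  # default: behave like today, a single memslot
  -device virtio-mem-pci,id=vmem0,memdev=mem0,memslots=1

  # auto-detect, but cap at 8; fail plugging if the conditions aren't met
  -device virtio-mem-pci,id=vmem0,memdev=mem0,memslots=8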

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 52+ messages in thread

end of thread, other threads:[~2021-10-21 17:01 UTC | newest]

Thread overview: 52+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-08-18 19:42 [PATCH 0/4] vl: Prioritize device realizations Peter Xu
2021-08-18 19:42 ` [PATCH 1/4] qdev-monitor: Trace qdev creation Peter Xu
2021-08-18 19:43 ` [PATCH 2/4] qemu-config: Allow in-place sorting of QemuOptsList Peter Xu
2021-08-18 19:43 ` [PATCH 3/4] qdev: Export qdev_get_device_class() Peter Xu
2021-08-18 19:43 ` [PATCH 4/4] vl: Prioritize realizations of devices Peter Xu
2021-08-23 18:49   ` Eduardo Habkost
2021-08-23 19:18     ` Peter Xu
2021-08-23 21:07       ` Eduardo Habkost
2021-08-23 21:31         ` Peter Xu
2021-08-23 21:54           ` Michael S. Tsirkin
2021-08-23 22:51             ` Peter Xu
2021-08-23 21:56           ` Eduardo Habkost
2021-08-23 23:05             ` Peter Xu
2021-08-25  9:39               ` Markus Armbruster
2021-08-25 12:28                 ` Markus Armbruster
2021-08-25 21:50                   ` Peter Xu
2021-08-26  3:50                     ` Peter Xu
2021-08-26  8:01                       ` Markus Armbruster
2021-08-26 11:36                         ` Igor Mammedov
2021-08-26 13:43                           ` Peter Xu
2021-08-30 19:02                             ` Peter Xu
2021-08-31 11:35                               ` Markus Armbruster
2021-09-02  8:26                               ` Igor Mammedov
2021-09-02 13:45                                 ` Peter Xu
2021-09-02 13:53                                   ` Daniel P. Berrangé
2021-09-02 14:21                                     ` Peter Xu
2021-09-02 14:57                                       ` Markus Armbruster
2021-09-03 15:48                                         ` Peter Xu
2021-09-02 15:06                                       ` Daniel P. Berrangé
2021-09-02 15:26                                   ` Markus Armbruster
2021-09-03 13:00                                   ` Igor Mammedov
2021-09-03 16:03                                     ` Peter Xu
2021-09-06  8:49                                       ` Igor Mammedov
2021-09-02  7:46                             ` Igor Mammedov
2021-08-26  4:57                     ` Markus Armbruster
2021-08-23 22:05       ` Michael S. Tsirkin
2021-08-23 22:36         ` Peter Xu
2021-08-24  2:52           ` Jason Wang
2021-08-24 15:50             ` Peter Xu
2021-08-25  4:23               ` Jason Wang
2021-09-06  9:22                 ` Eric Auger
2021-08-24 16:24         ` David Hildenbrand
2021-08-24 19:52           ` Peter Xu
2021-08-25  8:08             ` David Hildenbrand
2021-08-24  2:51       ` Jason Wang
2021-10-20 13:44 ` [PATCH 0/4] vl: Prioritize device realizations David Hildenbrand
2021-10-20 13:48   ` Daniel P. Berrangé
2021-10-20 13:58     ` David Hildenbrand
2021-10-21  4:20   ` Peter Xu
2021-10-21  7:17     ` David Hildenbrand
2021-10-21  8:00       ` Peter Xu
2021-10-21 16:54         ` David Hildenbrand
