qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v3 0/3] Migration documentation
@ 2023-05-15  8:31 Juan Quintela
  2023-05-15  8:31 ` [PATCH v3 1/3] migration: Add documentation for backwards compatiblity Juan Quintela
                   ` (2 more replies)
  0 siblings, 3 replies; 20+ messages in thread
From: Juan Quintela @ 2023-05-15  8:31 UTC (permalink / raw)
  To: qemu-devel
  Cc: Michael S . Tsirkin, Leonardo Bras, Peter Xu, Jiri Denemark,
	Avihai Horon, Juan Quintela, Fiona Ebner,
	Daniel P . Berrangé

Hi

Continuing on my saga to not have a pointer about how migration
works/should work, this are the doc for the long mails that I had to do last week:

1 - Document how backwards compatibility is supposed to work (that is
    what it was on v1 and v2)

2 - Document how to handle features that depend of things outside of qemu
    a.k.a. we make this a libvirt/management app problem.

3 - Document how we are supposed to fix backwards compatibility
    brokenness.

You can see that I was able to find commits on tree to illustrate 1
and 3.  The problem for 2 appeared from a discussion with Avihai about
future vfio devices.  Do anyone have a good idea about a commit
already on tree to document it?

Please review.

Later, Juan.

Juan Quintela (3):
  migration: Add documentation for backwards compatiblity
  migration/docs: How to migrate when hosts have different features
  migration/doc: We broke backwards compatibility

 docs/devel/migration.rst | 503 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 503 insertions(+)

-- 
2.40.1



^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH v3 1/3] migration: Add documentation for backwards compatiblity
  2023-05-15  8:31 [PATCH v3 0/3] Migration documentation Juan Quintela
@ 2023-05-15  8:31 ` Juan Quintela
  2023-05-16 23:39   ` Peter Xu
  2023-05-15  8:32 ` [PATCH v3 2/3] migration/docs: How to migrate when hosts have different features Juan Quintela
  2023-05-15  8:32 ` [PATCH v3 3/3] migration/doc: We broke backwards compatibility Juan Quintela
  2 siblings, 1 reply; 20+ messages in thread
From: Juan Quintela @ 2023-05-15  8:31 UTC (permalink / raw)
  To: qemu-devel
  Cc: Michael S . Tsirkin, Leonardo Bras, Peter Xu, Jiri Denemark,
	Avihai Horon, Juan Quintela, Fiona Ebner,
	Daniel P . Berrangé,
	Vladimir Sementsov-Ogievskiy

State what are the requeriments to get migration working between qemu
versions.  And once there explain how one is supposed to implement a
new feature/default value and not break migration.

Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>
Message-Id: <20230511082701.12828-1-quintela@redhat.com>
Signed-off-by: Juan Quintela <quintela@redhat.com>
---
 docs/devel/migration.rst | 216 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 216 insertions(+)

diff --git a/docs/devel/migration.rst b/docs/devel/migration.rst
index 6f65c23b47..b4c4f3ec35 100644
--- a/docs/devel/migration.rst
+++ b/docs/devel/migration.rst
@@ -142,6 +142,222 @@ General advice for device developers
   may be different on the destination.  This can result in the
   device state being loaded into the wrong device.
 
+How backwards compatibility works
+---------------------------------
+
+When we do migration, we have to QEMU process: the source and the
+target.  There are two cases, they are the same version or they are a
+different version.  The easy case is when they are the same version.
+The difficult one is when they are different versions.
+
+There are two things that are different, but they have very similar
+names and sometimes get confused:
+- QEMU version
+- machine version
+
+Let's start with a practical example, we start with:
+
+- qemu-system-x86_64 (v5.2), from now on qemu-5.2.
+- qemu-system-x86_64 (v5.1), from now on qemu-5.1.
+
+Related to this are the "latest" machine types defined on each of
+them:
+
+- pc-q35-5.2 (newer one in qemu-5.2) from now on pc-5.2
+- pc-q35-5.1 (newer one in qemu-5.1) from now on pc-5.1
+
+First of all, migration is only supposed to work if you use the same
+machine type in both source and destination. The QEMU hardware
+configuration needs to be the same also on source and destination.
+Most aspects of the backend configuration can be changed at will,
+except for a few cases where the backend features influence frontend
+device feature exposure.  But that is not relevant for this section.
+
+I am going to list the number of combinations that we can have.  Let's
+start with the trivial ones, QEMU is the same on source and
+destination:
+
+1 - qemu-5.2 -M pc-5.2  -> migrates to -> qemu-5.2 -M pc-5.2
+
+  This is the latest QEMU with the latest machine type.
+  This have to work, and if it doesn't work it is a bug.
+
+2 - qemu-5.1 -M pc-5.1  -> migrates to -> qemu-5.1 -M pc-5.1
+
+  Exactly the same case than the previous one, but for 5.1.
+  Nothing to see here either.
+
+This are the easiest ones, we will not talk more about them in this
+section.
+
+Now we start with the more interesting cases.  Consider the case where
+we have the same QEMU version in both sides (qemu-5.2) but we are using
+the latest machine type for that version (pc-5.2) but one of an older
+QEMU version, in this case pc-5.1.
+
+3 - qemu-5.2 -M pc-5.1  -> migrates to -> qemu-5.2 -M pc-5.1
+
+  It needs to use the definition of pc-5.1 and the devices as they
+  were configured on 5.1, but this should be easy in the sense that
+  both sides are the same QEMU and both sides have exactly the same
+  idea of what the pc-5.1 machine is.
+
+4 - qemu-5.1 -M pc-5.2  -> migrates to -> qemu-5.1 -M pc-5.2
+
+  This combination is not possible as the qemu-5.1 doen't understand
+  pc-5.2 machine type.  So nothing to worry here.
+
+Now it comes the interesting ones, when both QEMU processes are
+different.  Notice also that the machine type needs to be pc-5.1,
+because we have the limitation than qemu-5.1 doesn't know pc-5.2.  So
+the possible cases are:
+
+5 - qemu-5.2 -M pc-5.1  -> migrates to -> qemu-5.1 -M pc-5.1
+
+  This migration is known as newer to older.  We need to make sure
+  when we are developing 5.2 we need to take care about not to break
+  migration to qemu-5.1.  Notice that we can't make updates to
+  qemu-5.1 to understand whatever qemu-5.2 decides to change, so it is
+  in qemu-5.2 side to make the relevant changes.
+
+6 - qemu-5.1 -M pc-5.1  -> migrates to -> qemu-5.2 -M pc-5.1
+
+  This migration is known as older to newer.  We need to make sure
+  than we are able to receive migrations from qemu-5.1. The problem is
+  similar to the previous one.
+
+If qemu-5.1 and qemu-5.2 were the same, there will not be any
+compatibility problems.  But the reason that we create qemu-5.2 is to
+get new features, devices, defaults, etc.
+
+If we get a device that has a new feature, or change a default value,
+we have a problem when we try to migrate between different QEMU
+versions.
+
+So we need a way to tell qemu-5.2 that when we are using machine type
+pc-5.1, it needs to **not** use the feature, to be able to migrate to
+real qemu-5.1.
+
+And the equivalent part when migrating from qemu-5.1 to qemu-5.2.
+qemu-5.2 has to expect that it is not going to get data for the new
+feature, because qemu-5.1 doesn't know about it.
+
+How do we tell QEMU about these device feature changes?  In
+hw/core/machine.c:hw_compat_X_Y arrays.
+
+If we change a default value, we need to put back the old value on
+that array.  And the device, during initialization needs to look at
+that array to see what value it needs to get for that feature.  And
+what are we going to put in that array, the value of a property.
+
+To create a property for a device, we need to use one of the
+DEFINE_PROP_*() macros. See include/hw/qdev-properties.h to find the
+macros that exist.  With it, we set the default value for that
+property, and that is what it is going to get in the latest released
+version.  But if we want a different value for a previous version, we
+can change that in the hw_compat_X_Y arrays.
+
+hw_compat_X_Y is an array of registers that have the format:
+
+- name_device
+- name_property
+- value
+
+Let's see a practical example.
+
+In qemu-5.2 virtio-blk-device got multi queue support.  This is a
+change that is not backward compatible.  In qemu-5.1 it has one
+queue. In qemu-5.2 it has the same number of queues as the number of
+cpus in the system.
+
+When we are doing migration, if we migrate from a device that has 4
+queues to a device that have only one queue, we don't know where to
+put the extra information for the other 3 queues, and we fail
+migration.
+
+Similar problem when we migrate from qemu-5.1 that has only one queue
+to qemu-5.2, we only sent information for one queue, but destination
+has 4, and we have 3 queues that are not properly initialized and
+anything can happen.
+
+So, how can we address this problem.  Easy, just convince qemu-5.2
+that when it is running pc-5.1, it needs to set the number of queues
+for virtio-blk-devices to 1.
+
+That way we fix the cases 5 and 6.
+
+5 - qemu-5.2 -M pc-5.1  -> migrates to -> qemu-5.1 -M pc-5.1
+
+    qemu-5.2 -M pc-5.1 sets number of queues to be 1.
+    qemu-5.1 -M pc-5.1 expects number of queues to be 1.
+
+    correct.  migration works.
+
+6 - qemu-5.1 -M pc-5.1  -> migrates to -> qemu-5.2 -M pc-5.1
+
+    qemu-5.1 -M pc-5.1 sets number of queues to be 1.
+    qemu-5.2 -M pc-5.1 expects number of queues to be 1.
+
+    correct.  migration works.
+
+And now the other interesting case, case 3.  In this case we have:
+
+3 - qemu-5.2 -M pc-5.1  -> migrates to -> qemu-5.2 -M pc-5.1
+
+    Here we have the same QEMU in both sides.  So it doesn't matter a
+    lot if we have set the number of queues to 1 or not, because
+    they are the same.
+
+    WRONG!
+
+    Think what happens if we do one of this double migrations:
+
+    A -> migrates -> B -> migrates -> C
+
+    where:
+
+    A: qemu-5.1 -M pc-5.1
+    B: qemu-5.2 -M pc-5.1
+    C: qemu-5.2 -M pc-5.1
+
+    migration A -> B is case 6, so number of queues needs to be 1.
+
+    migration B -> C is case 3, so we don't care.  But actually we
+    care because we haven't started the guest in qemu-5.2, it came
+    migrated from qemu-5.1.  So to be in the safe place, we need to
+    always use number of queues 1 when we are using pc-5.1.
+
+Now, how was this done in reality?  The following commit shows how it
+was done.
+
+commit 9445e1e15e66c19e42bea942ba810db28052cd05
+Author: Stefan Hajnoczi <stefanha@redhat.com>
+Date:   Tue Aug 18 15:33:47 2020 +0100
+
+    virtio-blk-pci: default num_queues to -smp N
+
+The relevant parts for migration are: ::
+
+    @@ -1281,7 +1284,8 @@ static Property virtio_blk_properties[] = {
+     #endif
+         DEFINE_PROP_BIT("request-merging", VirtIOBlock, conf.request_merging, 0,
+                         true),
+    -    DEFINE_PROP_UINT16("num-queues", VirtIOBlock, conf.num_queues, 1),
+    +    DEFINE_PROP_UINT16("num-queues", VirtIOBlock, conf.num_queues,
+    +                       VIRTIO_BLK_AUTO_NUM_QUEUES),
+         DEFINE_PROP_UINT16("queue-size", VirtIOBlock, conf.queue_size, 256),
+
+It changes the default value of num_queues.  But it fishes it for old
+machine types to have the right value: ::
+
+    @@ -31,6 +31,7 @@
+     GlobalProperty hw_compat_5_1[] = {
+         ...
+    +    { "virtio-blk-device", "num-queues", "1"},
+         ...
+     };
+
+
 VMState
 -------
 
-- 
2.40.1



^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH v3 2/3] migration/docs: How to migrate when hosts have different features
  2023-05-15  8:31 [PATCH v3 0/3] Migration documentation Juan Quintela
  2023-05-15  8:31 ` [PATCH v3 1/3] migration: Add documentation for backwards compatiblity Juan Quintela
@ 2023-05-15  8:32 ` Juan Quintela
  2023-05-16 23:51   ` Peter Xu
  2023-05-17 10:23   ` Michael S. Tsirkin
  2023-05-15  8:32 ` [PATCH v3 3/3] migration/doc: We broke backwards compatibility Juan Quintela
  2 siblings, 2 replies; 20+ messages in thread
From: Juan Quintela @ 2023-05-15  8:32 UTC (permalink / raw)
  To: qemu-devel
  Cc: Michael S . Tsirkin, Leonardo Bras, Peter Xu, Jiri Denemark,
	Avihai Horon, Juan Quintela, Fiona Ebner,
	Daniel P . Berrangé

Sometimes devices have different features depending of things outside
of qemu.  For instance the kernel.  Document how to handle that cases.

Signed-off-by: Juan Quintela <quintela@redhat.com>

---

If you have some example to put here, I am all ears.  I guess that
virtio-* with some features that are on qemu but not on all kernel
would do the trick, but I am not a virtio guru myself.  Patches
welcome.
---
 docs/devel/migration.rst | 93 ++++++++++++++++++++++++++++++++++++++++
 1 file changed, 93 insertions(+)

diff --git a/docs/devel/migration.rst b/docs/devel/migration.rst
index b4c4f3ec35..95e797ee60 100644
--- a/docs/devel/migration.rst
+++ b/docs/devel/migration.rst
@@ -357,6 +357,99 @@ machine types to have the right value: ::
          ...
      };
 
+A device with diferent features on both sides
+---------------------------------------------
+
+Let's assume that we are using the same QEMU binary on both sides,
+just to make the things easier.  But we have a device that has
+different features on both sides of the migration.  That can be
+because the devices are different, because the kernel driver of both
+devices have different features, whatever.
+
+How can we get this to work with migration.  The way to do that is
+"theoretically" easy.  You have to get the features that the device
+has in the source of the migration.  The features that the device has
+on the target of the migration, you get the intersection of the
+features of both sides, and that is the way that you should launch
+qemu.
+
+Notice that this is not completely related to qemu.  The most
+important thing here is that this should be handle by the managing
+application that launches qemu.  If qemu is configured correctly, the
+migration will suceeed.
+
+Once that we have defined that, doing this is complicated.  Almost all
+devices are bad at being able to be launched with only some features
+enabled.  With one big exception: cpus.
+
+You can read the documentation for QEMU x86 cpu models here:
+
+https://qemu-project.gitlab.io/qemu/system/qemu-cpu-models.html
+
+See when they talk about migration they recommend that one chooses the
+newest cpu model that is supported for all cpus.
+
+Let's say that we have:
+
+Host A:
+
+Device X has the feature Y
+
+Host B:
+
+Device X has not the feature Y
+
+If we try to migrate without any care from host A to host B, it will
+fail because when migration tries to load the feature Y on
+destination, it will find that the hardware is not there.
+
+Doing this would be the equivalent of doing with cpus:
+
+Host A:
+
+$ qemu-system-x86_64 -cpu host
+
+Host B:
+
+$ qemu-system-x86_64 -cpu host
+
+When both hosts have different cpu features this is waranteed to fail.
+Especially if Host B has less features than host A.  If host A has
+less features than host B, sometimes it works.  Important word of last
+sentence is "sometimes".
+
+So, forgetting about cpu models and continuing with the -cpu host
+example, let's see that the differences of the cpus is that Host A and
+B have the following features:
+
+Features:   'pcid'  'stibp' 'taa-no'
+Host A:        X       X
+Host B:                        X
+
+And we want to migrate between them, the way configure both qemu cpu
+will be:
+
+Host A:
+
+$ qemu-system-x86_64 -cpu host,pcid=off,stibp=off
+
+Host B:
+
+$ qemu-system-x86_64 -cpu host,taa-no=off
+
+And you would be able to migrate between them.  It is responsability
+of the management application or of the user to make sure that the
+configuration is correct.  QEMU don't know how to look at this kind of
+features in general.
+
+Other devices have worse control about individual features.  If they
+want to be able to migrate between hosts that show different features,
+the device needs a way to configure which ones it is going to use.
+
+In this section we have considered that we are using the same QEMU
+binary in both sides of the migration.  If we use different QEMU
+versions process, then we need to have into account all other
+differences and the examples become even more complicated.
 
 VMState
 -------
-- 
2.40.1



^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH v3 3/3] migration/doc: We broke backwards compatibility
  2023-05-15  8:31 [PATCH v3 0/3] Migration documentation Juan Quintela
  2023-05-15  8:31 ` [PATCH v3 1/3] migration: Add documentation for backwards compatiblity Juan Quintela
  2023-05-15  8:32 ` [PATCH v3 2/3] migration/docs: How to migrate when hosts have different features Juan Quintela
@ 2023-05-15  8:32 ` Juan Quintela
  2023-05-17  0:03   ` Peter Xu
                     ` (2 more replies)
  2 siblings, 3 replies; 20+ messages in thread
From: Juan Quintela @ 2023-05-15  8:32 UTC (permalink / raw)
  To: qemu-devel
  Cc: Michael S . Tsirkin, Leonardo Bras, Peter Xu, Jiri Denemark,
	Avihai Horon, Juan Quintela, Fiona Ebner,
	Daniel P . Berrangé

When we detect that we have broken backwards compantibility in a
released version, we can't do anything for that version.  But once we
fix that bug on the next released version, we can "mitigate" that
problem when migrating to new versions to give a way out of that
machine until it does a hard reboot.

Signed-off-by: Juan Quintela <quintela@redhat.com>
---
 docs/devel/migration.rst | 194 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 194 insertions(+)

diff --git a/docs/devel/migration.rst b/docs/devel/migration.rst
index 95e797ee60..97b6f48474 100644
--- a/docs/devel/migration.rst
+++ b/docs/devel/migration.rst
@@ -451,6 +451,200 @@ binary in both sides of the migration.  If we use different QEMU
 versions process, then we need to have into account all other
 differences and the examples become even more complicated.
 
+How to mitigate when we have a backward compatibility error
+-----------------------------------------------------------
+
+We broke migration for old machine types continously during
+development.  But as soon as we find that there is a problem, we fix
+it.  The problem is what happens when we detect after we have done a
+release that something has gone wrong.
+
+Let see how it worked with one example.
+
+After the release of qemu-8.0 we found a problem when doing migration
+of the machine type pc-7.2.
+
+- $ qemu-7.2 -M pc-7.2  ->  qemu-7.2 -M pc-7.2
+
+  This migration works
+
+- $ qemu-8.0 -M pc-7.2  ->  qemu-8.0 -M pc-7.2
+
+  This migration works
+
+- $ qemu-8.0 -M pc-7.2  ->  qemu-7.2 -M pc-7.2
+
+  This migration fails
+
+- $ qemu-7.2 -M pc-7.2  ->  qemu-8.0 -M pc-7.2
+
+  This migration fails
+
+So clearly something fails when migration between qemu-7.2 and
+qemu-8.0 with machine type pc-7.2.  The error messages, and git bisect
+pointed to this commit.
+
+In qemu-8.0 we got this commit: ::
+
+    commit 9a6ef182c03eaa138bae553f0fbb5a123bef9a53
+    Author: Jonathan Cameron <Jonathan.Cameron@huawei.com>
+    Date:   Thu Mar 2 13:37:03 2023 +0000
+
+        hw/pci/aer: Add missing routing for AER errors
+
+The relevant bits of the commit for our example are this ones:
+
+    --- a/hw/pci/pcie_aer.c
+    +++ b/hw/pci/pcie_aer.c
+    @@ -112,6 +112,10 @@ int pcie_aer_init(PCIDevice *dev,
+
+         pci_set_long(dev->w1cmask + offset + PCI_ERR_UNCOR_STATUS,
+                      PCI_ERR_UNC_SUPPORTED);
+    +    pci_set_long(dev->config + offset + PCI_ERR_UNCOR_MASK,
+    +                 PCI_ERR_UNC_MASK_DEFAULT);
+    +    pci_set_long(dev->wmask + offset + PCI_ERR_UNCOR_MASK,
+    +                 PCI_ERR_UNC_SUPPORTED);
+
+         pci_set_long(dev->config + offset + PCI_ERR_UNCOR_SEVER,
+                     PCI_ERR_UNC_SEVERITY_DEFAULT);
+
+The patch changes how we configure pci space for AER.  But qemu fails
+when the pci space configuration is different betwwen source and
+destination.
+
+The following commit show how this got fixed:
+
+<put info of the commit once that it arrives upstream>
+
+The relevant parts of the fix are as follow:
+
+First, we create a new property for the device to be able to configure
+the old behaviour or the new behaviour. ::
+
+    diff --git a/hw/pci/pci.c b/hw/pci/pci.c
+    index 8a87ccc8b0..5153ad63d6 100644
+    --- a/hw/pci/pci.c
+    +++ b/hw/pci/pci.c
+    @@ -79,6 +79,8 @@ static Property pci_props[] = {
+         DEFINE_PROP_STRING("failover_pair_id", PCIDevice,
+                            failover_pair_id),
+         DEFINE_PROP_UINT32("acpi-index",  PCIDevice, acpi_index, 0),
+    +    DEFINE_PROP_BIT("x-pcie-err-unc-mask", PCIDevice, cap_present,
+    +                    QEMU_PCIE_ERR_UNC_MASK_BITNR, true),
+         DEFINE_PROP_END_OF_LIST()
+     };
+
+Notice that we enable te feature for new machine types.
+
+Now we see how the fix is done.  This is going to depend on what kind
+of breakage happens, but in this case it is quite simple. ::
+
+    diff --git a/hw/pci/pcie_aer.c b/hw/pci/pcie_aer.c
+    index 103667c368..374d593ead 100644
+    --- a/hw/pci/pcie_aer.c
+    +++ b/hw/pci/pcie_aer.c
+    @@ -112,10 +112,13 @@ int pcie_aer_init(PCIDevice *dev, uint8_t cap_ver,
+    uint16_t offset,
+
+         pci_set_long(dev->w1cmask + offset + PCI_ERR_UNCOR_STATUS,
+                      PCI_ERR_UNC_SUPPORTED);
+    -    pci_set_long(dev->config + offset + PCI_ERR_UNCOR_MASK,
+    -                 PCI_ERR_UNC_MASK_DEFAULT);
+    -    pci_set_long(dev->wmask + offset + PCI_ERR_UNCOR_MASK,
+    -                 PCI_ERR_UNC_SUPPORTED);
+    +
+    +    if (dev->cap_present & QEMU_PCIE_ERR_UNC_MASK) {
+    +        pci_set_long(dev->config + offset + PCI_ERR_UNCOR_MASK,
+    +                     PCI_ERR_UNC_MASK_DEFAULT);
+    +        pci_set_long(dev->wmask + offset + PCI_ERR_UNCOR_MASK,
+    +                     PCI_ERR_UNC_SUPPORTED);
+    +    }
+
+         pci_set_long(dev->config + offset + PCI_ERR_UNCOR_SEVER,
+                      PCI_ERR_UNC_SEVERITY_DEFAULT);
+
+I.e. If the property bit is enabled, we configure it as we did for
+qemu-8.0.  If the property bit is not set, we configure it as it was in 7.2.
+
+And now, everything that is missing is disable the feature for old
+machine types: ::
+
+    diff --git a/hw/core/machine.c b/hw/core/machine.c
+    index 47a34841a5..07f763eb2e 100644
+    --- a/hw/core/machine.c
+    +++ b/hw/core/machine.c
+    @@ -48,6 +48,7 @@ GlobalProperty hw_compat_7_2[] = {
+         { "e1000e", "migrate-timadj", "off" },
+         { "virtio-mem", "x-early-migration", "false" },
+         { "migration", "x-preempt-pre-7-2", "true" },
+    +    { TYPE_PCI_DEVICE, "x-pcie-err-unc-mask", "off" },
+     };
+     const size_t hw_compat_7_2_len = G_N_ELEMENTS(hw_compat_7_2);
+
+And now, when qemu-8.0.1 is released with this fix, all combinations
+are going to work as supposed.
+
+- $ qemu-7.2 -M pc-7.2  ->  qemu-7.2 -M pc-7.2 (works)
+- $ qemu-8.0.1 -M pc-7.2  ->  qemu-8.0.1 -M pc-7.2 (works)
+- $ qemu-8.0.1 -M pc-7.2  ->  qemu-7.2 -M pc-7.2 (works)
+- $ qemu-7.2 -M pc-7.2  ->  qemu-8.0.1 -M pc-7.2 (works)
+
+So the normality has been restaured and everything is ok, no?
+
+Not really, now our matrix is much bigger.  We started with the easy
+cases, migration from the same version to the same version always
+works:
+
+- $ qemu-7.2 -M pc-7.2  ->  qemu-7.2 -M pc-7.2
+- $ qemu-8.0 -M pc-7.2  ->  qemu-8.0 -M pc-7.2
+- $ qemu-8.0.1 -M pc-7.2  ->  qemu-8.0.1 -M pc-7.2
+
+Now the interesting ones.  When the QEMU processes versions are
+different.  For the 1st set, their fail and we can do nothing, both
+versions are relased and we can't change anything.
+
+- $ qemu-7.2 -M pc-7.2  ->  qemu-8.0 -M pc-7.2
+- $ qemu-8.0 -M pc-7.2  ->  qemu-7.2 -M pc-7.2
+
+This two are the ones that work. The whole point of making the
+change in qemu-8.0.1 release was to fix this issue:
+
+- $ qemu-7.2 -M pc-7.2  ->  qemu-8.0.1 -M pc-7.2
+- $ qemu-8.0.1 -M pc-7.2  ->  qemu-7.2 -M pc-7.2
+
+But now we found that qemu-8.0 neither can migrate to qemu-7.2 not
+qemu-8.0.1.
+
+- $ qemu-8.0 -M pc-7.2  ->  qemu-8.0.1 -M pc-7.2
+- $ qemu-8.0.1 -M pc-7.2  ->  qemu-8.0 -M pc-7.2
+
+So, if we start a pc-7.2 machine in qemu-8.0 we can't migrate it to
+anything except to qemu-8.0.
+
+Can we do better?
+
+Yeap.  If we know that we are gonig to do this migration:
+
+- $ qemu-8.0 -M pc-7.2  ->  qemu-8.0.1 -M pc-7.2
+
+We can launche the appropiate devices with
+
+--device...,x-pci-e-err-unc-mask=on
+
+And now we can receive a migration from 8.0.  And from now on, we can
+do that migration to new machine types if we remember to enable that
+property for pc-7.2.  Notice that we need to remember, it is not
+enough to know that the source of the migration is qemu-8.0.  Think of this example:
+
+$ qemu-8.0 -M pc-7.2 -> qemu-8.0.1 -M pc-7.2 -> qemu-8.2 -M pc-7.2
+
+In the second migration, the source is not qemu-8.0, but we still have
+that "problem" and have that property enabled.  Notice that we need to
+continue having this mark/property until we have this machine
+rebooted.  But it is not a normal reboot (that don't reload qemu) we
+need the mapchine to poweroff/poweron on a fixed qemu.  And from now
+on we can use the proper real machine.
+
 VMState
 -------
 
-- 
2.40.1



^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [PATCH v3 1/3] migration: Add documentation for backwards compatiblity
  2023-05-15  8:31 ` [PATCH v3 1/3] migration: Add documentation for backwards compatiblity Juan Quintela
@ 2023-05-16 23:39   ` Peter Xu
  2023-05-18  1:47     ` Xiaoyao Li
  2023-10-23 11:09     ` Juan Quintela
  0 siblings, 2 replies; 20+ messages in thread
From: Peter Xu @ 2023-05-16 23:39 UTC (permalink / raw)
  To: Juan Quintela
  Cc: qemu-devel, Michael S . Tsirkin, Leonardo Bras, Jiri Denemark,
	Avihai Horon, Fiona Ebner, Daniel P . Berrangé,
	Vladimir Sementsov-Ogievskiy

On Mon, May 15, 2023 at 10:31:59AM +0200, Juan Quintela wrote:
> State what are the requeriments to get migration working between qemu
> versions.  And once there explain how one is supposed to implement a
> new feature/default value and not break migration.
> 
> Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>
> Message-Id: <20230511082701.12828-1-quintela@redhat.com>
> Signed-off-by: Juan Quintela <quintela@redhat.com>
> ---
>  docs/devel/migration.rst | 216 +++++++++++++++++++++++++++++++++++++++
>  1 file changed, 216 insertions(+)
> 
> diff --git a/docs/devel/migration.rst b/docs/devel/migration.rst
> index 6f65c23b47..b4c4f3ec35 100644
> --- a/docs/devel/migration.rst
> +++ b/docs/devel/migration.rst
> @@ -142,6 +142,222 @@ General advice for device developers
>    may be different on the destination.  This can result in the
>    device state being loaded into the wrong device.
>  
> +How backwards compatibility works
> +---------------------------------
> +
> +When we do migration, we have to QEMU process: the source and the

s/to/two/, s/process/processes/

> +target.  There are two cases, they are the same version or they are a
> +different version.

s/a different version/different versions/

> +The easy case is when they are the same version.
> +The difficult one is when they are different versions.
> +
> +There are two things that are different, but they have very similar
> +names and sometimes get confused:

(space)

> +- QEMU version
> +- machine version

It's normally called "machine type", so maybe use that?  Or just "machine
version / machine type"?

> +
> +Let's start with a practical example, we start with:
> +
> +- qemu-system-x86_64 (v5.2), from now on qemu-5.2.
> +- qemu-system-x86_64 (v5.1), from now on qemu-5.1.
> +
> +Related to this are the "latest" machine types defined on each of
> +them:
> +
> +- pc-q35-5.2 (newer one in qemu-5.2) from now on pc-5.2
> +- pc-q35-5.1 (newer one in qemu-5.1) from now on pc-5.1
> +
> +First of all, migration is only supposed to work if you use the same
> +machine type in both source and destination. The QEMU hardware
> +configuration needs to be the same also on source and destination.
> +Most aspects of the backend configuration can be changed at will,
> +except for a few cases where the backend features influence frontend
> +device feature exposure.  But that is not relevant for this section.
> +
> +I am going to list the number of combinations that we can have.  Let's
> +start with the trivial ones, QEMU is the same on source and
> +destination:
> +
> +1 - qemu-5.2 -M pc-5.2  -> migrates to -> qemu-5.2 -M pc-5.2
> +
> +  This is the latest QEMU with the latest machine type.
> +  This have to work, and if it doesn't work it is a bug.
> +
> +2 - qemu-5.1 -M pc-5.1  -> migrates to -> qemu-5.1 -M pc-5.1
> +
> +  Exactly the same case than the previous one, but for 5.1.
> +  Nothing to see here either.
> +
> +This are the easiest ones, we will not talk more about them in this
> +section.
> +
> +Now we start with the more interesting cases.  Consider the case where
> +we have the same QEMU version in both sides (qemu-5.2) but we are using
> +the latest machine type for that version (pc-5.2) but one of an older
> +QEMU version, in this case pc-5.1.
> +
> +3 - qemu-5.2 -M pc-5.1  -> migrates to -> qemu-5.2 -M pc-5.1
> +
> +  It needs to use the definition of pc-5.1 and the devices as they
> +  were configured on 5.1, but this should be easy in the sense that
> +  both sides are the same QEMU and both sides have exactly the same
> +  idea of what the pc-5.1 machine is.
> +
> +4 - qemu-5.1 -M pc-5.2  -> migrates to -> qemu-5.1 -M pc-5.2
> +
> +  This combination is not possible as the qemu-5.1 doen't understand
> +  pc-5.2 machine type.  So nothing to worry here.
> +
> +Now it comes the interesting ones, when both QEMU processes are
> +different.  Notice also that the machine type needs to be pc-5.1,
> +because we have the limitation than qemu-5.1 doesn't know pc-5.2.  So
> +the possible cases are:
> +
> +5 - qemu-5.2 -M pc-5.1  -> migrates to -> qemu-5.1 -M pc-5.1
> +
> +  This migration is known as newer to older.  We need to make sure
> +  when we are developing 5.2 we need to take care about not to break
> +  migration to qemu-5.1.  Notice that we can't make updates to
> +  qemu-5.1 to understand whatever qemu-5.2 decides to change, so it is
> +  in qemu-5.2 side to make the relevant changes.
> +
> +6 - qemu-5.1 -M pc-5.1  -> migrates to -> qemu-5.2 -M pc-5.1
> +
> +  This migration is known as older to newer.  We need to make sure
> +  than we are able to receive migrations from qemu-5.1. The problem is
> +  similar to the previous one.
> +
> +If qemu-5.1 and qemu-5.2 were the same, there will not be any
> +compatibility problems.  But the reason that we create qemu-5.2 is to
> +get new features, devices, defaults, etc.
> +
> +If we get a device that has a new feature, or change a default value,
> +we have a problem when we try to migrate between different QEMU
> +versions.
> +
> +So we need a way to tell qemu-5.2 that when we are using machine type
> +pc-5.1, it needs to **not** use the feature, to be able to migrate to
> +real qemu-5.1.
> +
> +And the equivalent part when migrating from qemu-5.1 to qemu-5.2.
> +qemu-5.2 has to expect that it is not going to get data for the new
> +feature, because qemu-5.1 doesn't know about it.
> +
> +How do we tell QEMU about these device feature changes?  In
> +hw/core/machine.c:hw_compat_X_Y arrays.
> +
> +If we change a default value, we need to put back the old value on
> +that array.  And the device, during initialization needs to look at
> +that array to see what value it needs to get for that feature.  And
> +what are we going to put in that array, the value of a property.
> +
> +To create a property for a device, we need to use one of the
> +DEFINE_PROP_*() macros. See include/hw/qdev-properties.h to find the
> +macros that exist.  With it, we set the default value for that
> +property, and that is what it is going to get in the latest released
> +version.  But if we want a different value for a previous version, we
> +can change that in the hw_compat_X_Y arrays.
> +
> +hw_compat_X_Y is an array of registers that have the format:
> +
> +- name_device
> +- name_property
> +- value
> +
> +Let's see a practical example.
> +
> +In qemu-5.2 virtio-blk-device got multi queue support.  This is a
> +change that is not backward compatible.  In qemu-5.1 it has one
> +queue. In qemu-5.2 it has the same number of queues as the number of
> +cpus in the system.
> +
> +When we are doing migration, if we migrate from a device that has 4
> +queues to a device that have only one queue, we don't know where to
> +put the extra information for the other 3 queues, and we fail
> +migration.
> +
> +Similar problem when we migrate from qemu-5.1 that has only one queue
> +to qemu-5.2, we only sent information for one queue, but destination
> +has 4, and we have 3 queues that are not properly initialized and
> +anything can happen.
> +
> +So, how can we address this problem.  Easy, just convince qemu-5.2
> +that when it is running pc-5.1, it needs to set the number of queues
> +for virtio-blk-devices to 1.
> +
> +That way we fix the cases 5 and 6.
> +
> +5 - qemu-5.2 -M pc-5.1  -> migrates to -> qemu-5.1 -M pc-5.1
> +
> +    qemu-5.2 -M pc-5.1 sets number of queues to be 1.
> +    qemu-5.1 -M pc-5.1 expects number of queues to be 1.
> +
> +    correct.  migration works.
> +
> +6 - qemu-5.1 -M pc-5.1  -> migrates to -> qemu-5.2 -M pc-5.1
> +
> +    qemu-5.1 -M pc-5.1 sets number of queues to be 1.
> +    qemu-5.2 -M pc-5.1 expects number of queues to be 1.
> +
> +    correct.  migration works.
> +
> +And now the other interesting case, case 3.  In this case we have:
> +
> +3 - qemu-5.2 -M pc-5.1  -> migrates to -> qemu-5.2 -M pc-5.1
> +
> +    Here we have the same QEMU in both sides.  So it doesn't matter a
> +    lot if we have set the number of queues to 1 or not, because
> +    they are the same.
> +
> +    WRONG!
> +
> +    Think what happens if we do one of this double migrations:
> +
> +    A -> migrates -> B -> migrates -> C
> +
> +    where:
> +
> +    A: qemu-5.1 -M pc-5.1
> +    B: qemu-5.2 -M pc-5.1
> +    C: qemu-5.2 -M pc-5.1
> +
> +    migration A -> B is case 6, so number of queues needs to be 1.
> +
> +    migration B -> C is case 3, so we don't care.  But actually we
> +    care because we haven't started the guest in qemu-5.2, it came
> +    migrated from qemu-5.1.  So to be in the safe place, we need to
> +    always use number of queues 1 when we are using pc-5.1.
> +
> +Now, how was this done in reality?  The following commit shows how it
> +was done.
> +
> +commit 9445e1e15e66c19e42bea942ba810db28052cd05
> +Author: Stefan Hajnoczi <stefanha@redhat.com>
> +Date:   Tue Aug 18 15:33:47 2020 +0100
> +
> +    virtio-blk-pci: default num_queues to -smp N
> +
> +The relevant parts for migration are: ::
> +
> +    @@ -1281,7 +1284,8 @@ static Property virtio_blk_properties[] = {
> +     #endif
> +         DEFINE_PROP_BIT("request-merging", VirtIOBlock, conf.request_merging, 0,
> +                         true),
> +    -    DEFINE_PROP_UINT16("num-queues", VirtIOBlock, conf.num_queues, 1),
> +    +    DEFINE_PROP_UINT16("num-queues", VirtIOBlock, conf.num_queues,
> +    +                       VIRTIO_BLK_AUTO_NUM_QUEUES),
> +         DEFINE_PROP_UINT16("queue-size", VirtIOBlock, conf.queue_size, 256),
> +
> +It changes the default value of num_queues.  But it fishes it for old
> +machine types to have the right value: ::
> +
> +    @@ -31,6 +31,7 @@
> +     GlobalProperty hw_compat_5_1[] = {
> +         ...
> +    +    { "virtio-blk-device", "num-queues", "1"},
> +         ...
> +     };
> +
> +

This is definitely more detailed than I thought. :)

Acked-by: Peter Xu <peterx@redhat.com>

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v3 2/3] migration/docs: How to migrate when hosts have different features
  2023-05-15  8:32 ` [PATCH v3 2/3] migration/docs: How to migrate when hosts have different features Juan Quintela
@ 2023-05-16 23:51   ` Peter Xu
  2023-10-17 14:05     ` Juan Quintela
  2023-05-17 10:23   ` Michael S. Tsirkin
  1 sibling, 1 reply; 20+ messages in thread
From: Peter Xu @ 2023-05-16 23:51 UTC (permalink / raw)
  To: Juan Quintela
  Cc: qemu-devel, Michael S . Tsirkin, Leonardo Bras, Jiri Denemark,
	Avihai Horon, Fiona Ebner, Daniel P . Berrangé

On Mon, May 15, 2023 at 10:32:00AM +0200, Juan Quintela wrote:
> Sometimes devices have different features depending of things outside
> of qemu.  For instance the kernel.  Document how to handle that cases.
> 
> Signed-off-by: Juan Quintela <quintela@redhat.com>
> 
> ---
> 
> If you have some example to put here, I am all ears.  I guess that
> virtio-* with some features that are on qemu but not on all kernel
> would do the trick, but I am not a virtio guru myself.  Patches
> welcome.
> ---
>  docs/devel/migration.rst | 93 ++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 93 insertions(+)
> 
> diff --git a/docs/devel/migration.rst b/docs/devel/migration.rst
> index b4c4f3ec35..95e797ee60 100644
> --- a/docs/devel/migration.rst
> +++ b/docs/devel/migration.rst
> @@ -357,6 +357,99 @@ machine types to have the right value: ::
>           ...
>       };
>  
> +A device with diferent features on both sides
> +---------------------------------------------
> +
> +Let's assume that we are using the same QEMU binary on both sides,
> +just to make the things easier.  But we have a device that has
> +different features on both sides of the migration.  That can be
> +because the devices are different, because the kernel driver of both
> +devices have different features, whatever.
> +
> +How can we get this to work with migration.  The way to do that is
> +"theoretically" easy.  You have to get the features that the device
> +has in the source of the migration.  The features that the device has
> +on the target of the migration, you get the intersection of the
> +features of both sides, and that is the way that you should launch
> +qemu.
> +
> +Notice that this is not completely related to qemu.  The most
> +important thing here is that this should be handle by the managing
> +application that launches qemu.  If qemu is configured correctly, the
> +migration will suceeed.
> +
> +Once that we have defined that, doing this is complicated.  Almost all
> +devices are bad at being able to be launched with only some features
> +enabled.  With one big exception: cpus.
> +
> +You can read the documentation for QEMU x86 cpu models here:
> +
> +https://qemu-project.gitlab.io/qemu/system/qemu-cpu-models.html
> +
> +See when they talk about migration they recommend that one chooses the
> +newest cpu model that is supported for all cpus.
> +
> +Let's say that we have:
> +
> +Host A:
> +
> +Device X has the feature Y
> +
> +Host B:
> +
> +Device X has not the feature Y
> +
> +If we try to migrate without any care from host A to host B, it will
> +fail because when migration tries to load the feature Y on
> +destination, it will find that the hardware is not there.
> +
> +Doing this would be the equivalent of doing with cpus:
> +
> +Host A:
> +
> +$ qemu-system-x86_64 -cpu host
> +
> +Host B:
> +
> +$ qemu-system-x86_64 -cpu host
> +
> +When both hosts have different cpu features this is waranteed to fail.
> +Especially if Host B has less features than host A.  If host A has
> +less features than host B, sometimes it works.  Important word of last
> +sentence is "sometimes".
> +
> +So, forgetting about cpu models and continuing with the -cpu host
> +example, let's see that the differences of the cpus is that Host A and
> +B have the following features:
> +
> +Features:   'pcid'  'stibp' 'taa-no'
> +Host A:        X       X
> +Host B:                        X
> +
> +And we want to migrate between them, the way configure both qemu cpu
> +will be:
> +
> +Host A:
> +
> +$ qemu-system-x86_64 -cpu host,pcid=off,stibp=off
> +
> +Host B:
> +
> +$ qemu-system-x86_64 -cpu host,taa-no=off

Since we're using cpu as example, shall we at least mention at the end that
we don't suggest using -cpu host if migration is needed?

> +
> +And you would be able to migrate between them.  It is responsability
> +of the management application or of the user to make sure that the
> +configuration is correct.  QEMU don't know how to look at this kind of
> +features in general.
> +
> +Other devices have worse control about individual features.  If they
> +want to be able to migrate between hosts that show different features,
> +the device needs a way to configure which ones it is going to use.
> +
> +In this section we have considered that we are using the same QEMU
> +binary in both sides of the migration.  If we use different QEMU
> +versions process, then we need to have into account all other
> +differences and the examples become even more complicated.

Mostly good to me.  What I worry is how much help this will bring to
developers - I'd assume developers working on these will be aware of this.
But I guess it's always good to have any documentation than nothing.

Acked-by: Peter Xu <peterx@redhat.com>

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v3 3/3] migration/doc: We broke backwards compatibility
  2023-05-15  8:32 ` [PATCH v3 3/3] migration/doc: We broke backwards compatibility Juan Quintela
@ 2023-05-17  0:03   ` Peter Xu
  2023-10-17 14:18     ` Juan Quintela
  2023-05-17  7:09   ` Fiona Ebner
  2023-05-17 10:20   ` Michael S. Tsirkin
  2 siblings, 1 reply; 20+ messages in thread
From: Peter Xu @ 2023-05-17  0:03 UTC (permalink / raw)
  To: Juan Quintela
  Cc: qemu-devel, Michael S . Tsirkin, Leonardo Bras, Jiri Denemark,
	Avihai Horon, Fiona Ebner, Daniel P . Berrangé

On Mon, May 15, 2023 at 10:32:01AM +0200, Juan Quintela wrote:
> When we detect that we have broken backwards compantibility in a
> released version, we can't do anything for that version.  But once we
> fix that bug on the next released version, we can "mitigate" that
> problem when migrating to new versions to give a way out of that
> machine until it does a hard reboot.
> 
> Signed-off-by: Juan Quintela <quintela@redhat.com>
> ---
>  docs/devel/migration.rst | 194 +++++++++++++++++++++++++++++++++++++++
>  1 file changed, 194 insertions(+)
> 
> diff --git a/docs/devel/migration.rst b/docs/devel/migration.rst
> index 95e797ee60..97b6f48474 100644
> --- a/docs/devel/migration.rst
> +++ b/docs/devel/migration.rst
> @@ -451,6 +451,200 @@ binary in both sides of the migration.  If we use different QEMU
>  versions process, then we need to have into account all other
>  differences and the examples become even more complicated.
>  
> +How to mitigate when we have a backward compatibility error
> +-----------------------------------------------------------
> +
> +We broke migration for old machine types continously during

continuously

> +development.  But as soon as we find that there is a problem, we fix
> +it.  The problem is what happens when we detect after we have done a
> +release that something has gone wrong.
> +
> +Let see how it worked with one example.
> +
> +After the release of qemu-8.0 we found a problem when doing migration
> +of the machine type pc-7.2.
> +
> +- $ qemu-7.2 -M pc-7.2  ->  qemu-7.2 -M pc-7.2
> +
> +  This migration works
> +
> +- $ qemu-8.0 -M pc-7.2  ->  qemu-8.0 -M pc-7.2
> +
> +  This migration works
> +
> +- $ qemu-8.0 -M pc-7.2  ->  qemu-7.2 -M pc-7.2
> +
> +  This migration fails
> +
> +- $ qemu-7.2 -M pc-7.2  ->  qemu-8.0 -M pc-7.2
> +
> +  This migration fails
> +
> +So clearly something fails when migration between qemu-7.2 and
> +qemu-8.0 with machine type pc-7.2.  The error messages, and git bisect
> +pointed to this commit.
> +
> +In qemu-8.0 we got this commit: ::
> +
> +    commit 9a6ef182c03eaa138bae553f0fbb5a123bef9a53
> +    Author: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> +    Date:   Thu Mar 2 13:37:03 2023 +0000
> +
> +        hw/pci/aer: Add missing routing for AER errors

Worst timing ever for him.. :(

The lesson is never break migration when the maintainer has any intention
to add some docs explaining backward compatibility.

> +
> +The relevant bits of the commit for our example are this ones:
> +
> +    --- a/hw/pci/pcie_aer.c
> +    +++ b/hw/pci/pcie_aer.c
> +    @@ -112,6 +112,10 @@ int pcie_aer_init(PCIDevice *dev,
> +
> +         pci_set_long(dev->w1cmask + offset + PCI_ERR_UNCOR_STATUS,
> +                      PCI_ERR_UNC_SUPPORTED);
> +    +    pci_set_long(dev->config + offset + PCI_ERR_UNCOR_MASK,
> +    +                 PCI_ERR_UNC_MASK_DEFAULT);
> +    +    pci_set_long(dev->wmask + offset + PCI_ERR_UNCOR_MASK,
> +    +                 PCI_ERR_UNC_SUPPORTED);
> +
> +         pci_set_long(dev->config + offset + PCI_ERR_UNCOR_SEVER,
> +                     PCI_ERR_UNC_SEVERITY_DEFAULT);
> +
> +The patch changes how we configure pci space for AER.  But qemu fails
> +when the pci space configuration is different betwwen source and
> +destination.
> +
> +The following commit show how this got fixed:
> +
> +<put info of the commit once that it arrives upstream>
> +
> +The relevant parts of the fix are as follow:
> +
> +First, we create a new property for the device to be able to configure
> +the old behaviour or the new behaviour. ::
> +
> +    diff --git a/hw/pci/pci.c b/hw/pci/pci.c
> +    index 8a87ccc8b0..5153ad63d6 100644
> +    --- a/hw/pci/pci.c
> +    +++ b/hw/pci/pci.c
> +    @@ -79,6 +79,8 @@ static Property pci_props[] = {
> +         DEFINE_PROP_STRING("failover_pair_id", PCIDevice,
> +                            failover_pair_id),
> +         DEFINE_PROP_UINT32("acpi-index",  PCIDevice, acpi_index, 0),
> +    +    DEFINE_PROP_BIT("x-pcie-err-unc-mask", PCIDevice, cap_present,
> +    +                    QEMU_PCIE_ERR_UNC_MASK_BITNR, true),
> +         DEFINE_PROP_END_OF_LIST()
> +     };
> +
> +Notice that we enable te feature for new machine types.

the

> +
> +Now we see how the fix is done.  This is going to depend on what kind
> +of breakage happens, but in this case it is quite simple. ::
> +
> +    diff --git a/hw/pci/pcie_aer.c b/hw/pci/pcie_aer.c
> +    index 103667c368..374d593ead 100644
> +    --- a/hw/pci/pcie_aer.c
> +    +++ b/hw/pci/pcie_aer.c
> +    @@ -112,10 +112,13 @@ int pcie_aer_init(PCIDevice *dev, uint8_t cap_ver,
> +    uint16_t offset,
> +
> +         pci_set_long(dev->w1cmask + offset + PCI_ERR_UNCOR_STATUS,
> +                      PCI_ERR_UNC_SUPPORTED);
> +    -    pci_set_long(dev->config + offset + PCI_ERR_UNCOR_MASK,
> +    -                 PCI_ERR_UNC_MASK_DEFAULT);
> +    -    pci_set_long(dev->wmask + offset + PCI_ERR_UNCOR_MASK,
> +    -                 PCI_ERR_UNC_SUPPORTED);
> +    +
> +    +    if (dev->cap_present & QEMU_PCIE_ERR_UNC_MASK) {
> +    +        pci_set_long(dev->config + offset + PCI_ERR_UNCOR_MASK,
> +    +                     PCI_ERR_UNC_MASK_DEFAULT);
> +    +        pci_set_long(dev->wmask + offset + PCI_ERR_UNCOR_MASK,
> +    +                     PCI_ERR_UNC_SUPPORTED);
> +    +    }
> +
> +         pci_set_long(dev->config + offset + PCI_ERR_UNCOR_SEVER,
> +                      PCI_ERR_UNC_SEVERITY_DEFAULT);
> +
> +I.e. If the property bit is enabled, we configure it as we did for
> +qemu-8.0.  If the property bit is not set, we configure it as it was in 7.2.
> +
> +And now, everything that is missing is disable the feature for old

disabling

> +machine types: ::
> +
> +    diff --git a/hw/core/machine.c b/hw/core/machine.c
> +    index 47a34841a5..07f763eb2e 100644
> +    --- a/hw/core/machine.c
> +    +++ b/hw/core/machine.c
> +    @@ -48,6 +48,7 @@ GlobalProperty hw_compat_7_2[] = {
> +         { "e1000e", "migrate-timadj", "off" },
> +         { "virtio-mem", "x-early-migration", "false" },
> +         { "migration", "x-preempt-pre-7-2", "true" },
> +    +    { TYPE_PCI_DEVICE, "x-pcie-err-unc-mask", "off" },
> +     };
> +     const size_t hw_compat_7_2_len = G_N_ELEMENTS(hw_compat_7_2);
> +
> +And now, when qemu-8.0.1 is released with this fix, all combinations
> +are going to work as supposed.
> +
> +- $ qemu-7.2 -M pc-7.2  ->  qemu-7.2 -M pc-7.2 (works)
> +- $ qemu-8.0.1 -M pc-7.2  ->  qemu-8.0.1 -M pc-7.2 (works)
> +- $ qemu-8.0.1 -M pc-7.2  ->  qemu-7.2 -M pc-7.2 (works)
> +- $ qemu-7.2 -M pc-7.2  ->  qemu-8.0.1 -M pc-7.2 (works)
> +
> +So the normality has been restaured and everything is ok, no?
> +
> +Not really, now our matrix is much bigger.  We started with the easy
> +cases, migration from the same version to the same version always
> +works:
> +
> +- $ qemu-7.2 -M pc-7.2  ->  qemu-7.2 -M pc-7.2
> +- $ qemu-8.0 -M pc-7.2  ->  qemu-8.0 -M pc-7.2
> +- $ qemu-8.0.1 -M pc-7.2  ->  qemu-8.0.1 -M pc-7.2
> +
> +Now the interesting ones.  When the QEMU processes versions are
> +different.  For the 1st set, their fail and we can do nothing, both
> +versions are relased and we can't change anything.
> +
> +- $ qemu-7.2 -M pc-7.2  ->  qemu-8.0 -M pc-7.2
> +- $ qemu-8.0 -M pc-7.2  ->  qemu-7.2 -M pc-7.2
> +
> +This two are the ones that work. The whole point of making the
> +change in qemu-8.0.1 release was to fix this issue:
> +
> +- $ qemu-7.2 -M pc-7.2  ->  qemu-8.0.1 -M pc-7.2
> +- $ qemu-8.0.1 -M pc-7.2  ->  qemu-7.2 -M pc-7.2
> +
> +But now we found that qemu-8.0 neither can migrate to qemu-7.2 not
> +qemu-8.0.1.
> +
> +- $ qemu-8.0 -M pc-7.2  ->  qemu-8.0.1 -M pc-7.2
> +- $ qemu-8.0.1 -M pc-7.2  ->  qemu-8.0 -M pc-7.2
> +
> +So, if we start a pc-7.2 machine in qemu-8.0 we can't migrate it to
> +anything except to qemu-8.0.
> +
> +Can we do better?
> +
> +Yeap.  If we know that we are gonig to do this migration:

IIUC the other thing one should do is always keep their QEMU binaries
uptodate.  E.g., anyone seriously using 8.0 released QEMU should always
consider to do timely upgrade to e.g. 8.0.1 so this issue can also be
avoided (by dropping 8.0 directly).

> +
> +- $ qemu-8.0 -M pc-7.2  ->  qemu-8.0.1 -M pc-7.2
> +
> +We can launche the appropiate devices with

launch

> +
> +--device...,x-pci-e-err-unc-mask=on
> +
> +And now we can receive a migration from 8.0.  And from now on, we can
> +do that migration to new machine types if we remember to enable that
> +property for pc-7.2.  Notice that we need to remember, it is not
> +enough to know that the source of the migration is qemu-8.0.  Think of this example:

(wrap)

> +
> +$ qemu-8.0 -M pc-7.2 -> qemu-8.0.1 -M pc-7.2 -> qemu-8.2 -M pc-7.2
> +
> +In the second migration, the source is not qemu-8.0, but we still have
> +that "problem" and have that property enabled.  Notice that we need to
> +continue having this mark/property until we have this machine
> +rebooted.  But it is not a normal reboot (that don't reload qemu) we
> +need the mapchine to poweroff/poweron on a fixed qemu.  And from now
> +on we can use the proper real machine.
> +

The 8.0.1 breaking migration to 8.0 is a very important point to mention
indeed.

Acked-by: Peter Xu <peterx@redhat.com>

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v3 3/3] migration/doc: We broke backwards compatibility
  2023-05-15  8:32 ` [PATCH v3 3/3] migration/doc: We broke backwards compatibility Juan Quintela
  2023-05-17  0:03   ` Peter Xu
@ 2023-05-17  7:09   ` Fiona Ebner
  2023-10-23 11:09     ` Juan Quintela
  2023-05-17 10:20   ` Michael S. Tsirkin
  2 siblings, 1 reply; 20+ messages in thread
From: Fiona Ebner @ 2023-05-17  7:09 UTC (permalink / raw)
  To: Juan Quintela, qemu-devel
  Cc: Michael S . Tsirkin, Leonardo Bras, Peter Xu, Jiri Denemark,
	Avihai Horon, Daniel P . Berrangé

Am 15.05.23 um 10:32 schrieb Juan Quintela:
> When we detect that we have broken backwards compantibility in a

compatibility

(...)

> +
> +In qemu-8.0 we got this commit: ::
> +
> +    commit 9a6ef182c03eaa138bae553f0fbb5a123bef9a53
> +    Author: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> +    Date:   Thu Mar 2 13:37:03 2023 +0000
> +
> +        hw/pci/aer: Add missing routing for AER errors
> +
> +The relevant bits of the commit for our example are this ones:
> +
> +    --- a/hw/pci/pcie_aer.c
> +    +++ b/hw/pci/pcie_aer.c
> +    @@ -112,6 +112,10 @@ int pcie_aer_init(PCIDevice *dev,
> +
> +         pci_set_long(dev->w1cmask + offset + PCI_ERR_UNCOR_STATUS,
> +                      PCI_ERR_UNC_SUPPORTED);
> +    +    pci_set_long(dev->config + offset + PCI_ERR_UNCOR_MASK,
> +    +                 PCI_ERR_UNC_MASK_DEFAULT);
> +    +    pci_set_long(dev->wmask + offset + PCI_ERR_UNCOR_MASK,
> +    +                 PCI_ERR_UNC_SUPPORTED);
> +
> +         pci_set_long(dev->config + offset + PCI_ERR_UNCOR_SEVER,
> +                     PCI_ERR_UNC_SEVERITY_DEFAULT);
> +

These changes are not part of commit
9a6ef182c0 ("hw/pci/aer: Add missing routing for AER errors")
but rather the one before it, namely
010746ae1d ("hw/pci/aer: Implement PCI_ERR_UNCOR_MASK register")

> +The patch changes how we configure pci space for AER.  But qemu fails

Should QEMU and PCI be capitalized in the text parts?

> +when the pci space configuration is different betwwen source and

between

> +destination.
> +
> +The following commit show how this got fixed:

shows

(...)

> +
> +So the normality has been restaured and everything is ok, no?

restored

> +
> +Not really, now our matrix is much bigger.  We started with the easy
> +cases, migration from the same version to the same version always
> +works:
> +
> +- $ qemu-7.2 -M pc-7.2  ->  qemu-7.2 -M pc-7.2
> +- $ qemu-8.0 -M pc-7.2  ->  qemu-8.0 -M pc-7.2
> +- $ qemu-8.0.1 -M pc-7.2  ->  qemu-8.0.1 -M pc-7.2
> +
> +Now the interesting ones.  When the QEMU processes versions are
> +different.  For the 1st set, their fail and we can do nothing, both
> +versions are relased and we can't change anything.

released

> +
> +- $ qemu-7.2 -M pc-7.2  ->  qemu-8.0 -M pc-7.2
> +- $ qemu-8.0 -M pc-7.2  ->  qemu-7.2 -M pc-7.2
> +
> +This two are the ones that work. The whole point of making the
> +change in qemu-8.0.1 release was to fix this issue:
> +
> +- $ qemu-7.2 -M pc-7.2  ->  qemu-8.0.1 -M pc-7.2
> +- $ qemu-8.0.1 -M pc-7.2  ->  qemu-7.2 -M pc-7.2
> +
> +But now we found that qemu-8.0 neither can migrate to qemu-7.2 not
> +qemu-8.0.1.
> +
> +- $ qemu-8.0 -M pc-7.2  ->  qemu-8.0.1 -M pc-7.2
> +- $ qemu-8.0.1 -M pc-7.2  ->  qemu-8.0 -M pc-7.2
> +
> +So, if we start a pc-7.2 machine in qemu-8.0 we can't migrate it to
> +anything except to qemu-8.0.
> +
> +Can we do better?
> +
> +Yeap.  If we know that we are gonig to do this migration:

going

> +
> +- $ qemu-8.0 -M pc-7.2  ->  qemu-8.0.1 -M pc-7.2
> +
> +We can launche the appropiate devices with

"launch" was already pointed out by Peter, but there's also "appropriate"

> +
> +--device...,x-pci-e-err-unc-mask=on
> +
> +And now we can receive a migration from 8.0.  And from now on, we can
> +do that migration to new machine types if we remember to enable that
> +property for pc-7.2.  Notice that we need to remember, it is not
> +enough to know that the source of the migration is qemu-8.0.  Think of this example:
> +
> +$ qemu-8.0 -M pc-7.2 -> qemu-8.0.1 -M pc-7.2 -> qemu-8.2 -M pc-7.2
> +
> +In the second migration, the source is not qemu-8.0, but we still have
> +that "problem" and have that property enabled.  Notice that we need to
> +continue having this mark/property until we have this machine
> +rebooted.  But it is not a normal reboot (that don't reload qemu) we
> +need the mapchine to poweroff/poweron on a fixed qemu.  And from now

machine

Best Regards,
Fiona



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v3 3/3] migration/doc: We broke backwards compatibility
  2023-05-15  8:32 ` [PATCH v3 3/3] migration/doc: We broke backwards compatibility Juan Quintela
  2023-05-17  0:03   ` Peter Xu
  2023-05-17  7:09   ` Fiona Ebner
@ 2023-05-17 10:20   ` Michael S. Tsirkin
  2023-05-17 11:43     ` Juan Quintela
  2 siblings, 1 reply; 20+ messages in thread
From: Michael S. Tsirkin @ 2023-05-17 10:20 UTC (permalink / raw)
  To: Juan Quintela
  Cc: qemu-devel, Leonardo Bras, Peter Xu, Jiri Denemark, Avihai Horon,
	Fiona Ebner, Daniel P . Berrangé

On Mon, May 15, 2023 at 10:32:01AM +0200, Juan Quintela wrote:
> When we detect that we have broken backwards compantibility in a
> released version, we can't do anything for that version.  But once we
> fix that bug on the next released version, we can "mitigate" that
> problem when migrating to new versions to give a way out of that
> machine until it does a hard reboot.
> 
> Signed-off-by: Juan Quintela <quintela@redhat.com>
> ---
>  docs/devel/migration.rst | 194 +++++++++++++++++++++++++++++++++++++++
>  1 file changed, 194 insertions(+)
> 
> diff --git a/docs/devel/migration.rst b/docs/devel/migration.rst
> index 95e797ee60..97b6f48474 100644
> --- a/docs/devel/migration.rst
> +++ b/docs/devel/migration.rst
> @@ -451,6 +451,200 @@ binary in both sides of the migration.  If we use different QEMU
>  versions process, then we need to have into account all other
>  differences and the examples become even more complicated.
>  
> +How to mitigate when we have a backward compatibility error
> +-----------------------------------------------------------
> +
> +We broke migration for old machine types continously during
> +development.  But as soon as we find that there is a problem, we fix
> +it.  The problem is what happens when we detect after we have done a
> +release that something has gone wrong.
> +
> +Let see how it worked with one example.
> +
> +After the release of qemu-8.0 we found a problem when doing migration
> +of the machine type pc-7.2.
> +
> +- $ qemu-7.2 -M pc-7.2  ->  qemu-7.2 -M pc-7.2
> +
> +  This migration works
> +
> +- $ qemu-8.0 -M pc-7.2  ->  qemu-8.0 -M pc-7.2
> +
> +  This migration works
> +
> +- $ qemu-8.0 -M pc-7.2  ->  qemu-7.2 -M pc-7.2
> +
> +  This migration fails
> +
> +- $ qemu-7.2 -M pc-7.2  ->  qemu-8.0 -M pc-7.2
> +
> +  This migration fails
> +
> +So clearly something fails when migration between qemu-7.2 and
> +qemu-8.0 with machine type pc-7.2.  The error messages, and git bisect
> +pointed to this commit.
> +
> +In qemu-8.0 we got this commit: ::
> +
> +    commit 9a6ef182c03eaa138bae553f0fbb5a123bef9a53
> +    Author: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> +    Date:   Thu Mar 2 13:37:03 2023 +0000
> +
> +        hw/pci/aer: Add missing routing for AER errors
> +
> +The relevant bits of the commit for our example are this ones:
> +
> +    --- a/hw/pci/pcie_aer.c
> +    +++ b/hw/pci/pcie_aer.c
> +    @@ -112,6 +112,10 @@ int pcie_aer_init(PCIDevice *dev,
> +
> +         pci_set_long(dev->w1cmask + offset + PCI_ERR_UNCOR_STATUS,
> +                      PCI_ERR_UNC_SUPPORTED);
> +    +    pci_set_long(dev->config + offset + PCI_ERR_UNCOR_MASK,
> +    +                 PCI_ERR_UNC_MASK_DEFAULT);
> +    +    pci_set_long(dev->wmask + offset + PCI_ERR_UNCOR_MASK,
> +    +                 PCI_ERR_UNC_SUPPORTED);
> +
> +         pci_set_long(dev->config + offset + PCI_ERR_UNCOR_SEVER,
> +                     PCI_ERR_UNC_SEVERITY_DEFAULT);
> +
> +The patch changes how we configure pci space for AER.  But qemu fails
> +when the pci space configuration is different betwwen source and
> +destination.
> +
> +The following commit show how this got fixed:
> +
> +<put info of the commit once that it arrives upstream>
> +
> +The relevant parts of the fix are as follow:
> +
> +First, we create a new property for the device to be able to configure
> +the old behaviour or the new behaviour. ::
> +
> +    diff --git a/hw/pci/pci.c b/hw/pci/pci.c
> +    index 8a87ccc8b0..5153ad63d6 100644
> +    --- a/hw/pci/pci.c
> +    +++ b/hw/pci/pci.c
> +    @@ -79,6 +79,8 @@ static Property pci_props[] = {
> +         DEFINE_PROP_STRING("failover_pair_id", PCIDevice,
> +                            failover_pair_id),
> +         DEFINE_PROP_UINT32("acpi-index",  PCIDevice, acpi_index, 0),
> +    +    DEFINE_PROP_BIT("x-pcie-err-unc-mask", PCIDevice, cap_present,
> +    +                    QEMU_PCIE_ERR_UNC_MASK_BITNR, true),
> +         DEFINE_PROP_END_OF_LIST()
> +     };
> +
> +Notice that we enable te feature for new machine types.
> +
> +Now we see how the fix is done.  This is going to depend on what kind
> +of breakage happens, but in this case it is quite simple. ::
> +
> +    diff --git a/hw/pci/pcie_aer.c b/hw/pci/pcie_aer.c
> +    index 103667c368..374d593ead 100644
> +    --- a/hw/pci/pcie_aer.c
> +    +++ b/hw/pci/pcie_aer.c
> +    @@ -112,10 +112,13 @@ int pcie_aer_init(PCIDevice *dev, uint8_t cap_ver,
> +    uint16_t offset,
> +
> +         pci_set_long(dev->w1cmask + offset + PCI_ERR_UNCOR_STATUS,
> +                      PCI_ERR_UNC_SUPPORTED);
> +    -    pci_set_long(dev->config + offset + PCI_ERR_UNCOR_MASK,
> +    -                 PCI_ERR_UNC_MASK_DEFAULT);
> +    -    pci_set_long(dev->wmask + offset + PCI_ERR_UNCOR_MASK,
> +    -                 PCI_ERR_UNC_SUPPORTED);
> +    +
> +    +    if (dev->cap_present & QEMU_PCIE_ERR_UNC_MASK) {
> +    +        pci_set_long(dev->config + offset + PCI_ERR_UNCOR_MASK,
> +    +                     PCI_ERR_UNC_MASK_DEFAULT);
> +    +        pci_set_long(dev->wmask + offset + PCI_ERR_UNCOR_MASK,
> +    +                     PCI_ERR_UNC_SUPPORTED);
> +    +    }
> +
> +         pci_set_long(dev->config + offset + PCI_ERR_UNCOR_SEVER,
> +                      PCI_ERR_UNC_SEVERITY_DEFAULT);
> +
> +I.e. If the property bit is enabled, we configure it as we did for
> +qemu-8.0.  If the property bit is not set, we configure it as it was in 7.2.
> +
> +And now, everything that is missing is disable the feature for old
> +machine types: ::
> +
> +    diff --git a/hw/core/machine.c b/hw/core/machine.c
> +    index 47a34841a5..07f763eb2e 100644
> +    --- a/hw/core/machine.c
> +    +++ b/hw/core/machine.c
> +    @@ -48,6 +48,7 @@ GlobalProperty hw_compat_7_2[] = {
> +         { "e1000e", "migrate-timadj", "off" },
> +         { "virtio-mem", "x-early-migration", "false" },
> +         { "migration", "x-preempt-pre-7-2", "true" },
> +    +    { TYPE_PCI_DEVICE, "x-pcie-err-unc-mask", "off" },
> +     };
> +     const size_t hw_compat_7_2_len = G_N_ELEMENTS(hw_compat_7_2);
> +
> +And now, when qemu-8.0.1 is released with this fix, all combinations
> +are going to work as supposed.
> +
> +- $ qemu-7.2 -M pc-7.2  ->  qemu-7.2 -M pc-7.2 (works)
> +- $ qemu-8.0.1 -M pc-7.2  ->  qemu-8.0.1 -M pc-7.2 (works)
> +- $ qemu-8.0.1 -M pc-7.2  ->  qemu-7.2 -M pc-7.2 (works)
> +- $ qemu-7.2 -M pc-7.2  ->  qemu-8.0.1 -M pc-7.2 (works)
> +
> +So the normality has been restaured and everything is ok, no?
> +
> +Not really, now our matrix is much bigger.  We started with the easy
> +cases, migration from the same version to the same version always
> +works:
> +
> +- $ qemu-7.2 -M pc-7.2  ->  qemu-7.2 -M pc-7.2
> +- $ qemu-8.0 -M pc-7.2  ->  qemu-8.0 -M pc-7.2
> +- $ qemu-8.0.1 -M pc-7.2  ->  qemu-8.0.1 -M pc-7.2
> +
> +Now the interesting ones.  When the QEMU processes versions are
> +different.  For the 1st set, their fail and we can do nothing, both
> +versions are relased and we can't change anything.
> +
> +- $ qemu-7.2 -M pc-7.2  ->  qemu-8.0 -M pc-7.2
> +- $ qemu-8.0 -M pc-7.2  ->  qemu-7.2 -M pc-7.2
> +
> +This two are the ones that work. The whole point of making the
> +change in qemu-8.0.1 release was to fix this issue:
> +
> +- $ qemu-7.2 -M pc-7.2  ->  qemu-8.0.1 -M pc-7.2
> +- $ qemu-8.0.1 -M pc-7.2  ->  qemu-7.2 -M pc-7.2
> +
> +But now we found that qemu-8.0 neither can migrate to qemu-7.2 not
> +qemu-8.0.1.
> +
> +- $ qemu-8.0 -M pc-7.2  ->  qemu-8.0.1 -M pc-7.2
> +- $ qemu-8.0.1 -M pc-7.2  ->  qemu-8.0 -M pc-7.2
> +
> +So, if we start a pc-7.2 machine in qemu-8.0 we can't migrate it to
> +anything except to qemu-8.0.
> +
> +Can we do better?
> +
> +Yeap.  If we know that we are gonig to do this migration:
> +
> +- $ qemu-8.0 -M pc-7.2  ->  qemu-8.0.1 -M pc-7.2
> +
> +We can launche the appropiate devices with
> +
> +--device...,x-pci-e-err-unc-mask=on
> +
> +And now we can receive a migration from 8.0.  And from now on, we can
> +do that migration to new machine types if we remember to enable that
> +property for pc-7.2.  Notice that we need to remember, it is not
> +enough to know that the source of the migration is qemu-8.0.  Think of this example:
> +
> +$ qemu-8.0 -M pc-7.2 -> qemu-8.0.1 -M pc-7.2 -> qemu-8.2 -M pc-7.2
> +
> +In the second migration, the source is not qemu-8.0, but we still have
> +that "problem" and have that property enabled.  Notice that we need to
> +continue having this mark/property until we have this machine
> +rebooted.  But it is not a normal reboot (that don't reload qemu) we
> +need the mapchine to poweroff/poweron on a fixed qemu.  And from now
> +on we can use the proper real machine.
> +
>  VMState
>  -------

Can we release this list of things that need to be configured
somewhere? Maybe in a sane format that libvirt can parse?



> -- 
> 2.40.1



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v3 2/3] migration/docs: How to migrate when hosts have different features
  2023-05-15  8:32 ` [PATCH v3 2/3] migration/docs: How to migrate when hosts have different features Juan Quintela
  2023-05-16 23:51   ` Peter Xu
@ 2023-05-17 10:23   ` Michael S. Tsirkin
  2023-10-17 14:11     ` Juan Quintela
  1 sibling, 1 reply; 20+ messages in thread
From: Michael S. Tsirkin @ 2023-05-17 10:23 UTC (permalink / raw)
  To: Juan Quintela
  Cc: qemu-devel, Leonardo Bras, Peter Xu, Jiri Denemark, Avihai Horon,
	Fiona Ebner, Daniel P . Berrangé

On Mon, May 15, 2023 at 10:32:00AM +0200, Juan Quintela wrote:
> Sometimes devices have different features depending of things outside
> of qemu.  For instance the kernel.  Document how to handle that cases.
> 
> Signed-off-by: Juan Quintela <quintela@redhat.com>

yes, e.g. vhost features are exactly like this.


> ---
> 
> If you have some example to put here, I am all ears.  I guess that
> virtio-* with some features that are on qemu but not on all kernel
> would do the trick, but I am not a virtio guru myself.  Patches
> welcome.
> ---
>  docs/devel/migration.rst | 93 ++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 93 insertions(+)
> 
> diff --git a/docs/devel/migration.rst b/docs/devel/migration.rst
> index b4c4f3ec35..95e797ee60 100644
> --- a/docs/devel/migration.rst
> +++ b/docs/devel/migration.rst
> @@ -357,6 +357,99 @@ machine types to have the right value: ::
>           ...
>       };
>  
> +A device with diferent features on both sides
> +---------------------------------------------
> +
> +Let's assume that we are using the same QEMU binary on both sides,
> +just to make the things easier.  But we have a device that has
> +different features on both sides of the migration.  That can be
> +because the devices are different, because the kernel driver of both
> +devices have different features, whatever.
> +
> +How can we get this to work with migration.  The way to do that is
> +"theoretically" easy.  You have to get the features that the device
> +has in the source of the migration.  The features that the device has
> +on the target of the migration, you get the intersection of the
> +features of both sides, and that is the way that you should launch
> +qemu.
> +
> +Notice that this is not completely related to qemu.  The most
> +important thing here is that this should be handle by the managing
> +application that launches qemu.  If qemu is configured correctly, the
> +migration will suceeed.
> +
> +Once that we have defined that, doing this is complicated.  Almost all
> +devices are bad at being able to be launched with only some features
> +enabled.  With one big exception: cpus.
> +
> +You can read the documentation for QEMU x86 cpu models here:
> +
> +https://qemu-project.gitlab.io/qemu/system/qemu-cpu-models.html
> +
> +See when they talk about migration they recommend that one chooses the
> +newest cpu model that is supported for all cpus.
> +
> +Let's say that we have:
> +
> +Host A:
> +
> +Device X has the feature Y
> +
> +Host B:
> +
> +Device X has not the feature Y
> +
> +If we try to migrate without any care from host A to host B, it will
> +fail because when migration tries to load the feature Y on
> +destination, it will find that the hardware is not there.
> +
> +Doing this would be the equivalent of doing with cpus:
> +
> +Host A:
> +
> +$ qemu-system-x86_64 -cpu host
> +
> +Host B:
> +
> +$ qemu-system-x86_64 -cpu host
> +
> +When both hosts have different cpu features this is waranteed to fail.
> +Especially if Host B has less features than host A.  If host A has
> +less features than host B, sometimes it works.  Important word of last
> +sentence is "sometimes".
> +
> +So, forgetting about cpu models and continuing with the -cpu host
> +example, let's see that the differences of the cpus is that Host A and
> +B have the following features:
> +
> +Features:   'pcid'  'stibp' 'taa-no'
> +Host A:        X       X
> +Host B:                        X
> +
> +And we want to migrate between them, the way configure both qemu cpu
> +will be:
> +
> +Host A:
> +
> +$ qemu-system-x86_64 -cpu host,pcid=off,stibp=off
> +
> +Host B:
> +
> +$ qemu-system-x86_64 -cpu host,taa-no=off
> +
> +And you would be able to migrate between them.  It is responsability
> +of the management application or of the user to make sure that the
> +configuration is correct.  QEMU don't know how to look at this kind of
> +features in general.
> +
> +Other devices have worse control about individual features.  If they
> +want to be able to migrate between hosts that show different features,
> +the device needs a way to configure which ones it is going to use.
> +
> +In this section we have considered that we are using the same QEMU
> +binary in both sides of the migration.  If we use different QEMU
> +versions process, then we need to have into account all other
> +differences and the examples become even more complicated.

How do people know what to do?
How about a tool that will help you get data from hosts
and then tell you how to configure qemu to make them
compatible?


>  VMState
>  -------
> -- 
> 2.40.1



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v3 3/3] migration/doc: We broke backwards compatibility
  2023-05-17 10:20   ` Michael S. Tsirkin
@ 2023-05-17 11:43     ` Juan Quintela
  2023-05-17 11:47       ` Michael S. Tsirkin
  2023-05-31 13:23       ` Michael S. Tsirkin
  0 siblings, 2 replies; 20+ messages in thread
From: Juan Quintela @ 2023-05-17 11:43 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: qemu-devel, Leonardo Bras, Peter Xu, Jiri Denemark, Avihai Horon,
	Fiona Ebner, Daniel P . Berrangé

"Michael S. Tsirkin" <mst@redhat.com> wrote:
> On Mon, May 15, 2023 at 10:32:01AM +0200, Juan Quintela wrote:
>> When we detect that we have broken backwards compantibility in a
>> released version, we can't do anything for that version.  But once we
>> fix that bug on the next released version, we can "mitigate" that
>> problem when migrating to new versions to give a way out of that
>> machine until it does a hard reboot.
>> 
>> Signed-off-by: Juan Quintela <quintela@redhat.com>
>> ---
>>  docs/devel/migration.rst | 194 +++++++++++++++++++++++++++++++++++++++
>>  1 file changed, 194 insertions(+)
>> 
>> diff --git a/docs/devel/migration.rst b/docs/devel/migration.rst
>> index 95e797ee60..97b6f48474 100644
>> --- a/docs/devel/migration.rst
>> +++ b/docs/devel/migration.rst
>> @@ -451,6 +451,200 @@ binary in both sides of the migration.  If we use different QEMU
>>  versions process, then we need to have into account all other
>>  differences and the examples become even more complicated.
>>  
>> +How to mitigate when we have a backward compatibility error
>> +-----------------------------------------------------------
>> +
>> +We broke migration for old machine types continously during
>> +development.  But as soon as we find that there is a problem, we fix
>> +it.  The problem is what happens when we detect after we have done a
>> +release that something has gone wrong.
>> +
>> +Let see how it worked with one example.
>> +
>> +After the release of qemu-8.0 we found a problem when doing migration
>> +of the machine type pc-7.2.
>> +
>> +- $ qemu-7.2 -M pc-7.2  ->  qemu-7.2 -M pc-7.2
>> +
>> +  This migration works
>> +
>> +- $ qemu-8.0 -M pc-7.2  ->  qemu-8.0 -M pc-7.2
>> +
>> +  This migration works
>> +
>> +- $ qemu-8.0 -M pc-7.2  ->  qemu-7.2 -M pc-7.2
>> +
>> +  This migration fails
>> +
>> +- $ qemu-7.2 -M pc-7.2  ->  qemu-8.0 -M pc-7.2
>> +
>> +  This migration fails
>> +
>> +So clearly something fails when migration between qemu-7.2 and
>> +qemu-8.0 with machine type pc-7.2.  The error messages, and git bisect
>> +pointed to this commit.
>> +
>> +In qemu-8.0 we got this commit: ::
>> +
>> +    commit 9a6ef182c03eaa138bae553f0fbb5a123bef9a53
>> +    Author: Jonathan Cameron <Jonathan.Cameron@huawei.com>
>> +    Date:   Thu Mar 2 13:37:03 2023 +0000
>> +
>> +        hw/pci/aer: Add missing routing for AER errors
>> +
>> +The relevant bits of the commit for our example are this ones:
>> +
>> +    --- a/hw/pci/pcie_aer.c
>> +    +++ b/hw/pci/pcie_aer.c
>> +    @@ -112,6 +112,10 @@ int pcie_aer_init(PCIDevice *dev,
>> +
>> +         pci_set_long(dev->w1cmask + offset + PCI_ERR_UNCOR_STATUS,
>> +                      PCI_ERR_UNC_SUPPORTED);
>> +    +    pci_set_long(dev->config + offset + PCI_ERR_UNCOR_MASK,
>> +    +                 PCI_ERR_UNC_MASK_DEFAULT);
>> +    +    pci_set_long(dev->wmask + offset + PCI_ERR_UNCOR_MASK,
>> +    +                 PCI_ERR_UNC_SUPPORTED);
>> +
>> +         pci_set_long(dev->config + offset + PCI_ERR_UNCOR_SEVER,
>> +                     PCI_ERR_UNC_SEVERITY_DEFAULT);
>> +
>> +The patch changes how we configure pci space for AER.  But qemu fails
>> +when the pci space configuration is different betwwen source and
>> +destination.
>> +
>> +The following commit show how this got fixed:
>> +
>> +<put info of the commit once that it arrives upstream>
>> +
>> +The relevant parts of the fix are as follow:
>> +
>> +First, we create a new property for the device to be able to configure
>> +the old behaviour or the new behaviour. ::
>> +
>> +    diff --git a/hw/pci/pci.c b/hw/pci/pci.c
>> +    index 8a87ccc8b0..5153ad63d6 100644
>> +    --- a/hw/pci/pci.c
>> +    +++ b/hw/pci/pci.c
>> +    @@ -79,6 +79,8 @@ static Property pci_props[] = {
>> +         DEFINE_PROP_STRING("failover_pair_id", PCIDevice,
>> +                            failover_pair_id),
>> +         DEFINE_PROP_UINT32("acpi-index",  PCIDevice, acpi_index, 0),
>> +    +    DEFINE_PROP_BIT("x-pcie-err-unc-mask", PCIDevice, cap_present,
>> +    +                    QEMU_PCIE_ERR_UNC_MASK_BITNR, true),
>> +         DEFINE_PROP_END_OF_LIST()
>> +     };
>> +
>> +Notice that we enable te feature for new machine types.
>> +
>> +Now we see how the fix is done.  This is going to depend on what kind
>> +of breakage happens, but in this case it is quite simple. ::
>> +
>> +    diff --git a/hw/pci/pcie_aer.c b/hw/pci/pcie_aer.c
>> +    index 103667c368..374d593ead 100644
>> +    --- a/hw/pci/pcie_aer.c
>> +    +++ b/hw/pci/pcie_aer.c
>> +    @@ -112,10 +112,13 @@ int pcie_aer_init(PCIDevice *dev, uint8_t cap_ver,
>> +    uint16_t offset,
>> +
>> +         pci_set_long(dev->w1cmask + offset + PCI_ERR_UNCOR_STATUS,
>> +                      PCI_ERR_UNC_SUPPORTED);
>> +    -    pci_set_long(dev->config + offset + PCI_ERR_UNCOR_MASK,
>> +    -                 PCI_ERR_UNC_MASK_DEFAULT);
>> +    -    pci_set_long(dev->wmask + offset + PCI_ERR_UNCOR_MASK,
>> +    -                 PCI_ERR_UNC_SUPPORTED);
>> +    +
>> +    +    if (dev->cap_present & QEMU_PCIE_ERR_UNC_MASK) {
>> +    +        pci_set_long(dev->config + offset + PCI_ERR_UNCOR_MASK,
>> +    +                     PCI_ERR_UNC_MASK_DEFAULT);
>> +    +        pci_set_long(dev->wmask + offset + PCI_ERR_UNCOR_MASK,
>> +    +                     PCI_ERR_UNC_SUPPORTED);
>> +    +    }
>> +
>> +         pci_set_long(dev->config + offset + PCI_ERR_UNCOR_SEVER,
>> +                      PCI_ERR_UNC_SEVERITY_DEFAULT);
>> +
>> +I.e. If the property bit is enabled, we configure it as we did for
>> +qemu-8.0.  If the property bit is not set, we configure it as it was in 7.2.
>> +
>> +And now, everything that is missing is disable the feature for old
>> +machine types: ::
>> +
>> +    diff --git a/hw/core/machine.c b/hw/core/machine.c
>> +    index 47a34841a5..07f763eb2e 100644
>> +    --- a/hw/core/machine.c
>> +    +++ b/hw/core/machine.c
>> +    @@ -48,6 +48,7 @@ GlobalProperty hw_compat_7_2[] = {
>> +         { "e1000e", "migrate-timadj", "off" },
>> +         { "virtio-mem", "x-early-migration", "false" },
>> +         { "migration", "x-preempt-pre-7-2", "true" },
>> +    +    { TYPE_PCI_DEVICE, "x-pcie-err-unc-mask", "off" },
>> +     };
>> +     const size_t hw_compat_7_2_len = G_N_ELEMENTS(hw_compat_7_2);
>> +
>> +And now, when qemu-8.0.1 is released with this fix, all combinations
>> +are going to work as supposed.
>> +
>> +- $ qemu-7.2 -M pc-7.2  ->  qemu-7.2 -M pc-7.2 (works)
>> +- $ qemu-8.0.1 -M pc-7.2  ->  qemu-8.0.1 -M pc-7.2 (works)
>> +- $ qemu-8.0.1 -M pc-7.2  ->  qemu-7.2 -M pc-7.2 (works)
>> +- $ qemu-7.2 -M pc-7.2  ->  qemu-8.0.1 -M pc-7.2 (works)
>> +
>> +So the normality has been restaured and everything is ok, no?
>> +
>> +Not really, now our matrix is much bigger.  We started with the easy
>> +cases, migration from the same version to the same version always
>> +works:
>> +
>> +- $ qemu-7.2 -M pc-7.2  ->  qemu-7.2 -M pc-7.2
>> +- $ qemu-8.0 -M pc-7.2  ->  qemu-8.0 -M pc-7.2
>> +- $ qemu-8.0.1 -M pc-7.2  ->  qemu-8.0.1 -M pc-7.2
>> +
>> +Now the interesting ones.  When the QEMU processes versions are
>> +different.  For the 1st set, their fail and we can do nothing, both
>> +versions are relased and we can't change anything.
>> +
>> +- $ qemu-7.2 -M pc-7.2  ->  qemu-8.0 -M pc-7.2
>> +- $ qemu-8.0 -M pc-7.2  ->  qemu-7.2 -M pc-7.2
>> +
>> +This two are the ones that work. The whole point of making the
>> +change in qemu-8.0.1 release was to fix this issue:
>> +
>> +- $ qemu-7.2 -M pc-7.2  ->  qemu-8.0.1 -M pc-7.2
>> +- $ qemu-8.0.1 -M pc-7.2  ->  qemu-7.2 -M pc-7.2
>> +
>> +But now we found that qemu-8.0 neither can migrate to qemu-7.2 not
>> +qemu-8.0.1.
>> +
>> +- $ qemu-8.0 -M pc-7.2  ->  qemu-8.0.1 -M pc-7.2
>> +- $ qemu-8.0.1 -M pc-7.2  ->  qemu-8.0 -M pc-7.2
>> +
>> +So, if we start a pc-7.2 machine in qemu-8.0 we can't migrate it to
>> +anything except to qemu-8.0.
>> +
>> +Can we do better?
>> +
>> +Yeap.  If we know that we are gonig to do this migration:
>> +
>> +- $ qemu-8.0 -M pc-7.2  ->  qemu-8.0.1 -M pc-7.2
>> +
>> +We can launche the appropiate devices with
>> +
>> +--device...,x-pci-e-err-unc-mask=on
>> +
>> +And now we can receive a migration from 8.0.  And from now on, we can
>> +do that migration to new machine types if we remember to enable that
>> +property for pc-7.2.  Notice that we need to remember, it is not
>> +enough to know that the source of the migration is qemu-8.0.  Think of this example:
>> +
>> +$ qemu-8.0 -M pc-7.2 -> qemu-8.0.1 -M pc-7.2 -> qemu-8.2 -M pc-7.2
>> +
>> +In the second migration, the source is not qemu-8.0, but we still have
>> +that "problem" and have that property enabled.  Notice that we need to
>> +continue having this mark/property until we have this machine
>> +rebooted.  But it is not a normal reboot (that don't reload qemu) we
>> +need the mapchine to poweroff/poweron on a fixed qemu.  And from now
>> +on we can use the proper real machine.
>> +
>>  VMState
>>  -------
>
> Can we release this list of things that need to be configured
> somewhere? Maybe in a sane format that libvirt can parse?

What do you mean here?

the x-pci-e-err-unc-mask=on?

The most similar thing that we have is pc/machine.c:hw_compat_x_y.

But that also include the things where we have done the things right.

Daniel, Jiri, what would you need and what would be useful to you?

Later, Juan.



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v3 3/3] migration/doc: We broke backwards compatibility
  2023-05-17 11:43     ` Juan Quintela
@ 2023-05-17 11:47       ` Michael S. Tsirkin
  2023-05-31 13:23       ` Michael S. Tsirkin
  1 sibling, 0 replies; 20+ messages in thread
From: Michael S. Tsirkin @ 2023-05-17 11:47 UTC (permalink / raw)
  To: Juan Quintela
  Cc: qemu-devel, Leonardo Bras, Peter Xu, Jiri Denemark, Avihai Horon,
	Fiona Ebner, Daniel P . Berrangé

On Wed, May 17, 2023 at 01:43:46PM +0200, Juan Quintela wrote:
> "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > On Mon, May 15, 2023 at 10:32:01AM +0200, Juan Quintela wrote:
> >> When we detect that we have broken backwards compantibility in a
> >> released version, we can't do anything for that version.  But once we
> >> fix that bug on the next released version, we can "mitigate" that
> >> problem when migrating to new versions to give a way out of that
> >> machine until it does a hard reboot.
> >> 
> >> Signed-off-by: Juan Quintela <quintela@redhat.com>
> >> ---
> >>  docs/devel/migration.rst | 194 +++++++++++++++++++++++++++++++++++++++
> >>  1 file changed, 194 insertions(+)
> >> 
> >> diff --git a/docs/devel/migration.rst b/docs/devel/migration.rst
> >> index 95e797ee60..97b6f48474 100644
> >> --- a/docs/devel/migration.rst
> >> +++ b/docs/devel/migration.rst
> >> @@ -451,6 +451,200 @@ binary in both sides of the migration.  If we use different QEMU
> >>  versions process, then we need to have into account all other
> >>  differences and the examples become even more complicated.
> >>  
> >> +How to mitigate when we have a backward compatibility error
> >> +-----------------------------------------------------------
> >> +
> >> +We broke migration for old machine types continously during
> >> +development.  But as soon as we find that there is a problem, we fix
> >> +it.  The problem is what happens when we detect after we have done a
> >> +release that something has gone wrong.
> >> +
> >> +Let see how it worked with one example.
> >> +
> >> +After the release of qemu-8.0 we found a problem when doing migration
> >> +of the machine type pc-7.2.
> >> +
> >> +- $ qemu-7.2 -M pc-7.2  ->  qemu-7.2 -M pc-7.2
> >> +
> >> +  This migration works
> >> +
> >> +- $ qemu-8.0 -M pc-7.2  ->  qemu-8.0 -M pc-7.2
> >> +
> >> +  This migration works
> >> +
> >> +- $ qemu-8.0 -M pc-7.2  ->  qemu-7.2 -M pc-7.2
> >> +
> >> +  This migration fails
> >> +
> >> +- $ qemu-7.2 -M pc-7.2  ->  qemu-8.0 -M pc-7.2
> >> +
> >> +  This migration fails
> >> +
> >> +So clearly something fails when migration between qemu-7.2 and
> >> +qemu-8.0 with machine type pc-7.2.  The error messages, and git bisect
> >> +pointed to this commit.
> >> +
> >> +In qemu-8.0 we got this commit: ::
> >> +
> >> +    commit 9a6ef182c03eaa138bae553f0fbb5a123bef9a53
> >> +    Author: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> >> +    Date:   Thu Mar 2 13:37:03 2023 +0000
> >> +
> >> +        hw/pci/aer: Add missing routing for AER errors
> >> +
> >> +The relevant bits of the commit for our example are this ones:
> >> +
> >> +    --- a/hw/pci/pcie_aer.c
> >> +    +++ b/hw/pci/pcie_aer.c
> >> +    @@ -112,6 +112,10 @@ int pcie_aer_init(PCIDevice *dev,
> >> +
> >> +         pci_set_long(dev->w1cmask + offset + PCI_ERR_UNCOR_STATUS,
> >> +                      PCI_ERR_UNC_SUPPORTED);
> >> +    +    pci_set_long(dev->config + offset + PCI_ERR_UNCOR_MASK,
> >> +    +                 PCI_ERR_UNC_MASK_DEFAULT);
> >> +    +    pci_set_long(dev->wmask + offset + PCI_ERR_UNCOR_MASK,
> >> +    +                 PCI_ERR_UNC_SUPPORTED);
> >> +
> >> +         pci_set_long(dev->config + offset + PCI_ERR_UNCOR_SEVER,
> >> +                     PCI_ERR_UNC_SEVERITY_DEFAULT);
> >> +
> >> +The patch changes how we configure pci space for AER.  But qemu fails
> >> +when the pci space configuration is different betwwen source and
> >> +destination.
> >> +
> >> +The following commit show how this got fixed:
> >> +
> >> +<put info of the commit once that it arrives upstream>
> >> +
> >> +The relevant parts of the fix are as follow:
> >> +
> >> +First, we create a new property for the device to be able to configure
> >> +the old behaviour or the new behaviour. ::
> >> +
> >> +    diff --git a/hw/pci/pci.c b/hw/pci/pci.c
> >> +    index 8a87ccc8b0..5153ad63d6 100644
> >> +    --- a/hw/pci/pci.c
> >> +    +++ b/hw/pci/pci.c
> >> +    @@ -79,6 +79,8 @@ static Property pci_props[] = {
> >> +         DEFINE_PROP_STRING("failover_pair_id", PCIDevice,
> >> +                            failover_pair_id),
> >> +         DEFINE_PROP_UINT32("acpi-index",  PCIDevice, acpi_index, 0),
> >> +    +    DEFINE_PROP_BIT("x-pcie-err-unc-mask", PCIDevice, cap_present,
> >> +    +                    QEMU_PCIE_ERR_UNC_MASK_BITNR, true),
> >> +         DEFINE_PROP_END_OF_LIST()
> >> +     };
> >> +
> >> +Notice that we enable te feature for new machine types.
> >> +
> >> +Now we see how the fix is done.  This is going to depend on what kind
> >> +of breakage happens, but in this case it is quite simple. ::
> >> +
> >> +    diff --git a/hw/pci/pcie_aer.c b/hw/pci/pcie_aer.c
> >> +    index 103667c368..374d593ead 100644
> >> +    --- a/hw/pci/pcie_aer.c
> >> +    +++ b/hw/pci/pcie_aer.c
> >> +    @@ -112,10 +112,13 @@ int pcie_aer_init(PCIDevice *dev, uint8_t cap_ver,
> >> +    uint16_t offset,
> >> +
> >> +         pci_set_long(dev->w1cmask + offset + PCI_ERR_UNCOR_STATUS,
> >> +                      PCI_ERR_UNC_SUPPORTED);
> >> +    -    pci_set_long(dev->config + offset + PCI_ERR_UNCOR_MASK,
> >> +    -                 PCI_ERR_UNC_MASK_DEFAULT);
> >> +    -    pci_set_long(dev->wmask + offset + PCI_ERR_UNCOR_MASK,
> >> +    -                 PCI_ERR_UNC_SUPPORTED);
> >> +    +
> >> +    +    if (dev->cap_present & QEMU_PCIE_ERR_UNC_MASK) {
> >> +    +        pci_set_long(dev->config + offset + PCI_ERR_UNCOR_MASK,
> >> +    +                     PCI_ERR_UNC_MASK_DEFAULT);
> >> +    +        pci_set_long(dev->wmask + offset + PCI_ERR_UNCOR_MASK,
> >> +    +                     PCI_ERR_UNC_SUPPORTED);
> >> +    +    }
> >> +
> >> +         pci_set_long(dev->config + offset + PCI_ERR_UNCOR_SEVER,
> >> +                      PCI_ERR_UNC_SEVERITY_DEFAULT);
> >> +
> >> +I.e. If the property bit is enabled, we configure it as we did for
> >> +qemu-8.0.  If the property bit is not set, we configure it as it was in 7.2.
> >> +
> >> +And now, everything that is missing is disable the feature for old
> >> +machine types: ::
> >> +
> >> +    diff --git a/hw/core/machine.c b/hw/core/machine.c
> >> +    index 47a34841a5..07f763eb2e 100644
> >> +    --- a/hw/core/machine.c
> >> +    +++ b/hw/core/machine.c
> >> +    @@ -48,6 +48,7 @@ GlobalProperty hw_compat_7_2[] = {
> >> +         { "e1000e", "migrate-timadj", "off" },
> >> +         { "virtio-mem", "x-early-migration", "false" },
> >> +         { "migration", "x-preempt-pre-7-2", "true" },
> >> +    +    { TYPE_PCI_DEVICE, "x-pcie-err-unc-mask", "off" },
> >> +     };
> >> +     const size_t hw_compat_7_2_len = G_N_ELEMENTS(hw_compat_7_2);
> >> +
> >> +And now, when qemu-8.0.1 is released with this fix, all combinations
> >> +are going to work as supposed.
> >> +
> >> +- $ qemu-7.2 -M pc-7.2  ->  qemu-7.2 -M pc-7.2 (works)
> >> +- $ qemu-8.0.1 -M pc-7.2  ->  qemu-8.0.1 -M pc-7.2 (works)
> >> +- $ qemu-8.0.1 -M pc-7.2  ->  qemu-7.2 -M pc-7.2 (works)
> >> +- $ qemu-7.2 -M pc-7.2  ->  qemu-8.0.1 -M pc-7.2 (works)
> >> +
> >> +So the normality has been restaured and everything is ok, no?
> >> +
> >> +Not really, now our matrix is much bigger.  We started with the easy
> >> +cases, migration from the same version to the same version always
> >> +works:
> >> +
> >> +- $ qemu-7.2 -M pc-7.2  ->  qemu-7.2 -M pc-7.2
> >> +- $ qemu-8.0 -M pc-7.2  ->  qemu-8.0 -M pc-7.2
> >> +- $ qemu-8.0.1 -M pc-7.2  ->  qemu-8.0.1 -M pc-7.2
> >> +
> >> +Now the interesting ones.  When the QEMU processes versions are
> >> +different.  For the 1st set, their fail and we can do nothing, both
> >> +versions are relased and we can't change anything.
> >> +
> >> +- $ qemu-7.2 -M pc-7.2  ->  qemu-8.0 -M pc-7.2
> >> +- $ qemu-8.0 -M pc-7.2  ->  qemu-7.2 -M pc-7.2
> >> +
> >> +This two are the ones that work. The whole point of making the
> >> +change in qemu-8.0.1 release was to fix this issue:
> >> +
> >> +- $ qemu-7.2 -M pc-7.2  ->  qemu-8.0.1 -M pc-7.2
> >> +- $ qemu-8.0.1 -M pc-7.2  ->  qemu-7.2 -M pc-7.2
> >> +
> >> +But now we found that qemu-8.0 neither can migrate to qemu-7.2 not
> >> +qemu-8.0.1.
> >> +
> >> +- $ qemu-8.0 -M pc-7.2  ->  qemu-8.0.1 -M pc-7.2
> >> +- $ qemu-8.0.1 -M pc-7.2  ->  qemu-8.0 -M pc-7.2
> >> +
> >> +So, if we start a pc-7.2 machine in qemu-8.0 we can't migrate it to
> >> +anything except to qemu-8.0.
> >> +
> >> +Can we do better?
> >> +
> >> +Yeap.  If we know that we are gonig to do this migration:
> >> +
> >> +- $ qemu-8.0 -M pc-7.2  ->  qemu-8.0.1 -M pc-7.2
> >> +
> >> +We can launche the appropiate devices with
> >> +
> >> +--device...,x-pci-e-err-unc-mask=on
> >> +
> >> +And now we can receive a migration from 8.0.  And from now on, we can
> >> +do that migration to new machine types if we remember to enable that
> >> +property for pc-7.2.  Notice that we need to remember, it is not
> >> +enough to know that the source of the migration is qemu-8.0.  Think of this example:
> >> +
> >> +$ qemu-8.0 -M pc-7.2 -> qemu-8.0.1 -M pc-7.2 -> qemu-8.2 -M pc-7.2
> >> +
> >> +In the second migration, the source is not qemu-8.0, but we still have
> >> +that "problem" and have that property enabled.  Notice that we need to
> >> +continue having this mark/property until we have this machine
> >> +rebooted.  But it is not a normal reboot (that don't reload qemu) we
> >> +need the mapchine to poweroff/poweron on a fixed qemu.  And from now
> >> +on we can use the proper real machine.
> >> +
> >>  VMState
> >>  -------
> >
> > Can we release this list of things that need to be configured
> > somewhere? Maybe in a sane format that libvirt can parse?
> 
> What do you mean here?
> 
> the x-pci-e-err-unc-mask=on?

Yes.

> The most similar thing that we have is pc/machine.c:hw_compat_x_y.
> 
> But that also include the things where we have done the things right.

Exactly.

> Daniel, Jiri, what would you need and what would be useful to you?
> 
> Later, Juan.



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v3 1/3] migration: Add documentation for backwards compatiblity
  2023-05-16 23:39   ` Peter Xu
@ 2023-05-18  1:47     ` Xiaoyao Li
  2023-10-17 13:59       ` Juan Quintela
  2023-10-23 11:09     ` Juan Quintela
  1 sibling, 1 reply; 20+ messages in thread
From: Xiaoyao Li @ 2023-05-18  1:47 UTC (permalink / raw)
  To: Peter Xu, Juan Quintela
  Cc: qemu-devel, Michael S . Tsirkin, Leonardo Bras, Jiri Denemark,
	Avihai Horon, Fiona Ebner, Daniel P. Berrangé,
	Vladimir Sementsov-Ogievskiy

On 5/17/2023 7:39 AM, Peter Xu wrote:
> On Mon, May 15, 2023 at 10:31:59AM +0200, Juan Quintela wrote:
>> State what are the requeriments to get migration working between qemu
>> versions.  And once there explain how one is supposed to implement a
>> new feature/default value and not break migration.
>>
>> Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>
>> Message-Id: <20230511082701.12828-1-quintela@redhat.com>
>> Signed-off-by: Juan Quintela <quintela@redhat.com>
>> ---
>>   docs/devel/migration.rst | 216 +++++++++++++++++++++++++++++++++++++++
>>   1 file changed, 216 insertions(+)
>>
>> diff --git a/docs/devel/migration.rst b/docs/devel/migration.rst
>> index 6f65c23b47..b4c4f3ec35 100644
>> --- a/docs/devel/migration.rst
>> +++ b/docs/devel/migration.rst
>> @@ -142,6 +142,222 @@ General advice for device developers
>>     may be different on the destination.  This can result in the
>>     device state being loaded into the wrong device.
>>   
>> +How backwards compatibility works
>> +---------------------------------
>> +
>> +When we do migration, we have to QEMU process: the source and the
> 
> s/to/two/, s/process/processes/
> 
>> +target.  There are two cases, they are the same version or they are a
>> +different version.
> 
> s/a different version/different versions/
> 
>> +The easy case is when they are the same version.
>> +The difficult one is when they are different versions.
>> +
>> +There are two things that are different, but they have very similar
>> +names and sometimes get confused:
> 
> (space)
> 
>> +- QEMU version
>> +- machine version
> 
> It's normally called "machine type", so maybe use that?  Or just "machine
> version / machine type"?
> 
>> +
>> +Let's start with a practical example, we start with:
>> +
>> +- qemu-system-x86_64 (v5.2), from now on qemu-5.2.
>> +- qemu-system-x86_64 (v5.1), from now on qemu-5.1.
>> +
>> +Related to this are the "latest" machine types defined on each of
>> +them:
>> +
>> +- pc-q35-5.2 (newer one in qemu-5.2) from now on pc-5.2
>> +- pc-q35-5.1 (newer one in qemu-5.1) from now on pc-5.1
>> +
>> +First of all, migration is only supposed to work if you use the same
>> +machine type in both source and destination. The QEMU hardware
>> +configuration needs to be the same also on source and destination.
>> +Most aspects of the backend configuration can be changed at will,
>> +except for a few cases where the backend features influence frontend
>> +device feature exposure.  But that is not relevant for this section.
>> +
>> +I am going to list the number of combinations that we can have.  Let's
>> +start with the trivial ones, QEMU is the same on source and
>> +destination:
>> +
>> +1 - qemu-5.2 -M pc-5.2  -> migrates to -> qemu-5.2 -M pc-5.2
>> +
>> +  This is the latest QEMU with the latest machine type.
>> +  This have to work, and if it doesn't work it is a bug.
>> +
>> +2 - qemu-5.1 -M pc-5.1  -> migrates to -> qemu-5.1 -M pc-5.1
>> +
>> +  Exactly the same case than the previous one, but for 5.1.
>> +  Nothing to see here either.
>> +
>> +This are the easiest ones, we will not talk more about them in this
>> +section.
>> +
>> +Now we start with the more interesting cases.  Consider the case where
>> +we have the same QEMU version in both sides (qemu-5.2) but we are using

s/we are using/we are not

>> +the latest machine type for that version (pc-5.2) but one of an older
>> +QEMU version, in this case pc-5.1.



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v3 3/3] migration/doc: We broke backwards compatibility
  2023-05-17 11:43     ` Juan Quintela
  2023-05-17 11:47       ` Michael S. Tsirkin
@ 2023-05-31 13:23       ` Michael S. Tsirkin
  1 sibling, 0 replies; 20+ messages in thread
From: Michael S. Tsirkin @ 2023-05-31 13:23 UTC (permalink / raw)
  To: Juan Quintela
  Cc: qemu-devel, Leonardo Bras, Peter Xu, Jiri Denemark, Avihai Horon,
	Fiona Ebner, Daniel P . Berrangé,
	Eric Blake, Laine Stump

On Wed, May 17, 2023 at 01:43:46PM +0200, Juan Quintela wrote:
> "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > On Mon, May 15, 2023 at 10:32:01AM +0200, Juan Quintela wrote:
> >> When we detect that we have broken backwards compantibility in a
> >> released version, we can't do anything for that version.  But once we
> >> fix that bug on the next released version, we can "mitigate" that
> >> problem when migrating to new versions to give a way out of that
> >> machine until it does a hard reboot.
> >> 
> >> Signed-off-by: Juan Quintela <quintela@redhat.com>
> >> ---
> >>  docs/devel/migration.rst | 194 +++++++++++++++++++++++++++++++++++++++
> >>  1 file changed, 194 insertions(+)
> >> 
> >> diff --git a/docs/devel/migration.rst b/docs/devel/migration.rst
> >> index 95e797ee60..97b6f48474 100644
> >> --- a/docs/devel/migration.rst
> >> +++ b/docs/devel/migration.rst
> >> @@ -451,6 +451,200 @@ binary in both sides of the migration.  If we use different QEMU
> >>  versions process, then we need to have into account all other
> >>  differences and the examples become even more complicated.
> >>  
> >> +How to mitigate when we have a backward compatibility error
> >> +-----------------------------------------------------------
> >> +
> >> +We broke migration for old machine types continously during
> >> +development.  But as soon as we find that there is a problem, we fix
> >> +it.  The problem is what happens when we detect after we have done a
> >> +release that something has gone wrong.
> >> +
> >> +Let see how it worked with one example.
> >> +
> >> +After the release of qemu-8.0 we found a problem when doing migration
> >> +of the machine type pc-7.2.
> >> +
> >> +- $ qemu-7.2 -M pc-7.2  ->  qemu-7.2 -M pc-7.2
> >> +
> >> +  This migration works
> >> +
> >> +- $ qemu-8.0 -M pc-7.2  ->  qemu-8.0 -M pc-7.2
> >> +
> >> +  This migration works
> >> +
> >> +- $ qemu-8.0 -M pc-7.2  ->  qemu-7.2 -M pc-7.2
> >> +
> >> +  This migration fails
> >> +
> >> +- $ qemu-7.2 -M pc-7.2  ->  qemu-8.0 -M pc-7.2
> >> +
> >> +  This migration fails
> >> +
> >> +So clearly something fails when migration between qemu-7.2 and
> >> +qemu-8.0 with machine type pc-7.2.  The error messages, and git bisect
> >> +pointed to this commit.
> >> +
> >> +In qemu-8.0 we got this commit: ::
> >> +
> >> +    commit 9a6ef182c03eaa138bae553f0fbb5a123bef9a53
> >> +    Author: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> >> +    Date:   Thu Mar 2 13:37:03 2023 +0000
> >> +
> >> +        hw/pci/aer: Add missing routing for AER errors
> >> +
> >> +The relevant bits of the commit for our example are this ones:
> >> +
> >> +    --- a/hw/pci/pcie_aer.c
> >> +    +++ b/hw/pci/pcie_aer.c
> >> +    @@ -112,6 +112,10 @@ int pcie_aer_init(PCIDevice *dev,
> >> +
> >> +         pci_set_long(dev->w1cmask + offset + PCI_ERR_UNCOR_STATUS,
> >> +                      PCI_ERR_UNC_SUPPORTED);
> >> +    +    pci_set_long(dev->config + offset + PCI_ERR_UNCOR_MASK,
> >> +    +                 PCI_ERR_UNC_MASK_DEFAULT);
> >> +    +    pci_set_long(dev->wmask + offset + PCI_ERR_UNCOR_MASK,
> >> +    +                 PCI_ERR_UNC_SUPPORTED);
> >> +
> >> +         pci_set_long(dev->config + offset + PCI_ERR_UNCOR_SEVER,
> >> +                     PCI_ERR_UNC_SEVERITY_DEFAULT);
> >> +
> >> +The patch changes how we configure pci space for AER.  But qemu fails
> >> +when the pci space configuration is different betwwen source and
> >> +destination.
> >> +
> >> +The following commit show how this got fixed:
> >> +
> >> +<put info of the commit once that it arrives upstream>
> >> +
> >> +The relevant parts of the fix are as follow:
> >> +
> >> +First, we create a new property for the device to be able to configure
> >> +the old behaviour or the new behaviour. ::
> >> +
> >> +    diff --git a/hw/pci/pci.c b/hw/pci/pci.c
> >> +    index 8a87ccc8b0..5153ad63d6 100644
> >> +    --- a/hw/pci/pci.c
> >> +    +++ b/hw/pci/pci.c
> >> +    @@ -79,6 +79,8 @@ static Property pci_props[] = {
> >> +         DEFINE_PROP_STRING("failover_pair_id", PCIDevice,
> >> +                            failover_pair_id),
> >> +         DEFINE_PROP_UINT32("acpi-index",  PCIDevice, acpi_index, 0),
> >> +    +    DEFINE_PROP_BIT("x-pcie-err-unc-mask", PCIDevice, cap_present,
> >> +    +                    QEMU_PCIE_ERR_UNC_MASK_BITNR, true),
> >> +         DEFINE_PROP_END_OF_LIST()
> >> +     };
> >> +
> >> +Notice that we enable te feature for new machine types.
> >> +
> >> +Now we see how the fix is done.  This is going to depend on what kind
> >> +of breakage happens, but in this case it is quite simple. ::
> >> +
> >> +    diff --git a/hw/pci/pcie_aer.c b/hw/pci/pcie_aer.c
> >> +    index 103667c368..374d593ead 100644
> >> +    --- a/hw/pci/pcie_aer.c
> >> +    +++ b/hw/pci/pcie_aer.c
> >> +    @@ -112,10 +112,13 @@ int pcie_aer_init(PCIDevice *dev, uint8_t cap_ver,
> >> +    uint16_t offset,
> >> +
> >> +         pci_set_long(dev->w1cmask + offset + PCI_ERR_UNCOR_STATUS,
> >> +                      PCI_ERR_UNC_SUPPORTED);
> >> +    -    pci_set_long(dev->config + offset + PCI_ERR_UNCOR_MASK,
> >> +    -                 PCI_ERR_UNC_MASK_DEFAULT);
> >> +    -    pci_set_long(dev->wmask + offset + PCI_ERR_UNCOR_MASK,
> >> +    -                 PCI_ERR_UNC_SUPPORTED);
> >> +    +
> >> +    +    if (dev->cap_present & QEMU_PCIE_ERR_UNC_MASK) {
> >> +    +        pci_set_long(dev->config + offset + PCI_ERR_UNCOR_MASK,
> >> +    +                     PCI_ERR_UNC_MASK_DEFAULT);
> >> +    +        pci_set_long(dev->wmask + offset + PCI_ERR_UNCOR_MASK,
> >> +    +                     PCI_ERR_UNC_SUPPORTED);
> >> +    +    }
> >> +
> >> +         pci_set_long(dev->config + offset + PCI_ERR_UNCOR_SEVER,
> >> +                      PCI_ERR_UNC_SEVERITY_DEFAULT);
> >> +
> >> +I.e. If the property bit is enabled, we configure it as we did for
> >> +qemu-8.0.  If the property bit is not set, we configure it as it was in 7.2.
> >> +
> >> +And now, everything that is missing is disable the feature for old
> >> +machine types: ::
> >> +
> >> +    diff --git a/hw/core/machine.c b/hw/core/machine.c
> >> +    index 47a34841a5..07f763eb2e 100644
> >> +    --- a/hw/core/machine.c
> >> +    +++ b/hw/core/machine.c
> >> +    @@ -48,6 +48,7 @@ GlobalProperty hw_compat_7_2[] = {
> >> +         { "e1000e", "migrate-timadj", "off" },
> >> +         { "virtio-mem", "x-early-migration", "false" },
> >> +         { "migration", "x-preempt-pre-7-2", "true" },
> >> +    +    { TYPE_PCI_DEVICE, "x-pcie-err-unc-mask", "off" },
> >> +     };
> >> +     const size_t hw_compat_7_2_len = G_N_ELEMENTS(hw_compat_7_2);
> >> +
> >> +And now, when qemu-8.0.1 is released with this fix, all combinations
> >> +are going to work as supposed.
> >> +
> >> +- $ qemu-7.2 -M pc-7.2  ->  qemu-7.2 -M pc-7.2 (works)
> >> +- $ qemu-8.0.1 -M pc-7.2  ->  qemu-8.0.1 -M pc-7.2 (works)
> >> +- $ qemu-8.0.1 -M pc-7.2  ->  qemu-7.2 -M pc-7.2 (works)
> >> +- $ qemu-7.2 -M pc-7.2  ->  qemu-8.0.1 -M pc-7.2 (works)
> >> +
> >> +So the normality has been restaured and everything is ok, no?
> >> +
> >> +Not really, now our matrix is much bigger.  We started with the easy
> >> +cases, migration from the same version to the same version always
> >> +works:
> >> +
> >> +- $ qemu-7.2 -M pc-7.2  ->  qemu-7.2 -M pc-7.2
> >> +- $ qemu-8.0 -M pc-7.2  ->  qemu-8.0 -M pc-7.2
> >> +- $ qemu-8.0.1 -M pc-7.2  ->  qemu-8.0.1 -M pc-7.2
> >> +
> >> +Now the interesting ones.  When the QEMU processes versions are
> >> +different.  For the 1st set, their fail and we can do nothing, both
> >> +versions are relased and we can't change anything.
> >> +
> >> +- $ qemu-7.2 -M pc-7.2  ->  qemu-8.0 -M pc-7.2
> >> +- $ qemu-8.0 -M pc-7.2  ->  qemu-7.2 -M pc-7.2
> >> +
> >> +This two are the ones that work. The whole point of making the
> >> +change in qemu-8.0.1 release was to fix this issue:
> >> +
> >> +- $ qemu-7.2 -M pc-7.2  ->  qemu-8.0.1 -M pc-7.2
> >> +- $ qemu-8.0.1 -M pc-7.2  ->  qemu-7.2 -M pc-7.2
> >> +
> >> +But now we found that qemu-8.0 neither can migrate to qemu-7.2 not
> >> +qemu-8.0.1.
> >> +
> >> +- $ qemu-8.0 -M pc-7.2  ->  qemu-8.0.1 -M pc-7.2
> >> +- $ qemu-8.0.1 -M pc-7.2  ->  qemu-8.0 -M pc-7.2
> >> +
> >> +So, if we start a pc-7.2 machine in qemu-8.0 we can't migrate it to
> >> +anything except to qemu-8.0.
> >> +
> >> +Can we do better?
> >> +
> >> +Yeap.  If we know that we are gonig to do this migration:
> >> +
> >> +- $ qemu-8.0 -M pc-7.2  ->  qemu-8.0.1 -M pc-7.2
> >> +
> >> +We can launche the appropiate devices with
> >> +
> >> +--device...,x-pci-e-err-unc-mask=on
> >> +
> >> +And now we can receive a migration from 8.0.  And from now on, we can
> >> +do that migration to new machine types if we remember to enable that
> >> +property for pc-7.2.  Notice that we need to remember, it is not
> >> +enough to know that the source of the migration is qemu-8.0.  Think of this example:
> >> +
> >> +$ qemu-8.0 -M pc-7.2 -> qemu-8.0.1 -M pc-7.2 -> qemu-8.2 -M pc-7.2
> >> +
> >> +In the second migration, the source is not qemu-8.0, but we still have
> >> +that "problem" and have that property enabled.  Notice that we need to
> >> +continue having this mark/property until we have this machine
> >> +rebooted.  But it is not a normal reboot (that don't reload qemu) we
> >> +need the mapchine to poweroff/poweron on a fixed qemu.  And from now
> >> +on we can use the proper real machine.
> >> +
> >>  VMState
> >>  -------
> >
> > Can we release this list of things that need to be configured
> > somewhere? Maybe in a sane format that libvirt can parse?
> 
> What do you mean here?
> 
> the x-pci-e-err-unc-mask=on?
> 
> The most similar thing that we have is pc/machine.c:hw_compat_x_y.
> 
> But that also include the things where we have done the things right.
> 
> Daniel, Jiri, what would you need and what would be useful to you?
> 
> Later, Juan.

Any input from anyone?  Cc a couple more people.



-- 
MST



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v3 1/3] migration: Add documentation for backwards compatiblity
  2023-05-18  1:47     ` Xiaoyao Li
@ 2023-10-17 13:59       ` Juan Quintela
  0 siblings, 0 replies; 20+ messages in thread
From: Juan Quintela @ 2023-10-17 13:59 UTC (permalink / raw)
  To: Xiaoyao Li
  Cc: Peter Xu, qemu-devel, Michael S . Tsirkin, Leonardo Bras,
	Jiri Denemark, Avihai Horon, Fiona Ebner, Daniel P. Berrangé,
	Vladimir Sementsov-Ogievskiy

Xiaoyao Li <xiaoyao.li@intel.com> wrote:
> On 5/17/2023 7:39 AM, Peter Xu wrote:
>> On Mon, May 15, 2023 at 10:31:59AM +0200, Juan Quintela wrote:
>>> +Now we start with the more interesting cases.  Consider the case where
>>> +we have the same QEMU version in both sides (qemu-5.2) but we are using
>
> s/we are using/we are not
>
>>> +the latest machine type for that version (pc-5.2) but one of an older
>>> +QEMU version, in this case pc-5.1.

Thanks.



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v3 2/3] migration/docs: How to migrate when hosts have different features
  2023-05-16 23:51   ` Peter Xu
@ 2023-10-17 14:05     ` Juan Quintela
  0 siblings, 0 replies; 20+ messages in thread
From: Juan Quintela @ 2023-10-17 14:05 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, Michael S . Tsirkin, Leonardo Bras, Jiri Denemark,
	Avihai Horon, Fiona Ebner, Daniel P . Berrangé

Peter Xu <peterx@redhat.com> wrote:
> On Mon, May 15, 2023 at 10:32:00AM +0200, Juan Quintela wrote:
>> +$ qemu-system-x86_64 -cpu host,pcid=off,stibp=off
>> +
>> +Host B:
>> +
>> +$ qemu-system-x86_64 -cpu host,taa-no=off
>
> Since we're using cpu as example, shall we at least mention at the end that
> we don't suggest using -cpu host if migration is needed?

Added something like that.

>> +
>> +And you would be able to migrate between them.  It is responsability
>> +of the management application or of the user to make sure that the
>> +configuration is correct.  QEMU don't know how to look at this kind of
>> +features in general.
>> +
>> +Other devices have worse control about individual features.  If they
>> +want to be able to migrate between hosts that show different features,
>> +the device needs a way to configure which ones it is going to use.
>> +
>> +In this section we have considered that we are using the same QEMU
>> +binary in both sides of the migration.  If we use different QEMU
>> +versions process, then we need to have into account all other
>> +differences and the examples become even more complicated.
>
> Mostly good to me.  What I worry is how much help this will bring to
> developers - I'd assume developers working on these will be aware of this.
> But I guess it's always good to have any documentation than nothing.

I have two hopes here:

- when developer finds a problem with migration, they look there.

- having a document that I can point when one of this problems happens.
  What I do now is that I write an email, normally of worse quality and
  not finding a good example.

> Acked-by: Peter Xu <peterx@redhat.com>

Thanks, Juan.



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v3 2/3] migration/docs: How to migrate when hosts have different features
  2023-05-17 10:23   ` Michael S. Tsirkin
@ 2023-10-17 14:11     ` Juan Quintela
  0 siblings, 0 replies; 20+ messages in thread
From: Juan Quintela @ 2023-10-17 14:11 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: qemu-devel, Leonardo Bras, Peter Xu, Jiri Denemark, Avihai Horon,
	Fiona Ebner, Daniel P . Berrangé

"Michael S. Tsirkin" <mst@redhat.com> wrote:
> On Mon, May 15, 2023 at 10:32:00AM +0200, Juan Quintela wrote:
>> Sometimes devices have different features depending of things outside
>> of qemu.  For instance the kernel.  Document how to handle that cases.
>> 
>> Signed-off-by: Juan Quintela <quintela@redhat.com>
>
> yes, e.g. vhost features are exactly like this.

Yeap, I know that.  But not in enough detail to get a good example.

[...]

>> +In this section we have considered that we are using the same QEMU
>> +binary in both sides of the migration.  If we use different QEMU
>> +versions process, then we need to have into account all other
>> +differences and the examples become even more complicated.
>
> How do people know what to do?

They need to:
- know their hardware/device/driver.
- how they can do it in qemu.

I can help with the second, not with the 1st.

> How about a tool that will help you get data from hosts
> and then tell you how to configure qemu to make them
> compatible?

This is the holy gray of migration.  I would like to be able to create
machines from QMP.  That way, I can transport the configuration over the
migration channel, instead of hoping that it is the same.  Troubles so
far:

- we are very far away of being able to create machines with QMP (not
  migration related).

- we have properties that can be different on source and destination.
  For instance, the path to the file that implements a device can be
  different on both systems.

  We could add some keyword to the part of the configuration that can be
  different.  As said, we can say that destination/QMP can give us a new
  path for a block device.  But we need to be able to mark number of
  CPU's as required to be the same.

- For devices/state that can't be seen from inside qemu, I don't know
  have a good idea.  It can be related that I am not an expert on that
  type of devices.  Perhaps someone that knows more about they can give
  some insights.

Later, Juan.



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v3 3/3] migration/doc: We broke backwards compatibility
  2023-05-17  0:03   ` Peter Xu
@ 2023-10-17 14:18     ` Juan Quintela
  0 siblings, 0 replies; 20+ messages in thread
From: Juan Quintela @ 2023-10-17 14:18 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, Michael S . Tsirkin, Leonardo Bras, Jiri Denemark,
	Avihai Horon, Fiona Ebner, Daniel P . Berrangé

Peter Xu <peterx@redhat.com> wrote:
> On Mon, May 15, 2023 at 10:32:01AM +0200, Juan Quintela wrote:
>> When we detect that we have broken backwards compantibility in a
>> released version, we can't do anything for that version.  But once we
>> fix that bug on the next released version, we can "mitigate" that
>> problem when migrating to new versions to give a way out of that
>> machine until it does a hard reboot.
>> 
>> Signed-off-by: Juan Quintela <quintela@redhat.com>
>> ---
>>  docs/devel/migration.rst | 194 +++++++++++++++++++++++++++++++++++++++
>>  1 file changed, 194 insertions(+)
>> 
>> diff --git a/docs/devel/migration.rst b/docs/devel/migration.rst
>> index 95e797ee60..97b6f48474 100644
>> --- a/docs/devel/migration.rst
>> +++ b/docs/devel/migration.rst
>> @@ -451,6 +451,200 @@ binary in both sides of the migration.  If we use different QEMU
>>  versions process, then we need to have into account all other
>>  differences and the examples become even more complicated.
>>  
>> +How to mitigate when we have a backward compatibility error
>> +-----------------------------------------------------------
>> +
>> +We broke migration for old machine types continously during
>
> continuously

done.

>> +development.  But as soon as we find that there is a problem, we fix
>> +it.  The problem is what happens when we detect after we have done a
>> +release that something has gone wrong.
>> +
>> +Let see how it worked with one example.
>> +
>> +After the release of qemu-8.0 we found a problem when doing migration
>> +of the machine type pc-7.2.
>> +
>> +- $ qemu-7.2 -M pc-7.2  ->  qemu-7.2 -M pc-7.2
>> +
>> +  This migration works
>> +
>> +- $ qemu-8.0 -M pc-7.2  ->  qemu-8.0 -M pc-7.2
>> +
>> +  This migration works
>> +
>> +- $ qemu-8.0 -M pc-7.2  ->  qemu-7.2 -M pc-7.2
>> +
>> +  This migration fails
>> +
>> +- $ qemu-7.2 -M pc-7.2  ->  qemu-8.0 -M pc-7.2
>> +
>> +  This migration fails
>> +
>> +So clearly something fails when migration between qemu-7.2 and
>> +qemu-8.0 with machine type pc-7.2.  The error messages, and git bisect
>> +pointed to this commit.
>> +
>> +In qemu-8.0 we got this commit: ::
>> +
>> +    commit 9a6ef182c03eaa138bae553f0fbb5a123bef9a53
>> +    Author: Jonathan Cameron <Jonathan.Cameron@huawei.com>
>> +    Date:   Thu Mar 2 13:37:03 2023 +0000
>> +
>> +        hw/pci/aer: Add missing routing for AER errors
>
> Worst timing ever for him.. :(

haha

> The lesson is never break migration when the maintainer has any intention
> to add some docs explaining backward compatibility.

Nah.  Just don't ever break migration and you are safe O:-)

>> +Notice that we enable te feature for new machine types.
>
> the

Done.

>> +                      PCI_ERR_UNC_SEVERITY_DEFAULT);
>> +
>> +I.e. If the property bit is enabled, we configure it as we did for
>> +qemu-8.0.  If the property bit is not set, we configure it as it was in 7.2.
>> +
>> +And now, everything that is missing is disable the feature for old
>
> disabling

Done.

>> +Can we do better?
>> +
>> +Yeap.  If we know that we are gonig to do this migration:
>
> IIUC the other thing one should do is always keep their QEMU binaries
> uptodate.  E.g., anyone seriously using 8.0 released QEMU should always
> consider to do timely upgrade to e.g. 8.0.1 so this issue can also be
> avoided (by dropping 8.0 directly).

I think that advice is valid, but independently of migration.

>> +
>> +- $ qemu-8.0 -M pc-7.2  ->  qemu-8.0.1 -M pc-7.2
>> +
>> +We can launche the appropiate devices with
>
> launch
>
>> +
>> +--device...,x-pci-e-err-unc-mask=on
>> +
>> +And now we can receive a migration from 8.0.  And from now on, we can
>> +do that migration to new machine types if we remember to enable that
>> +property for pc-7.2.  Notice that we need to remember, it is not
>> +enough to know that the source of the migration is qemu-8.0.  Think of this example:
>
> (wrap)
>
>> +
>> +$ qemu-8.0 -M pc-7.2 -> qemu-8.0.1 -M pc-7.2 -> qemu-8.2 -M pc-7.2
>> +
>> +In the second migration, the source is not qemu-8.0, but we still have
>> +that "problem" and have that property enabled.  Notice that we need to
>> +continue having this mark/property until we have this machine
>> +rebooted.  But it is not a normal reboot (that don't reload qemu) we
>> +need the mapchine to poweroff/poweron on a fixed qemu.  And from now
>> +on we can use the proper real machine.
>> +
>
> The 8.0.1 breaking migration to 8.0 is a very important point to mention
> indeed.

I hope it was clear on the example.

> Acked-by: Peter Xu <peterx@redhat.com>

Thanks, Juan.



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v3 1/3] migration: Add documentation for backwards compatiblity
  2023-05-16 23:39   ` Peter Xu
  2023-05-18  1:47     ` Xiaoyao Li
@ 2023-10-23 11:09     ` Juan Quintela
  1 sibling, 0 replies; 20+ messages in thread
From: Juan Quintela @ 2023-10-23 11:09 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, Michael S . Tsirkin, Leonardo Bras, Jiri Denemark,
	Avihai Horon, Fiona Ebner, Daniel P . Berrangé,
	Vladimir Sementsov-Ogievskiy

Peter Xu <peterx@redhat.com> wrote:
> On Mon, May 15, 2023 at 10:31:59AM +0200, Juan Quintela wrote:

[..]

>> +When we do migration, we have to QEMU process: the source and the
>
> s/to/two/, s/process/processes/

done.

>> +target.  There are two cases, they are the same version or they are a
>> +different version.
>
> s/a different version/different versions/

done

>> +The easy case is when they are the same version.
>> +The difficult one is when they are different versions.
>> +
>> +There are two things that are different, but they have very similar
>> +names and sometimes get confused:
>
> (space)
>
>> +- QEMU version
>> +- machine version
>
> It's normally called "machine type", so maybe use that?  Or just "machine
> version / machine type"?

machine type version

>
> This is definitely more detailed than I thought. :)
>
> Acked-by: Peter Xu <peterx@redhat.com>

Thanks, Juan.



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v3 3/3] migration/doc: We broke backwards compatibility
  2023-05-17  7:09   ` Fiona Ebner
@ 2023-10-23 11:09     ` Juan Quintela
  0 siblings, 0 replies; 20+ messages in thread
From: Juan Quintela @ 2023-10-23 11:09 UTC (permalink / raw)
  To: Fiona Ebner
  Cc: qemu-devel, Michael S . Tsirkin, Leonardo Bras, Peter Xu,
	Jiri Denemark, Avihai Horon, Daniel P . Berrangé

Fiona Ebner <f.ebner@proxmox.com> wrote:
> Am 15.05.23 um 10:32 schrieb Juan Quintela:
>> When we detect that we have broken backwards compantibility in a
>
> compatibility

done

> (...)
>
>> +
>> +In qemu-8.0 we got this commit: ::
>> +
>> +    commit 9a6ef182c03eaa138bae553f0fbb5a123bef9a53
>> +    Author: Jonathan Cameron <Jonathan.Cameron@huawei.com>
>> +    Date:   Thu Mar 2 13:37:03 2023 +0000
>> +
>> +        hw/pci/aer: Add missing routing for AER errors
>> +
>> +The relevant bits of the commit for our example are this ones:
>> +
>> +    --- a/hw/pci/pcie_aer.c
>> +    +++ b/hw/pci/pcie_aer.c
>> +    @@ -112,6 +112,10 @@ int pcie_aer_init(PCIDevice *dev,
>> +
>> +         pci_set_long(dev->w1cmask + offset + PCI_ERR_UNCOR_STATUS,
>> +                      PCI_ERR_UNC_SUPPORTED);
>> +    +    pci_set_long(dev->config + offset + PCI_ERR_UNCOR_MASK,
>> +    +                 PCI_ERR_UNC_MASK_DEFAULT);
>> +    +    pci_set_long(dev->wmask + offset + PCI_ERR_UNCOR_MASK,
>> +    +                 PCI_ERR_UNC_SUPPORTED);
>> +
>> +         pci_set_long(dev->config + offset + PCI_ERR_UNCOR_SEVER,
>> +                     PCI_ERR_UNC_SEVERITY_DEFAULT);
>> +
>
> These changes are not part of commit
> 9a6ef182c0 ("hw/pci/aer: Add missing routing for AER errors")
> but rather the one before it, namely
> 010746ae1d ("hw/pci/aer: Implement PCI_ERR_UNCOR_MASK register")

grr, will fix that.


>> +The patch changes how we configure pci space for AER.  But qemu fails
>
> Should QEMU and PCI be capitalized in the text parts?

I think that I changed all 

>> +when the pci space configuration is different betwwen source and
>
> between

done.

>> +destination.
>> +
>> +The following commit show how this got fixed:
>
> shows

done

> (...)
>
>> +
>> +So the normality has been restaured and everything is ok, no?
>
> restored

done

>> +
>> +Not really, now our matrix is much bigger.  We started with the easy
>> +cases, migration from the same version to the same version always
>> +works:
>> +
>> +- $ qemu-7.2 -M pc-7.2  ->  qemu-7.2 -M pc-7.2
>> +- $ qemu-8.0 -M pc-7.2  ->  qemu-8.0 -M pc-7.2
>> +- $ qemu-8.0.1 -M pc-7.2  ->  qemu-8.0.1 -M pc-7.2
>> +
>> +Now the interesting ones.  When the QEMU processes versions are
>> +different.  For the 1st set, their fail and we can do nothing, both
>> +versions are relased and we can't change anything.
>
> released

done

>> +
>> +- $ qemu-7.2 -M pc-7.2  ->  qemu-8.0 -M pc-7.2
>> +- $ qemu-8.0 -M pc-7.2  ->  qemu-7.2 -M pc-7.2
>> +
>> +This two are the ones that work. The whole point of making the
>> +change in qemu-8.0.1 release was to fix this issue:
>> +
>> +- $ qemu-7.2 -M pc-7.2  ->  qemu-8.0.1 -M pc-7.2
>> +- $ qemu-8.0.1 -M pc-7.2  ->  qemu-7.2 -M pc-7.2
>> +
>> +But now we found that qemu-8.0 neither can migrate to qemu-7.2 not
>> +qemu-8.0.1.
>> +
>> +- $ qemu-8.0 -M pc-7.2  ->  qemu-8.0.1 -M pc-7.2
>> +- $ qemu-8.0.1 -M pc-7.2  ->  qemu-8.0 -M pc-7.2
>> +
>> +So, if we start a pc-7.2 machine in qemu-8.0 we can't migrate it to
>> +anything except to qemu-8.0.
>> +
>> +Can we do better?
>> +
>> +Yeap.  If we know that we are gonig to do this migration:
>
> going

done

>> +
>> +- $ qemu-8.0 -M pc-7.2  ->  qemu-8.0.1 -M pc-7.2
>> +
>> +We can launche the appropiate devices with
>
> "launch" was already pointed out by Peter, but there's also "appropriate"
done

>> +
>> +--device...,x-pci-e-err-unc-mask=on
>> +
>> +And now we can receive a migration from 8.0.  And from now on, we can
>> +do that migration to new machine types if we remember to enable that
>> +property for pc-7.2.  Notice that we need to remember, it is not
>> +enough to know that the source of the migration is qemu-8.0.  Think of this example:
>> +
>> +$ qemu-8.0 -M pc-7.2 -> qemu-8.0.1 -M pc-7.2 -> qemu-8.2 -M pc-7.2
>> +
>> +In the second migration, the source is not qemu-8.0, but we still have
>> +that "problem" and have that property enabled.  Notice that we need to
>> +continue having this mark/property until we have this machine
>> +rebooted.  But it is not a normal reboot (that don't reload qemu) we
>> +need the mapchine to poweroff/poweron on a fixed qemu.  And from now
>
> machine

done

Thanks a lot.



^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2023-10-23 11:09 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-05-15  8:31 [PATCH v3 0/3] Migration documentation Juan Quintela
2023-05-15  8:31 ` [PATCH v3 1/3] migration: Add documentation for backwards compatiblity Juan Quintela
2023-05-16 23:39   ` Peter Xu
2023-05-18  1:47     ` Xiaoyao Li
2023-10-17 13:59       ` Juan Quintela
2023-10-23 11:09     ` Juan Quintela
2023-05-15  8:32 ` [PATCH v3 2/3] migration/docs: How to migrate when hosts have different features Juan Quintela
2023-05-16 23:51   ` Peter Xu
2023-10-17 14:05     ` Juan Quintela
2023-05-17 10:23   ` Michael S. Tsirkin
2023-10-17 14:11     ` Juan Quintela
2023-05-15  8:32 ` [PATCH v3 3/3] migration/doc: We broke backwards compatibility Juan Quintela
2023-05-17  0:03   ` Peter Xu
2023-10-17 14:18     ` Juan Quintela
2023-05-17  7:09   ` Fiona Ebner
2023-10-23 11:09     ` Juan Quintela
2023-05-17 10:20   ` Michael S. Tsirkin
2023-05-17 11:43     ` Juan Quintela
2023-05-17 11:47       ` Michael S. Tsirkin
2023-05-31 13:23       ` Michael S. Tsirkin

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).