xen-devel.lists.xenproject.org archive mirror
* [Xen-devel] [PATCH v5 0/2] docs: Migration design documents
@ 2020-02-13 10:53 Paul Durrant
  2020-02-13 10:53 ` [Xen-devel] [PATCH v5 1/2] docs/designs: Add a design document for non-cooperative live migration Paul Durrant
                   ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: Paul Durrant @ 2020-02-13 10:53 UTC (permalink / raw)
  To: xen-devel
  Cc: Stefano Stabellini, Julien Grall, Wei Liu, Konrad Rzeszutek Wilk,
	George Dunlap, Andrew Cooper, Paul Durrant, Ian Jackson,
	Jan Beulich

Paul Durrant (2):
  docs/designs: Add a design document for non-cooperative live migration
  docs/designs: Add a design document for migration of xenstore data

 docs/designs/non-cooperative-migration.md | 272 ++++++++++++++++++++++
 docs/designs/xenstore-migration.md        | 136 +++++++++++
 2 files changed, 408 insertions(+)
 create mode 100644 docs/designs/non-cooperative-migration.md
 create mode 100644 docs/designs/xenstore-migration.md
---
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: George Dunlap <George.Dunlap@eu.citrix.com>
Cc: Ian Jackson <ian.jackson@eu.citrix.com>
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Julien Grall <julien@xen.org>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: Wei Liu <wl@xen.org>
-- 
2.20.1


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel


* [Xen-devel] [PATCH v5 1/2] docs/designs: Add a design document for non-cooperative live migration
  2020-02-13 10:53 [Xen-devel] [PATCH v5 0/2] docs: Migration design documents Paul Durrant
@ 2020-02-13 10:53 ` Paul Durrant
  2020-03-04 15:10   ` Julien Grall
  2020-02-13 10:53 ` [Xen-devel] [PATCH v5 2/2] docs/designs: Add a design document for migration of xenstore data Paul Durrant
  2020-02-20 12:54 ` [Xen-devel] [PATCH v5 0/2] docs: Migration design documents Durrant, Paul
  2 siblings, 1 reply; 11+ messages in thread
From: Paul Durrant @ 2020-02-13 10:53 UTC (permalink / raw)
  To: xen-devel
  Cc: Stefano Stabellini, Julien Grall, Wei Liu, Konrad Rzeszutek Wilk,
	George Dunlap, Andrew Cooper, Paul Durrant, Ian Jackson,
	Jan Beulich

It has become apparent to some large cloud providers that the current
model of cooperative migration of guests under Xen is not usable as it
relies on software running inside the guest, which is likely beyond the
provider's control.
This patch introduces a proposal for non-cooperative live migration,
designed not to rely on any guest-side software.

Signed-off-by: Paul Durrant <pdurrant@amazon.com>
---
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: George Dunlap <George.Dunlap@eu.citrix.com>
Cc: Ian Jackson <ian.jackson@eu.citrix.com>
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Julien Grall <julien@xen.org>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: Wei Liu <wl@xen.org>

v5:
 - Note that PV domain are not just expected to co-operate, they are
   required to

v4:
 - Fix issues raised by Wei

v2:
 - Use the term 'non-cooperative' instead of 'transparent'
 - Replace 'trust in' with 'reliance on' when referring to guest-side
   software
---
 docs/designs/non-cooperative-migration.md | 272 ++++++++++++++++++++++
 1 file changed, 272 insertions(+)
 create mode 100644 docs/designs/non-cooperative-migration.md

diff --git a/docs/designs/non-cooperative-migration.md b/docs/designs/non-cooperative-migration.md
new file mode 100644
index 0000000000..09f74c8c0d
--- /dev/null
+++ b/docs/designs/non-cooperative-migration.md
@@ -0,0 +1,272 @@
+# Non-Cooperative Migration of Guests on Xen
+
+## Background
+
+The normal model of migration in Xen is driven by the guest because it was
+originally implemented for PV guests, where the guest must be aware it is
+running under Xen and is hence expected to co-operate. This model dates from
+an era when it was assumed that the host administrator had control of at least
+the privileged software running in the guest (i.e. the guest kernel), which may
+still be true in an enterprise deployment but is not generally true in a cloud
+environment. The aim of this design is to provide a model which is purely host
+driven, requiring no co-operation from the software running in the
+guest, and is thus suitable for cloud scenarios.
+
+PV guests are out of scope for this project because, as is outlined above, they
+have a symbiotic relationship with the hypervisor and therefore a certain level
+of co-operation is required.
+
+HVM guests can already be migrated on Xen without guest co-operation but only
+if they don’t have PV drivers installed[1] or are in power state S3. The
+reason for not expecting co-operation if the guest is in S3 is obvious, but the
+reason co-operation is expected if PV drivers are installed is due to the
+nature of PV protocols.
+
+## Xenstore Nodes and Domain ID
+
+The PV driver model consists of a *frontend* and a *backend*. The frontend runs
+inside the guest domain and the backend runs inside a *service domain* which
+may or may not be domain 0. The frontend and backend typically pass data via
+memory pages which are shared between the two domains, but this channel of
+communication is generally established using xenstore (the store protocol
+itself being an exception to this for obvious chicken-and-egg reasons).
+
+Typical protocol establishment is based on use of two separate xenstore
+*areas*. If we consider PV drivers for the *netif* protocol (i.e. class vif)
+and assume the guest has domid X, the service domain has domid Y, and the vif
+has index Z then the frontend area will reside under the parent node:
+
+`/local/domain/X/device/vif/Z`
+
+All backends, by convention, typically reside under the parent node:
+
+`/local/domain/Y/backend`
+
+and the normal backend area for vif Z would be:
+
+`/local/domain/Y/backend/vif/X/Z`
+
+but this should not be assumed.
+
+The toolstack will place two nodes in the frontend area to explicitly locate
+the backend:
+
+    * `backend`: the fully qualified xenstore path of the backend area
+    * `backend-id`: the domid of the service domain
+
+and similarly two nodes in the backend area to locate the frontend area:
+
+    * `frontend`: the fully qualified xenstore path of the frontend area
+    * `frontend-id`: the domid of the guest domain
+
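For concreteness, the resulting linkage for a hypothetical guest with domid 5,
a vif with index 0 and a backend in domain 0 would be (all domids here are
purely illustrative):

```
/local/domain/5/device/vif/0/backend        = "/local/domain/0/backend/vif/5/0"
/local/domain/5/device/vif/0/backend-id     = "0"
/local/domain/0/backend/vif/5/0/frontend    = "/local/domain/5/device/vif/0"
/local/domain/0/backend/vif/5/0/frontend-id = "5"
```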
+
+The guest domain only has write permission to the frontend area and similarly
+the service domain only has write permission to the backend area, but both ends
+have read permission to both areas.
+
+Under both frontend and backend areas is a node called *state*. This is key to
+protocol establishment. Upon PV device creation the toolstack will set the
+value of both state nodes to 1 (XenbusStateInitialising[2]). This should cause
+enumeration of appropriate devices in both the guest and service domains. The
+backend device, once it has written any necessary protocol specific information
+into the xenstore backend area (to be read by the frontend driver) will update
+the backend state node to 2 (XenbusStateInitWait). From this point on PV
+protocols differ slightly; the following illustration is true of the netif
+protocol.
+
+Upon seeing a backend state value of 2, the frontend driver will then read the
+protocol-specific information, write details of grant references (for shared
+pages) and event channel ports (for signalling) that it has created, and set
+the state node in the frontend area to 4 (XenbusStateConnected). Upon seeing this
+frontend state, the backend driver will then read the grant references (mapping
+the shared pages) and event channel ports (opening its end of them) and set the
+state node in the backend area to 4. Protocol establishment is now complete and
+the frontend and backend start to pass data.
+
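The handshake described above can be sketched as a toy model (Python used
purely for illustration; the state values are those defined in
`xen/include/public/io/xenbus.h`, everything else is a stand-in):

```python
# Toy model of the netif-style xenbus handshake described above.
XenbusStateInitialising = 1
XenbusStateInitWait     = 2
XenbusStateConnected    = 4

class Device:
    """Stand-in for the frontend/backend 'state' nodes in xenstore."""
    def __init__(self):
        # The toolstack sets both state nodes to 1 on device creation.
        self.frontend_state = XenbusStateInitialising
        self.backend_state = XenbusStateInitialising

def backend_init(dev):
    # Backend writes protocol-specific info, then signals InitWait.
    dev.backend_state = XenbusStateInitWait

def frontend_connect(dev):
    # Frontend, seeing InitWait, publishes grant references and event
    # channel ports, then moves its own state node to Connected.
    assert dev.backend_state == XenbusStateInitWait
    dev.frontend_state = XenbusStateConnected

def backend_connect(dev):
    # Backend, seeing the frontend Connected, maps the grants, opens
    # the event channels and moves to Connected too.
    assert dev.frontend_state == XenbusStateConnected
    dev.backend_state = XenbusStateConnected

vif = Device()
backend_init(vif)
frontend_connect(vif)
backend_connect(vif)
assert vif.frontend_state == vif.backend_state == XenbusStateConnected
```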
+Because the domid of both ends of a PV protocol forms a key part of negotiating
+the data plane for that protocol (because it is encoded into both xenstore
+nodes and node paths), and because the guest’s own domid and the domid of the
+service domain are visible to the guest in xenstore (and hence may be cached
+internally), and neither is necessarily preserved during migration, it is
+necessary to have the co-operation of the frontend in re-negotiating the
+protocol using the new domids after migration.
+
+Moreover the backend-id value will be used by the frontend driver in setting up
+grant table entries and event channels to communicate with the service domain,
+so the co-operation of the guest is required to re-establish these in the new
+host environment after migration.
+
+Thus if we are to change the model and support migration of a guest with PV
+drivers, without the co-operation of the frontend driver code, the paths and
+values in both the frontend and backend xenstore areas must remain unchanged
+and valid in the new host environment, and the grant table entries and event
+channels must be preserved (and remain operational once guest execution is
+resumed).
+
+Because the service domain’s domid is used directly by the guest in setting
+up grant entries and event channels, the backend drivers in the new host
+environment must be provided by a service domain with the same domid. Also,
+because the guest can sample its own domid from the frontend area and use it in
+hypercalls (e.g. HVMOP_set_param) rather than DOMID_SELF, the guest domid must
+also be preserved to maintain the ABI.
+
+Furthermore, it will be necessary to modify backend drivers to re-establish
+communication with frontend drivers without perturbing the content of the
+backend area or requiring any changes to the values of the xenstore state nodes.
+
+## Other Para-Virtual State
+
+### Shared Rings
+
+Because the console and store protocol shared pages are actually part of the
+guest memory image (in an E820 reserved region just below 4G), their content
+will be migrated as part of the guest memory image. Hence no additional code
+is required to prevent any guest-visible change in the content.
+
+### Shared Info
+
+There is already a record defined in *libxenctrl Domain Image Format* [3]
+called `SHARED_INFO` which simply contains a complete copy of the domain’s
+shared info page. It is not currently included in an HVM (type `0x0002`)
+migration stream. It may be feasible to include it as an optional record
+but it is not clear that the content of the shared info page ever needs
+to be preserved for an HVM guest.
+
+For a PV guest the `arch_shared_info` sub-structure contains important
+information about the guest’s P2M, but this information is not relevant for
+an HVM guest, where the P2M is not directly manipulated by the guest. The other
+state contained in the `shared_info` structure relates to the domain wall-clock
+(the state of which should already be transferred by the `RTC` HVM context
+information contained in the `HVM_CONTEXT` save record) and some event
+channel state (particularly if using the *2l* protocol). Event channel state
+will need to be fully transferred if we are not going to require the guest’s
+co-operation to re-open the channels, and so it should be possible to re-build a
+shared info page for an HVM guest from such other state.
+
+Note that the shared info page also contains an array of `XEN_LEGACY_MAX_VCPUS`
+(32) `vcpu_info` structures. A domain may nominate a different guest physical
+address to use for the vcpu info. This is mandatory if a domain wants to
+use more than 32 vCPUs and optional for the first 32 vCPUs. This mapping is not
+currently transferred in the migration state so this will either need to be
+added into an existing save record, or an additional type of save record will
+be needed.
+
+### Xenstore Watches
+
+As mentioned above, no domain Xenstore state is currently transferred in the
+migration stream. There is a record defined in *libxenlight Domain Image
+Format* [4] called `EMULATOR_XENSTORE_DATA` for transferring Xenstore nodes
+relating to emulators but no record type is defined for nodes relating to the
+domain itself, nor for registered *watches*. A Xenstore watch is a mechanism
+used by PV frontend and backend drivers to request a notification if the value
+of a particular node (e.g. the other end’s state node) changes, so it is
+important that watches continue to function after a migration. One or more new
+save records will therefore be required to transfer Xenstore state. It will
+also be necessary to extend the *store* protocol[5] with mechanisms to allow
+the toolstack to acquire the list of watches that the guest has registered and
+for the toolstack to register a watch on behalf of a domain.
+
+### Event channels
+
+Event channels are essentially the para-virtual equivalent of interrupts. They
+are an important part of most PV protocols. Normally a frontend driver creates
+an *inter-domain* event channel between its own domain and the domain running
+the backend, which it discovers using the `backend-id` node in Xenstore (see
+above), by making an `EVTCHNOP_alloc_unbound` hypercall. This hypercall
+allocates an event channel object in the hypervisor and assigns a *local port*
+number which is then written into the frontend area in Xenstore. The backend
+driver then reads this port number and *binds* to the event channel by
+specifying it, and the value of `frontend-id`, as *remote domain* and *remote
+port* (respectively) to an `EVTCHNOP_bind_interdomain` hypercall. Once the
+connection is established in this fashion, frontend and backend drivers can use
+the event channel as a *mailbox* to notify each other when a shared ring has
+been updated with new requests or response structures.
+
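This two-hypercall sequence can be sketched as a toy model (Python for
illustration; the real interfaces are `EVTCHNOP_alloc_unbound` and
`EVTCHNOP_bind_interdomain` as defined in `xen/include/public/event_channel.h`,
and the port allocation policy shown is purely illustrative):

```python
# Toy model of inter-domain event channel setup.
class Hypervisor:
    def __init__(self):
        self.channels = {}    # (domid, local port) -> channel state
        self.next_port = {}   # domid -> next free local port number

    def _alloc_port(self, domid):
        port = self.next_port.get(domid, 1)
        self.next_port[domid] = port + 1
        return port

    def alloc_unbound(self, dom, remote_dom):
        # EVTCHNOP_alloc_unbound: allocate a channel in 'dom' that only
        # 'remote_dom' may later bind to; return the local port number.
        port = self._alloc_port(dom)
        self.channels[(dom, port)] = ("unbound", remote_dom)
        return port

    def bind_interdomain(self, dom, remote_dom, remote_port):
        # EVTCHNOP_bind_interdomain: connect to the remote end and
        # return a local port in the caller's own domain.
        assert self.channels[(remote_dom, remote_port)] == ("unbound", dom)
        local_port = self._alloc_port(dom)
        self.channels[(remote_dom, remote_port)] = ("bound", dom, local_port)
        self.channels[(dom, local_port)] = ("bound", remote_dom, remote_port)
        return local_port

xen = Hypervisor()
guest, backend = 5, 0                 # hypothetical domids
# Frontend allocates; the port is then written into the frontend area.
front_port = xen.alloc_unbound(guest, backend)
# Backend reads the port and frontend-id from xenstore and binds to it.
back_port = xen.bind_interdomain(backend, guest, front_port)
assert xen.channels[(guest, front_port)] == ("bound", backend, back_port)
```

It is exactly these local port numbers that a non-cooperative migration must
preserve, since the guest holds them as its only handles on the channels.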
+Currently no event channel state is preserved on migration, requiring frontend
+and backend drivers to create and bind a completely new set of event channels in
+order to re-establish a protocol connection. Hence, one or more new save
+records will be required to transfer event channel state in order to avoid the
+need for explicit action by frontend drivers running in the guest. Note that
+the local port numbers need to be preserved in this state as they are the only
+context the guest has to refer to the hypervisor event channel objects.
+
+Note also that the PV *store* (Xenstore access) and *console* protocols also
+rely on event channels which are set up by the toolstack. Normally, early in
+migration, the toolstack running on the remote host would set up a new pair of
+event channels for these protocols in the destination domain. These may not be
+assigned the same local port numbers as the protocols running in the source
+domain. For non-cooperative migration these channels must either be created
+with fixed port numbers, or their creation must be avoided and the channels
+instead included in the general event channel state record(s).
+
+### Grant table
+
+The grant table is essentially the para-virtual equivalent of an IOMMU. For
+example, the shared rings of a PV protocol are *granted* by a frontend driver
+to the backend driver by allocating *grant entries* in the guest’s table,
+filling in details of the memory pages and then writing the *grant references*
+(the index values of the grant entries) into Xenstore. The grant references of
+the protocol buffers themselves are typically written directly into the request
+structures passed via a shared ring.
+
+The guest is responsible for managing its own grant table. No hypercall is
+required to grant a memory page to another domain. It is sufficient to find an
+unused grant entry and set bits in the entry to give read and/or write access
+to a remote domain also specified in the entry along with the page frame
+number. Thus the layout and content of the grant table logically forms part of
+the guest state.
+
+Currently no grant table state is migrated, requiring a guest to separately
+maintain any state that it wishes to persist elsewhere in its memory image and
+then restore it after migration. Thus to avoid the need for such explicit
+action by the guest, one or more new save records will be required to migrate
+the contents of the grant table.
+
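A minimal sketch of guest-side grant management (the entry layout and GTF_*
flag bits follow `xen/include/public/grant_table.h` for v1 entries; the
free-entry search and helper name are illustrative):

```python
# Toy model of a v1 grant table as managed by the guest itself.
GTF_invalid       = 0
GTF_permit_access = 1       # low two bits encode the entry type
GTF_readonly      = 1 << 2  # remote end may only read the page

class GrantEntryV1:
    def __init__(self):
        self.flags = GTF_invalid
        self.domid = 0
        self.frame = 0

def gnttab_grant_access(table, domid, frame, readonly):
    # No hypercall needed: find an unused entry and fill it in.
    for ref, entry in enumerate(table):
        if entry.flags == GTF_invalid:
            entry.domid = domid
            entry.frame = frame
            # On real hardware a write barrier is required before the
            # flags update, so the remote end never sees a
            # half-initialised entry.
            entry.flags = GTF_permit_access | (GTF_readonly if readonly else 0)
            return ref  # the grant reference passed to the backend
    raise RuntimeError("grant table full")

table = [GrantEntryV1() for _ in range(8)]
ref = gnttab_grant_access(table, domid=0, frame=0x1234, readonly=False)
assert table[ref].flags == GTF_permit_access
```

Since the table is written purely by guest memory accesses like these, its
full contents must travel in the migration stream if the guest is not to be
asked to rebuild it.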
+# Outline Proposal
+
+* PV backend drivers will be modified to unilaterally re-establish connection
+to a frontend if the backend state node is restored with value 4
+(XenbusStateConnected)[6].
+
+* The toolstack should be modified to allow domid to be randomized on initial
+creation or default migration, but make it identical to the source domain on
+non-cooperative migration. Non-cooperative migration will have to be denied if
+the domid is unavailable on the target host, but randomization of domid on
+creation should hopefully minimize the likelihood of this. Non-cooperative
+migration to localhost will clearly not be possible. Patches have already been
+sent to `xen-devel` to make this change[7].
+
+* `xenstored` should be modified to implement the new mechanisms needed. See
+*Other Para-Virtual State* above. A further design document will propose
+additional protocol messages.
+
+* Within the migration stream extra save records will be defined as required.
+See *Other Para-Virtual State* above. A further design document will propose
+modifications to the libxenlight and libxenctrl Domain Image Formats.
+
+* An option should be added to the toolstack to initiate a non-cooperative
+migration, instead of the (default) potentially co-operative migration.
+Essentially this should skip the check for PV drivers, migrating as if none
+are present, but also enable the extra save records. Note that at
+least some of the extra records should only form part of a non-cooperative
+migration stream. For example, migrating event channel state would be
+counterproductive in a normal migration as this would essentially leak event
+channel objects at the receiving end. Others, such as grant table state, could
+potentially harmlessly form part of a normal migration stream.
+
+* * *
+[1] PV drivers are deemed to be installed if the HVM parameter
+*HVM_PARAM_CALLBACK_IRQ* has been set to a non-zero value.
+
+[2] See https://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=xen/include/public/io/xenbus.h
+
+[3] See https://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=docs/specs/libxc-migration-stream.pandoc
+
+[4] See https://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=docs/specs/libxl-migration-stream.pandoc
+
+[5] See https://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=docs/misc/xenstore.txt
+
+[6] `xen-blkback` and `xen-netback` have already been modified in Linux to do
+this.
+
+[7] See https://lists.xenproject.org/archives/html/xen-devel/2020-01/msg00632.html
+
-- 
2.20.1



* [Xen-devel] [PATCH v5 2/2] docs/designs: Add a design document for migration of xenstore data
  2020-02-13 10:53 [Xen-devel] [PATCH v5 0/2] docs: Migration design documents Paul Durrant
  2020-02-13 10:53 ` [Xen-devel] [PATCH v5 1/2] docs/designs: Add a design document for non-cooperative live migration Paul Durrant
@ 2020-02-13 10:53 ` Paul Durrant
  2020-03-04 18:31   ` Julien Grall
  2020-02-20 12:54 ` [Xen-devel] [PATCH v5 0/2] docs: Migration design documents Durrant, Paul
  2 siblings, 1 reply; 11+ messages in thread
From: Paul Durrant @ 2020-02-13 10:53 UTC (permalink / raw)
  To: xen-devel
  Cc: Stefano Stabellini, Julien Grall, Wei Liu, Konrad Rzeszutek Wilk,
	George Dunlap, Andrew Cooper, Paul Durrant, Ian Jackson,
	Jan Beulich

This patch proposes extra migration data and xenstore protocol
extensions to support non-cooperative live migration of guests.

Signed-off-by: Paul Durrant <pdurrant@amazon.com>
---
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: George Dunlap <George.Dunlap@eu.citrix.com>
Cc: Ian Jackson <ian.jackson@eu.citrix.com>
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Julien Grall <julien@xen.org>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: Wei Liu <wl@xen.org>

v5:
 - Add QUIESCE
 - Make semantics of <index> in GET_DOMAIN_WATCHES more clear

v4:
 - Drop the restrictions on special paths

v3:
 - New in v3
---
 docs/designs/xenstore-migration.md | 136 +++++++++++++++++++++++++++++
 1 file changed, 136 insertions(+)
 create mode 100644 docs/designs/xenstore-migration.md

diff --git a/docs/designs/xenstore-migration.md b/docs/designs/xenstore-migration.md
new file mode 100644
index 0000000000..5cfe2d9a7d
--- /dev/null
+++ b/docs/designs/xenstore-migration.md
@@ -0,0 +1,136 @@
+# Xenstore Migration
+
+## Background
+
+The design for *Non-Cooperative Migration of Guests*[1] explains that extra
+save records are required in the migration stream to allow a guest running
+PV drivers to be migrated without its co-operation. Moreover the save
+records must include details of registered xenstore watches as well as node
+content; information that cannot currently be recovered from `xenstored`,
+and hence some extension to the xenstore protocol[2] will also be required.
+
+The *libxenlight Domain Image Format* specification[3] already defines a
+record type `EMULATOR_XENSTORE_DATA` but this is not suitable for
+transferring xenstore data pertaining to the domain directly as it is
+specified such that keys are relative to the path
+`/local/domain/$dm_domid/device-model/$domid`. Thus it is necessary to
+define at least one new save record type.
+
+## Proposal
+
+### New Save Record
+
+A new mandatory record type should be defined within the libxenlight Domain
+Image Format:
+
+`0x00000007: DOMAIN_XENSTORE_DATA`
+
+The format of each of these new records should be as follows:
+
+
+```
+0     1     2     3     4     5     6     7 octet
++------------------------+------------------------+
+| type                   | record specific data   |
++------------------------+                        |
+...
++-------------------------------------------------+
+```
+
+
+| Field | Description |
+|---|---|
+| `type` | 0x00000000: invalid |
+|        | 0x00000001: node data |
+|        | 0x00000002: watch data |
+|        | 0x00000003 - 0xFFFFFFFF: reserved for future use |
+
+
+where data is always in the form of a NUL separated and terminated tuple
+as follows
+
+
+**node data**
+
+
+`<path>|<value>|<perm-as-string>|`
+
+
+`<path>` is considered relative to the domain path `/local/domain/$domid`
+and hence must not begin with `/`.
+`<path>` and `<value>` should be suitable to formulate a `WRITE` operation
+to the receiving xenstore and `<perm-as-string>` should be similarly suitable
+to formulate a subsequent `SET_PERMS` operation.
+
+**watch data**
+
+
+`<path>|<token>|`
+
+`<path>` again is considered relative and, together with `<token>`, should
+be suitable to formulate an `ADD_DOMAIN_WATCHES` operation (see below).
+
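The tuple formats above can be illustrated with a small serializer (a sketch:
the helper names, the sample node contents and the `b5` permission string are
all hypothetical; the type values follow the table above):

```python
# Illustrative encoding of the record-specific data described above,
# with NUL acting as both separator and terminator.
NODE_DATA  = 0x00000001
WATCH_DATA = 0x00000002

def encode_node_data(path, value, perms):
    # <path>|<value>|<perm-as-string>|
    assert not path.startswith("/")  # relative to /local/domain/$domid
    return b"".join(s.encode() + b"\x00" for s in (path, value, perms))

def encode_watch_data(path, token):
    # <path>|<token>|
    assert not path.startswith("/")
    return b"".join(s.encode() + b"\x00" for s in (path, token))

rec = encode_node_data("device/vif/0/state", "4", "b5")
assert rec == b"device/vif/0/state\x004\x00b5\x00"
assert rec.count(b"\x00") == 3       # NUL separated *and* terminated
```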
+
+### Protocol Extension
+
+Before xenstore state is migrated it is necessary to wait for any pending
+reads, writes, watch registrations etc. to complete, and also to make sure
+that xenstored does not start processing any new requests (so that new
+requests remain pending on the shared ring for subsequent processing on the
+new host). Hence the following operation is needed:
+
+```
+QUIESCE                 <domid>|
+
+Complete processing of any request issued by the specified domain, and
+do not process any further requests from the shared ring.
+```
+
+The `WATCH` operation does not allow specification of a `<domid>`; it is
+assumed that the watch pertains to the domain that owns the shared ring
+over which the operation is passed. Hence, for the toolstack to be able
+to register a watch on behalf of a domain a new operation is needed:
+
+```
+ADD_DOMAIN_WATCHES      <domid>|<watch>|+
+
+Adds watches on behalf of the specified domain.
+
+<watch> is a NUL separated tuple of <path>|<token>. The semantics of this
+operation are identical to the domain issuing WATCH <path>|<token>| for
+each <watch>.
+```
+
+The watch information for a domain also needs to be extracted from the
+sending xenstored so the following operation is also needed:
+
+```
+GET_DOMAIN_WATCHES      <domid>|<index>   <gencnt>|<watch>|* 
+
+Gets the list of watches that are currently registered for the domain.
+
+<watch> is a NUL separated tuple of <path>|<token>. The sub-list returned
+will start at <index> items into the the overall list of watches and may
+be truncated (at a <watch> boundary) such that the returned data fits
+within XENSTORE_PAYLOAD_MAX.
+
+If <index> is beyond the end of the overall list then the returned sub-
+list will be empty. If the value of <gencnt> changes then it indicates
+that the overall watch list has changed and thus it may be necessary
+to re-issue the operation for previous values of <index>.
+```
+
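The retrieval loop implied by these semantics might look like the following
sketch, where `get_domain_watches()` is a hypothetical toolstack binding for
the operation returning a `(gencnt, sub_list)` pair:

```python
# Sketch of a toolstack-side walk over GET_DOMAIN_WATCHES, restarting
# from index 0 whenever <gencnt> changes mid-walk.
def collect_watches(get_domain_watches, domid):
    while True:
        watches = []
        index = 0
        gencnt, chunk = get_domain_watches(domid, index)
        while chunk:
            watches.extend(chunk)
            index += len(chunk)
            next_gencnt, chunk = get_domain_watches(domid, index)
            if next_gencnt != gencnt:
                break            # list changed under us: restart the walk
        else:
            return watches       # ran off the end with a stable gencnt

def fake_backend(watch_list, page=2):
    # Stand-in for xenstored: returns at most 'page' watches per call,
    # modelling truncation at XENSTORE_PAYLOAD_MAX.
    def op(domid, index):
        return 0, watch_list[index:index + page]
    return op

watches = [("device/vif/0/state", "tokA"), ("control/shutdown", "tokB"),
           ("device/vbd/768/state", "tokC")]
assert collect_watches(fake_backend(watches), 5) == watches
```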
+It may also be desirable to state in the protocol specification that
+the `INTRODUCE` operation should not clear the `<mfn>` specified such that
+a `RELEASE` operation followed by an `INTRODUCE` operation form an
+idempotent pair. The current implementation of *C xenstored* does this
+(in the `domain_conn_reset()` function) but this could be dropped as this
+behaviour is not currently specified and the page will always be zeroed
+for a newly created domain.
+
+
+* * *
+
+[1] See https://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=docs/designs/non-cooperative-migration.md
+[2] See https://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=docs/misc/xenstore.txt
+[3] See https://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=docs/specs/libxl-migration-stream.pandoc
-- 
2.20.1



* Re: [Xen-devel] [PATCH v5 0/2] docs: Migration design documents
  2020-02-13 10:53 [Xen-devel] [PATCH v5 0/2] docs: Migration design documents Paul Durrant
  2020-02-13 10:53 ` [Xen-devel] [PATCH v5 1/2] docs/designs: Add a design document for non-cooperative live migration Paul Durrant
  2020-02-13 10:53 ` [Xen-devel] [PATCH v5 2/2] docs/designs: Add a design document for migration of xenstore data Paul Durrant
@ 2020-02-20 12:54 ` Durrant, Paul
  2020-02-28 17:20   ` Durrant, Paul
  2 siblings, 1 reply; 11+ messages in thread
From: Durrant, Paul @ 2020-02-20 12:54 UTC (permalink / raw)
  To: Durrant, Paul, xen-devel
  Cc: Stefano Stabellini, Julien Grall, Wei Liu, Konrad Rzeszutek Wilk,
	George Dunlap, Andrew Cooper, Ian Jackson, Jan Beulich

Ping?

I have not received any further comments on v5. Can I please get acks or otherwise so we can (hopefully) move on with coding?

  Paul

> -----Original Message-----
> From: Paul Durrant <pdurrant@amazon.com>
> Sent: 13 February 2020 10:53
> To: xen-devel@lists.xenproject.org
> Cc: Durrant, Paul <pdurrant@amazon.co.uk>; Andrew Cooper
> <andrew.cooper3@citrix.com>; George Dunlap <George.Dunlap@eu.citrix.com>;
> Ian Jackson <ian.jackson@eu.citrix.com>; Jan Beulich <jbeulich@suse.com>;
> Julien Grall <julien@xen.org>; Konrad Rzeszutek Wilk
> <konrad.wilk@oracle.com>; Stefano Stabellini <sstabellini@kernel.org>; Wei
> Liu <wl@xen.org>
> Subject: [PATCH v5 0/2] docs: Migration design documents
> 
> Paul Durrant (2):
>   docs/designs: Add a design document for non-cooperative live migration
>   docs/designs: Add a design document for migration of xenstore data
> 
>  docs/designs/non-cooperative-migration.md | 272 ++++++++++++++++++++++
>  docs/designs/xenstore-migration.md        | 136 +++++++++++
>  2 files changed, 408 insertions(+)
>  create mode 100644 docs/designs/non-cooperative-migration.md
>  create mode 100644 docs/designs/xenstore-migration.md
> ---
> Cc: Andrew Cooper <andrew.cooper3@citrix.com>
> Cc: George Dunlap <George.Dunlap@eu.citrix.com>
> Cc: Ian Jackson <ian.jackson@eu.citrix.com>
> Cc: Jan Beulich <jbeulich@suse.com>
> Cc: Julien Grall <julien@xen.org>
> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
> Cc: Stefano Stabellini <sstabellini@kernel.org>
> Cc: Wei Liu <wl@xen.org>
> --
> 2.20.1



* Re: [Xen-devel] [PATCH v5 0/2] docs: Migration design documents
  2020-02-20 12:54 ` [Xen-devel] [PATCH v5 0/2] docs: Migration design documents Durrant, Paul
@ 2020-02-28 17:20   ` Durrant, Paul
  0 siblings, 0 replies; 11+ messages in thread
From: Durrant, Paul @ 2020-02-28 17:20 UTC (permalink / raw)
  To: xen-devel
  Cc: Stefano Stabellini, Julien Grall, Wei Liu, Konrad Rzeszutek Wilk,
	George Dunlap, Andrew Cooper, Ian Jackson, Jan Beulich

Ping again...

> -----Original Message-----
> From: Durrant, Paul <pdurrant@amazon.co.uk>
> Sent: 20 February 2020 12:54
> To: Durrant, Paul <pdurrant@amazon.co.uk>; xen-devel@lists.xenproject.org
> Cc: Andrew Cooper <andrew.cooper3@citrix.com>; George Dunlap
> <George.Dunlap@eu.citrix.com>; Ian Jackson <ian.jackson@eu.citrix.com>;
> Jan Beulich <jbeulich@suse.com>; Julien Grall <julien@xen.org>; Konrad
> Rzeszutek Wilk <konrad.wilk@oracle.com>; Stefano Stabellini
> <sstabellini@kernel.org>; Wei Liu <wl@xen.org>
> Subject: RE: [PATCH v5 0/2] docs: Migration design documents
> 
> Ping?
> 
> I have not received any further comments on v5. Can I please get acks or
> otherwise so we can (hopefully) move on with coding?
> 
>   Paul
> 
> > -----Original Message-----
> > From: Paul Durrant <pdurrant@amazon.com>
> > Sent: 13 February 2020 10:53
> > To: xen-devel@lists.xenproject.org
> > Cc: Durrant, Paul <pdurrant@amazon.co.uk>; Andrew Cooper
> > <andrew.cooper3@citrix.com>; George Dunlap
> <George.Dunlap@eu.citrix.com>;
> > Ian Jackson <ian.jackson@eu.citrix.com>; Jan Beulich
> <jbeulich@suse.com>;
> > Julien Grall <julien@xen.org>; Konrad Rzeszutek Wilk
> > <konrad.wilk@oracle.com>; Stefano Stabellini <sstabellini@kernel.org>;
> Wei
> > Liu <wl@xen.org>
> > Subject: [PATCH v5 0/2] docs: Migration design documents
> >
> > Paul Durrant (2):
> >   docs/designs: Add a design document for non-cooperative live migration
> >   docs/designs: Add a design document for migration of xenstore data
> >
> >  docs/designs/non-cooperative-migration.md | 272 ++++++++++++++++++++++
> >  docs/designs/xenstore-migration.md        | 136 +++++++++++
> >  2 files changed, 408 insertions(+)
> >  create mode 100644 docs/designs/non-cooperative-migration.md
> >  create mode 100644 docs/designs/xenstore-migration.md
> > ---
> > Cc: Andrew Cooper <andrew.cooper3@citrix.com>
> > Cc: George Dunlap <George.Dunlap@eu.citrix.com>
> > Cc: Ian Jackson <ian.jackson@eu.citrix.com>
> > Cc: Jan Beulich <jbeulich@suse.com>
> > Cc: Julien Grall <julien@xen.org>
> > Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
> > Cc: Stefano Stabellini <sstabellini@kernel.org>
> > Cc: Wei Liu <wl@xen.org>
> > --
> > 2.20.1



* Re: [Xen-devel] [PATCH v5 1/2] docs/designs: Add a design document for non-cooperative live migration
  2020-02-13 10:53 ` [Xen-devel] [PATCH v5 1/2] docs/designs: Add a design document for non-cooperative live migration Paul Durrant
@ 2020-03-04 15:10   ` Julien Grall
  2020-03-04 15:23     ` Durrant, Paul
  0 siblings, 1 reply; 11+ messages in thread
From: Julien Grall @ 2020-03-04 15:10 UTC (permalink / raw)
  To: Paul Durrant, xen-devel
  Cc: Stefano Stabellini, Wei Liu, Konrad Rzeszutek Wilk,
	George Dunlap, Andrew Cooper, Ian Jackson, Jan Beulich

Hi Paul,

The proposal looks sensible to me. Some NITpicking below.

On 13/02/2020 10:53, Paul Durrant wrote:
> It has become apparent to some large cloud providers that the current
> model of cooperative migration of guests under Xen is not usable as it
> relies on software running inside the guest, which is likely beyond the
> provider's control.
> This patch introduces a proposal for non-cooperative live migration,
> designed not to rely on any guest-side software.
> 
> Signed-off-by: Paul Durrant <pdurrant@amazon.com>
> ---
> Cc: Andrew Cooper <andrew.cooper3@citrix.com>
> Cc: George Dunlap <George.Dunlap@eu.citrix.com>
> Cc: Ian Jackson <ian.jackson@eu.citrix.com>
> Cc: Jan Beulich <jbeulich@suse.com>
> Cc: Julien Grall <julien@xen.org>
> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
> Cc: Stefano Stabellini <sstabellini@kernel.org>
> Cc: Wei Liu <wl@xen.org>
> 
> v5:
>   - Note that PV domains are not just expected to co-operate, they are
>     required to
> 
> v4:
>   - Fix issues raised by Wei
> 
> v2:
>   - Use the term 'non-cooperative' instead of 'transparent'
>   - Replace 'trust in' with 'reliance on' when referring to guest-side
>     software
> ---
>   docs/designs/non-cooperative-migration.md | 272 ++++++++++++++++++++++
>   1 file changed, 272 insertions(+)
>   create mode 100644 docs/designs/non-cooperative-migration.md
> 
> diff --git a/docs/designs/non-cooperative-migration.md b/docs/designs/non-cooperative-migration.md
> new file mode 100644
> index 0000000000..09f74c8c0d
> --- /dev/null
> +++ b/docs/designs/non-cooperative-migration.md
> @@ -0,0 +1,272 @@
> +# Non-Cooperative Migration of Guests on Xen
> +
> +## Background
> +
> +The normal model of migration in Xen is driven by the guest because it was
> +originally implemented for PV guests, where the guest must be aware it is
> +running under Xen and is hence expected to co-operate. This model dates from
> +an era when it was assumed that the host administrator had control of at least
> +the privileged software running in the guest (i.e. the guest kernel) which may
> +still be true in an enterprise deployment but is not generally true in a cloud
> +environment. The aim of this design is to provide a model which is purely host
> +driven, requiring no co-operation from the software running in the
> +guest, and is thus suitable for cloud scenarios.
> +
> +PV guests are out of scope for this project because, as is outlined above, they
> +have a symbiotic relationship with the hypervisor and therefore a certain level
> +of co-operation is required.
> +
> +HVM guests can already be migrated on Xen without guest co-operation but only
> +if they don’t have PV drivers installed[1] or are in power state S3. The

S3 is very ACPI centric, so I would prefer if we avoid the term. I think 
the non-ACPI description is "suspend to RAM". I would be OK if you 
mention S3 in parentheses.

> +reason for not expecting co-operation if the guest is in S3 is obvious, but the
> +reason co-operation is expected if PV drivers are installed is due to the
> +nature of PV protocols.
> +
> +## Xenstore Nodes and Domain ID
> +
> +The PV driver model consists of a *frontend* and a *backend*. The frontend runs
> +inside the guest domain and the backend runs inside a *service domain* which
> +may or may not be domain 0. The frontend and backend typically pass data via
> +memory pages which are shared between the two domains, but this channel of
> +communication is generally established using xenstore (the store protocol
> +itself being an exception to this for obvious chicken-and-egg reasons).
> +
> +Typical protocol establishment is based on use of two separate xenstore
> +*areas*. If we consider PV drivers for the *netif* protocol (i.e. class vif)
> +and assume the guest has domid X, the service domain has domid Y, and the vif
> +has index Z then the frontend area will reside under the parent node:
> +
> +`/local/domain/X/device/vif/Z`
> +
> +All backends, by convention, typically reside under parent node:
> +
> +`/local/domain/Y/backend`
> +
> +and the normal backend area for vif Z would be:
> +
> +`/local/domain/Y/backend/vif/X/Z`
> +
> +but this should not be assumed.
> +
> +The toolstack will place two nodes in the frontend area to explicitly locate
> +the backend:
> +
> +    * `backend`: the fully qualified xenstore path of the backend area
> +    * `backend-id`: the domid of the service domain
> +
> +and similarly two nodes in the backend area to locate the frontend area:
> +
> +    * `frontend`: the fully qualified xenstore path of the frontend area
> +    * `frontend-id`: the domid of the guest domain
> +
> +
> +The guest domain only has write permission to the frontend area and similarly
> +the service domain only has write permission to the backend area, but both ends
> +have read permission to both areas.
> +
> +Under both frontend and backend areas is a node called *state*. This is key to
> +protocol establishment. Upon PV device creation the toolstack will set the
> +value of both state nodes to 1 (XenbusStateInitialising[2]). This should cause
> +enumeration of appropriate devices in both the guest and service domains. The
> +backend device, once it has written any necessary protocol specific information
> +into the xenstore backend area (to be read by the frontend driver) will update
> +the backend state node to 2 (XenbusStateInitWait). From this point on PV
> +protocols differ slightly; the following illustration is true of the netif
> +protocol.
> +
> +Upon seeing a backend state value of 2, the frontend driver will then read the
> +protocol specific information, write details of grant references (for shared
> +pages) and event channel ports (for signalling) that it has created, and set
> +the state node in the frontend area to 4 (XenbusStateConnected). Upon seeing this
> +frontend state, the backend driver will then read the grant references (mapping
> +the shared pages) and event channel ports (opening its end of them) and set the
> +state node in the backend area to 4. Protocol establishment is now complete and
> +the frontend and backend start to pass data.
> +
> +Because the domid of both ends of a PV protocol forms a key part of negotiating
> +the data plane for that protocol (because it is encoded into both xenstore
> +nodes and node paths), and because the guest’s own domid and the domid of the
> +service domain are visible to the guest in xenstore (and hence may be cached
> +internally), and neither are necessarily preserved during migration, it is
> +hence necessary to have the co-operation of the frontend in re-negotiating the
> +protocol using the new domid after migration.
> +
> +Moreover the backend-id value will be used by the frontend driver in setting up
> +grant table entries and event channels to communicate with the service domain,
> +so the co-operation of the guest is required to re-establish these in the new
> +host environment after migration.
> +
> +Thus if we are to change the model and support migration of a guest with PV
> +drivers, without the co-operation of the frontend driver code, the paths and
> +values in both the frontend and backend xenstore areas must remain unchanged
> +and valid in the new host environment, and the grant table entries and event
> +channels must be preserved (and remain operational once guest execution is
> +resumed).
> +
> +Because the service domain’s domid is used directly by the guest in setting
> +up grant entries and event channels, the backend drivers in the new host
> +environment must be provided by service domain with the same domid. Also,
> +because the guest can sample its own domid from the frontend area and use it in
> +hypercalls (e.g. HVMOP_set_param) rather than DOMID_SELF, the guest domid must
> +also be preserved to maintain the ABI.
> +
> +Furthermore, it will be necessary to modify backend drivers to re-establish
> +communication with frontend drivers without perturbing the content of the
> +backend area or requiring any changes to the values of the xenstore state nodes.
> +
> +## Other Para-Virtual State
> +
> +### Shared Rings
> +
> +Because the console and store protocol shared pages are actually part of the
> +guest memory image (in an E820 reserved region just below 4G) then the content

While Arm does not yet support migration, the concept of non-cooperative 
live migration is not x86 specific. I am OK with giving an arch-specific 
example, but it should be clear on which architecture this is valid.

> +will get migrated as part of the guest memory image. Hence no additional code
> +is required to prevent any guest-visible change in the content.
> +
> +### Shared Info
> +
> +There is already a record defined in *libxenctrl Domain Image Format* [3]
> +called `SHARED_INFO` which simply contains a complete copy of the domain’s
> +shared info page. It is not currently included in an HVM (type `0x0002`)
> +migration stream. It may be feasible to include it as an optional record
> +but it is not clear that the content of the shared info page ever needs
> +to be preserved for an HVM guest.
> +
> +For a PV guest the `arch_shared_info` sub-structure contains important
> +information about the guest’s P2M, but this information is not relevant for
> +an HVM guest where the P2M is not directly manipulated by the guest. The other
> +state contained in the `shared_info` structure relates to the domain wall-clock
> +(the state of which should already be transferred by the `RTC` HVM context
> +information which is contained in the `HVM_CONTEXT` save record) and some event
> +channel state (particularly if using the *2l* protocol). Event channel state
> +will need to be fully transferred if we are not going to require the guest
> +co-operation to re-open the channels and so it should be possible to re-build a
> +shared info page for an HVM guest from such other state.
> +
> +Note that the shared info page also contains an array of `XEN_LEGACY_MAX_VCPUS`
> +(32) `vcpu_info` structures. A domain may nominate a different guest physical
> +address to use for the vcpu info. This is mandatory if a domain wants to
> +use more than 32 vCPUs and optional for legacy vCPUs. This mapping is not

Similar to above, those values are x86 specific. On Arm, only CPU0 is 
described in shared_info.

> +currently transferred in the migration state so this will either need to be
> +added into an existing save record, or an additional type of save record will
> +be needed.
> +
> +### Xenstore Watches
> +
> +As mentioned above, no domain Xenstore state is currently transferred in the
> +migration stream. There is a record defined in *libxenlight Domain Image
> +Format* [4] called `EMULATOR_XENSTORE_DATA` for transferring Xenstore nodes
> +relating to emulators but no record type is defined for nodes relating to the
> +domain itself, nor for registered *watches*. A XenStore watch is a mechanism
> +used by PV frontend and backend drivers to request a notification if the value
> +of a particular node (e.g. the other end’s state node) changes, so it is
> +important that watches continue to function after a migration. One or more new
> +save records will therefore be required to transfer Xenstore state. It will
> +also be necessary to extend the *store* protocol[5] with mechanisms to allow
> +the toolstack to acquire the list of watches that the guest has registered and
> +for the toolstack to register a watch on behalf of a domain.
> +
> +### Event channels
> +
> +Event channels are essentially the para-virtual equivalent of interrupts. They
> +are an important part of most PV protocols. Normally a frontend driver creates
> +an *inter-domain* event channel between its own domain and the domain running
> +the backend, which it discovers using the `backend-id` node in Xenstore (see
> +above), by making a `EVTCHNOP_alloc_unbound` hypercall. This hypercall
> +allocates an event channel object in the hypervisor and assigns a *local port*
> +number which is then written into the frontend area in Xenstore. The backend
> +driver then reads this port number and *binds* to the event channel by
> +specifying it, and the value of `frontend-id`, as *remote domain* and *remote
> +port* (respectively) to a `EVTCHNOP_bind_interdomain` hypercall. Once
> +connection is established in this fashion frontend and backend drivers can use
> +the event channel as a *mailbox* to notify each other when a shared ring has
> +been updated with new requests or response structures.
> +
> +Currently no event channel state is preserved on migration, requiring frontend
> +and backend drivers to create and bind a complete new set of event channels in
> +order to re-establish a protocol connection. Hence, one or more new save
> +records will be required to transfer event channel state in order to avoid the
> +need for explicit action by frontend drivers running in the guest. Note that
> +the local port numbers need to be preserved in this state as they are the only
> +context the guest has to refer to the hypervisor event channel objects.
> +Note also that the PV *store* (Xenstore access) and *console* protocols also
> +rely on event channels which are set up by the toolstack. Normally, early in
> +migration, the toolstack running on the remote host would set up a new pair of
> +event channels for these protocols in the destination domain. These may not be
> +assigned the same local port numbers as the protocols running in the source
> +domain. For non-cooperative migration these channels must either be created with
> +fixed port numbers, or their creation must be avoided and instead be included
> +in the general event channel state record(s).
> +
> +### Grant table
> +
> +The grant table is essentially the para-virtual equivalent of an IOMMU. For
> +example, the shared rings of a PV protocol are *granted* by a frontend driver
> +to the backend driver by allocating *grant entries* in the guest’s table,
> +filling in details of the memory pages and then writing the *grant references*
> +(the index values of the grant entries) into Xenstore. The grant references of
> +the protocol buffers themselves are typically written directly into the request
> +structures passed via a shared ring.
> +
> +The guest is responsible for managing its own grant table. No hypercall is
> +required to grant a memory page to another domain. It is sufficient to find an
> +unused grant entry and set bits in the entry to give read and/or write access
> +to a remote domain also specified in the entry along with the page frame
> +number. Thus the layout and content of the grant table logically forms part of
> +the guest state.
> +
> +Currently no grant table state is migrated, requiring a guest to separately
> +maintain any state that it wishes to persist elsewhere in its memory image and
> +then restore it after migration. Thus to avoid the need for such explicit
> +action by the guest, one or more new save records will be required to migrate
> +the contents of the grant table.
> +
> +# Outline Proposal
> +
> +* PV backend drivers will be modified to unilaterally re-establish connection
> +to a frontend if the backend state node is restored with value 4
> +(XenbusStateConnected)[6].
> +
> +* The toolstack should be modified to allow domid to be randomized on initial
> +creation or default migration, but make it identical to the source domain on
> +non-cooperative migration. Non-Cooperative migration will have to be denied if the
> +domid is unavailable on the target host, but randomization of domid on creation
> +should hopefully minimize the likelihood of this. Non-Cooperative migration to
> +localhost will clearly not be possible. Patches have already been sent to
> +`xen-devel` to make this change[7].

IIRC, the patch is merged now. You may want to update the last sentence.

> +
> +* `xenstored` should be modified to implement the new mechanisms needed. See
> +*Other Para-Virtual State* above. A further design document will propose
> +additional protocol messages.
> +
> +* Within the migration stream extra save records will be defined as required.
> +See *Other Para-Virtual State* above. A further design document will propose
> +modifications to the libxenlight and libxenctrl Domain Image Formats.
> +
> +* An option should be added to the toolstack to initiate a non-cooperative
> +migration, instead of the (default) potentially co-operative migration.
> +Essentially this should skip the check to see if PV drivers are present and
> +migrate as if there are none, but also enable the extra save records. Note that at
> +least some of the extra records should only form part of a non-cooperative
> +migration stream. For example, migrating event channel state would be counter
> +productive in a normal migration as this will essentially leak event channel
> +objects at the receiving end. Others, such as grant table state, could
> +potentially harmlessly form part of a normal migration stream.
> +
> +* * *
> +[1] PV drivers are deemed to be installed if the HVM parameter
> +*HVM_PARAM_CALLBACK_IRQ* has been set to a non-zero value.
> +
> +[2] See https://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=xen/include/public/io/xenbus.h
> +
> +[3] See https://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=docs/specs/libxc-migration-stream.pandoc
> +
> +[4] See https://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=docs/specs/libxl-migration-stream.pandoc
> +
> +[5] See https://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=docs/misc/xenstore.txt
> +
> +[6] `xen-blkback` and `xen-netback` have already been modified in Linux to do
> +this.
> +
> +[7] See https://lists.xenproject.org/archives/html/xen-devel/2020-01/msg00632.html
> +
> 

Cheers,

-- 
Julien Grall



* Re: [Xen-devel] [PATCH v5 1/2] docs/designs: Add a design document for non-cooperative live migration
  2020-03-04 15:10   ` Julien Grall
@ 2020-03-04 15:23     ` Durrant, Paul
  2020-03-04 15:36       ` Julien Grall
  0 siblings, 1 reply; 11+ messages in thread
From: Durrant, Paul @ 2020-03-04 15:23 UTC (permalink / raw)
  To: 'Julien Grall', xen-devel
  Cc: Stefano Stabellini, Wei Liu, Konrad Rzeszutek Wilk,
	George Dunlap, Andrew Cooper, Ian Jackson, Jan Beulich

> -----Original Message-----
> From: Xen-devel <xen-devel-bounces@lists.xenproject.org> On Behalf Of Julien Grall
> Sent: 04 March 2020 15:11
> To: Durrant, Paul <pdurrant@amazon.co.uk>; xen-devel@lists.xenproject.org
> Cc: Stefano Stabellini <sstabellini@kernel.org>; Wei Liu <wl@xen.org>; Konrad Rzeszutek Wilk
> <konrad.wilk@oracle.com>; George Dunlap <George.Dunlap@eu.citrix.com>; Andrew Cooper
> <andrew.cooper3@citrix.com>; Ian Jackson <ian.jackson@eu.citrix.com>; Jan Beulich <jbeulich@suse.com>
> Subject: Re: [Xen-devel] [PATCH v5 1/2] docs/designs: Add a design document for non-cooperative live
> migration
> 
> Hi Paul,
> 
> The proposal looks sensible to me. Some NITpicking below.
> 
> On 13/02/2020 10:53, Paul Durrant wrote:
> > It has become apparent to some large cloud providers that the current
> > model of cooperative migration of guests under Xen is not usable as it
> > relies on software running inside the guest, which is likely beyond the
> > provider's control.
> > This patch introduces a proposal for non-cooperative live migration,
> > designed not to rely on any guest-side software.
> >
> > Signed-off-by: Paul Durrant <pdurrant@amazon.com>
> > ---
> > Cc: Andrew Cooper <andrew.cooper3@citrix.com>
> > Cc: George Dunlap <George.Dunlap@eu.citrix.com>
> > Cc: Ian Jackson <ian.jackson@eu.citrix.com>
> > Cc: Jan Beulich <jbeulich@suse.com>
> > Cc: Julien Grall <julien@xen.org>
> > Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
> > Cc: Stefano Stabellini <sstabellini@kernel.org>
> > Cc: Wei Liu <wl@xen.org>
> >
> > v5:
> >   - Note that PV domains are not just expected to co-operate, they are
> >     required to
> >
> > v4:
> >   - Fix issues raised by Wei
> >
> > v2:
> >   - Use the term 'non-cooperative' instead of 'transparent'
> >   - Replace 'trust in' with 'reliance on' when referring to guest-side
> >     software
> > ---
> >   docs/designs/non-cooperative-migration.md | 272 ++++++++++++++++++++++
> >   1 file changed, 272 insertions(+)
> >   create mode 100644 docs/designs/non-cooperative-migration.md
> >
> > diff --git a/docs/designs/non-cooperative-migration.md b/docs/designs/non-cooperative-migration.md
> > new file mode 100644
> > index 0000000000..09f74c8c0d
> > --- /dev/null
> > +++ b/docs/designs/non-cooperative-migration.md
> > @@ -0,0 +1,272 @@
> > +# Non-Cooperative Migration of Guests on Xen
> > +
> > +## Background
> > +
> > +The normal model of migration in Xen is driven by the guest because it was
> > +originally implemented for PV guests, where the guest must be aware it is
> > +running under Xen and is hence expected to co-operate. This model dates from
> > +an era when it was assumed that the host administrator had control of at least
> > +the privileged software running in the guest (i.e. the guest kernel) which may
> > +still be true in an enterprise deployment but is not generally true in a cloud
> > +environment. The aim of this design is to provide a model which is purely host
> > +driven, requiring no co-operation from the software running in the
> > +guest, and is thus suitable for cloud scenarios.
> > +
> > +PV guests are out of scope for this project because, as is outlined above, they
> > +have a symbiotic relationship with the hypervisor and therefore a certain level
> > +of co-operation is required.
> > +
> > +HVM guests can already be migrated on Xen without guest co-operation but only
> > +if they don’t have PV drivers installed[1] or are in power state S3. The
> 
> S3 is very ACPI centric, so I would prefer if we avoid the term. I think
> the non-ACPI description is "suspend to RAM". I would be OK if you
> mention S3 in parentheses.

I'm actually pulling this from the way the code is currently written, which is clearly quite x86 specific:

xc_hvm_param_get(CTX->xch, domid, HVM_PARAM_ACPI_S_STATE, &hvm_s_state)
.
.
.
if (dsps->type == LIBXL_DOMAIN_TYPE_HVM && (!hvm_pvdrv || hvm_s_state)) {
    LOGD(DEBUG, domid, "Calling xc_domain_shutdown on HVM domain");
    ret = xc_domain_shutdown(CTX->xch, domid, SHUTDOWN_suspend);
    .
    .
}

So actually I should say 'not in power state S0'.

> 
> > +reason for not expecting co-operation if the guest is in S3 is obvious, but the
> > +reason co-operation is expected if PV drivers are installed is due to the
> > +nature of PV protocols.
> > +
> > +## Xenstore Nodes and Domain ID
> > +
> > +The PV driver model consists of a *frontend* and a *backend*. The frontend runs
> > +inside the guest domain and the backend runs inside a *service domain* which
> > +may or may not be domain 0. The frontend and backend typically pass data via
> > +memory pages which are shared between the two domains, but this channel of
> > +communication is generally established using xenstore (the store protocol
> > +itself being an exception to this for obvious chicken-and-egg reasons).
> > +
> > +Typical protocol establishment is based on use of two separate xenstore
> > +*areas*. If we consider PV drivers for the *netif* protocol (i.e. class vif)
> > +and assume the guest has domid X, the service domain has domid Y, and the vif
> > +has index Z then the frontend area will reside under the parent node:
> > +
> > +`/local/domain/X/device/vif/Z`
> > +
> > +All backends, by convention, typically reside under parent node:
> > +
> > +`/local/domain/Y/backend`
> > +
> > +and the normal backend area for vif Z would be:
> > +
> > +`/local/domain/Y/backend/vif/X/Z`
> > +
> > +but this should not be assumed.
> > +
> > +The toolstack will place two nodes in the frontend area to explicitly locate
> > +the backend:
> > +
> > +    * `backend`: the fully qualified xenstore path of the backend area
> > +    * `backend-id`: the domid of the service domain
> > +
> > +and similarly two nodes in the backend area to locate the frontend area:
> > +
> > +    * `frontend`: the fully qualified xenstore path of the frontend area
> > +    * `frontend-id`: the domid of the guest domain
> > +
> > +
> > +The guest domain only has write permission to the frontend area and similarly
> > +the service domain only has write permission to the backend area, but both ends
> > +have read permission to both areas.
> > +
> > +Under both frontend and backend areas is a node called *state*. This is key to
> > +protocol establishment. Upon PV device creation the toolstack will set the
> > +value of both state nodes to 1 (XenbusStateInitialising[2]). This should cause
> > +enumeration of appropriate devices in both the guest and service domains. The
> > +backend device, once it has written any necessary protocol specific information
> > +into the xenstore backend area (to be read by the frontend driver) will update
> > +the backend state node to 2 (XenbusStateInitWait). From this point on PV
> > +protocols differ slightly; the following illustration is true of the netif
> > +protocol.
> > +
> > +Upon seeing a backend state value of 2, the frontend driver will then read the
> > +protocol specific information, write details of grant references (for shared
> > +pages) and event channel ports (for signalling) that it has created, and set
> > +the state node in the frontend area to 4 (XenbusStateConnected). Upon seeing this
> > +frontend state, the backend driver will then read the grant references (mapping
> > +the shared pages) and event channel ports (opening its end of them) and set the
> > +state node in the backend area to 4. Protocol establishment is now complete and
> > +the frontend and backend start to pass data.
> > +
> > +Because the domid of both ends of a PV protocol forms a key part of negotiating
> > +the data plane for that protocol (because it is encoded into both xenstore
> > +nodes and node paths), and because the guest’s own domid and the domid of the
> > +service domain are visible to the guest in xenstore (and hence may be cached
> > +internally), and neither are necessarily preserved during migration, it is
> > +hence necessary to have the co-operation of the frontend in re-negotiating the
> > +protocol using the new domid after migration.
> > +
> > +Moreover the backend-id value will be used by the frontend driver in setting up
> > +grant table entries and event channels to communicate with the service domain,
> > +so the co-operation of the guest is required to re-establish these in the new
> > +host environment after migration.
> > +
> > +Thus if we are to change the model and support migration of a guest with PV
> > +drivers, without the co-operation of the frontend driver code, the paths and
> > +values in both the frontend and backend xenstore areas must remain unchanged
> > +and valid in the new host environment, and the grant table entries and event
> > +channels must be preserved (and remain operational once guest execution is
> > +resumed).
> > +
> > +Because the service domain’s domid is used directly by the guest in setting
> > +up grant entries and event channels, the backend drivers in the new host
> > +environment must be provided by service domain with the same domid. Also,
> > +because the guest can sample its own domid from the frontend area and use it in
> > +hypercalls (e.g. HVMOP_set_param) rather than DOMID_SELF, the guest domid must
> > +also be preserved to maintain the ABI.
> > +
> > +Furthermore, it will be necessary to modify backend drivers to re-establish
> > +communication with frontend drivers without perturbing the content of the
> > +backend area or requiring any changes to the values of the xenstore state nodes.
> > +
> > +## Other Para-Virtual State
> > +
> > +### Shared Rings
> > +
> > +Because the console and store protocol shared pages are actually part of the
> > +guest memory image (in an E820 reserved region just below 4G) then the content
> 
> While Arm does not yet support migration, the concept of non-cooperative
> live migration is not x86 specific. I am OK with giving an arch-specific
> example, but it should be clear on which architecture this is valid.
> 

Ok.

> > +will get migrated as part of the guest memory image. Hence no additional code
> > +is required to prevent any guest-visible change in the content.
> > +
> > +### Shared Info
> > +
> > +There is already a record defined in *libxenctrl Domain Image Format* [3]
> > +called `SHARED_INFO` which simply contains a complete copy of the domain’s
> > +shared info page. It is not currently included in an HVM (type `0x0002`)
> > +migration stream. It may be feasible to include it as an optional record
> > +but it is not clear that the content of the shared info page ever needs
> > +to be preserved for an HVM guest.
> > +
> > +For a PV guest the `arch_shared_info` sub-structure contains important
> > +information about the guest’s P2M, but this information is not relevant for
> > +an HVM guest where the P2M is not directly manipulated by the guest. The other
> > +state contained in the `shared_info` structure relates to the domain wall-clock
> > +(the state of which should already be transferred by the `RTC` HVM context
> > +information which is contained in the `HVM_CONTEXT` save record) and some event
> > +channel state (particularly if using the *2l* protocol). Event channel state
> > +will need to be fully transferred if we are not going to require the guest
> > +co-operation to re-open the channels and so it should be possible to re-build a
> > +shared info page for an HVM guest from such other state.
> > +
> > +Note that the shared info page also contains an array of `XEN_LEGACY_MAX_VCPUS`
> > +(32) `vcpu_info` structures. A domain may nominate a different guest physical
> > +address to use for the vcpu info. This is mandatory if a domain wants to
> > +use more than 32 vCPUs and optional for legacy vCPUs. This mapping is not
> 
> Similar to above, those values are x86 specific. On Arm, only CPU0 is
> described in shared_info.
> 

Ok.

> > +currently transferred in the migration state so this will either need to be
> > +added into an existing save record, or an additional type of save record will
> > +be needed.
> > +
> > +### Xenstore Watches
> > +
> > +As mentioned above, no domain Xenstore state is currently transferred in the
> > +migration stream. There is a record defined in *libxenlight Domain Image
> > +Format* [4] called `EMULATOR_XENSTORE_DATA` for transferring Xenstore nodes
> > +relating to emulators but no record type is defined for nodes relating to the
> > +domain itself, nor for registered *watches*. A XenStore watch is a mechanism
> > +used by PV frontend and backend drivers to request a notification if the value
> > +of a particular node (e.g. the other end’s state node) changes, so it is
> > +important that watches continue to function after a migration. One or more new
> > +save records will therefore be required to transfer Xenstore state. It will
> > +also be necessary to extend the *store* protocol[5] with mechanisms to allow
> > +the toolstack to acquire the list of watches that the guest has registered and
> > +for the toolstack to register a watch on behalf of a domain.
> > +
> > +### Event channels
> > +
> > +Event channels are essentially the para-virtual equivalent of interrupts. They
> > +are an important part of most PV protocols. Normally a frontend driver creates
> > +an *inter-domain* event channel between its own domain and the domain running
> > +the backend, which it discovers using the `backend-id` node in Xenstore (see
> > +above), by making a `EVTCHNOP_alloc_unbound` hypercall. This hypercall
> > +allocates an event channel object in the hypervisor and assigns a *local port*
> > +number which is then written into the frontend area in Xenstore. The backend
> > +driver then reads this port number and *binds* to the event channel by
> > +specifying it, and the value of `frontend-id`, as *remote domain* and *remote
> > +port* (respectively) to a `EVTCHNOP_bind_interdomain` hypercall. Once
> > +connection is established in this fashion frontend and backend drivers can use
> > +the event channel as a *mailbox* to notify each other when a shared ring has
> > +been updated with new requests or response structures.
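The handshake just described can be sketched as a toy simulation (the two hypercalls are modelled as plain methods; domain IDs, port numbering and the table layout are illustrative, not the real ABI):

```python
# Toy model of the inter-domain event-channel handshake. Real guests issue
# EVTCHNOP_alloc_unbound / EVTCHNOP_bind_interdomain hypercalls; the domain
# IDs and port allocation policy here are arbitrary.
class Hypervisor:
    def __init__(self):
        self.channels = {}    # (domid, local_port) -> channel state
        self.next_port = {}   # domid -> next free local port number

    def alloc_unbound(self, dom, remote_dom):
        # EVTCHNOP_alloc_unbound: allocate a channel in `dom` that only
        # `remote_dom` may later bind to; return the new local port.
        port = self.next_port.get(dom, 1)
        self.next_port[dom] = port + 1
        self.channels[(dom, port)] = {"remote_dom": remote_dom, "peer": None}
        return port

    def bind_interdomain(self, dom, remote_dom, remote_port):
        # EVTCHNOP_bind_interdomain: bind a new local port in `dom` to the
        # unbound channel identified by (remote_dom, remote_port).
        chan = self.channels[(remote_dom, remote_port)]
        assert chan["remote_dom"] == dom, "channel reserved for another domain"
        port = self.next_port.get(dom, 1)
        self.next_port[dom] = port + 1
        self.channels[(dom, port)] = {"remote_dom": remote_dom,
                                      "peer": remote_port}
        chan["peer"] = port
        return port

xen = Hypervisor()
# Frontend (domid 5) allocates an unbound channel naming the backend (domid 0)
# and writes the resulting local port into its frontend area in xenstore...
frontend_port = xen.alloc_unbound(dom=5, remote_dom=0)
# ...then the backend reads that port number and binds to it.
backend_port = xen.bind_interdomain(dom=0, remote_dom=5,
                                    remote_port=frontend_port)
```

The *local port* numbers handed out here are exactly the per-guest context that the proposed save records would have to preserve across a non-cooperative migration.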
> > +
> > +Currently no event channel state is preserved on migration, requiring frontend
> > +and backend drivers to create and bind a completely new set of event channels
> > +in order to re-establish a protocol connection. Hence, one or more new save
> > +records will be required to transfer event channel state in order to avoid the
> > +need for explicit action by frontend drivers running in the guest. Note that
> > +the local port numbers need to be preserved in this state as they are the only
> > +context the guest has to refer to the hypervisor event channel objects.
> > +Note also that the PV *store* (Xenstore access) and *console* protocols also
> > +rely on event channels which are set up by the toolstack. Normally, early in
> > +migration, the toolstack running on the remote host would set up a new pair of
> > +event channels for these protocols in the destination domain. These may not be
> > +assigned the same local port numbers as the protocols running in the source
> > +domain. For non-cooperative migration these channels must either be created with
> > +fixed port numbers, or their creation must be avoided and instead be included
> > +in the general event channel state record(s).
> > +
> > +### Grant table
> > +
> > +The grant table is essentially the para-virtual equivalent of an IOMMU. For
> > +example, the shared rings of a PV protocol are *granted* by a frontend driver
> > +to the backend driver by allocating *grant entries* in the guest’s table,
> > +filling in details of the memory pages and then writing the *grant references*
> > +(the index values of the grant entries) into Xenstore. The grant references of
> > +the protocol buffers themselves are typically written directly into the request
> > +structures passed via a shared ring.
> > +
> > +The guest is responsible for managing its own grant table. No hypercall is
> > +required to grant a memory page to another domain. It is sufficient to find an
> > +unused grant entry and set bits in the entry to give read and/or write access
> > +to a remote domain also specified in the entry along with the page frame
> > +number. Thus the layout and content of the grant table logically forms part of
> > +the guest state.
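The in-memory nature of granting can be sketched as follows. The entry layout mirrors `grant_entry_v1` (flags, domid, frame) from Xen's public headers and the flag values shown are from that ABI, but the helper, table size and frame number are illustrative:

```python
# Sketch of a guest granting a page: find a free grant-table entry, fill it
# in, and hand the resulting grant reference to the peer via xenstore.
GTF_permit_access = 1   # entry grants access to the named domain
GTF_readonly      = 4   # remote domain may only map/read, not write

def grant_access(table, domid, frame, readonly=False):
    """Fill in an unused entry and return its grant reference (index)."""
    for ref, entry in enumerate(table):
        if entry["flags"] == 0:              # GTF_invalid: entry is free
            entry["domid"] = domid           # domain allowed to use the grant
            entry["frame"] = frame           # guest frame number being shared
            flags = GTF_permit_access | (GTF_readonly if readonly else 0)
            entry["flags"] = flags           # written last: entry becomes live
            return ref
    raise RuntimeError("grant table full")

# No hypercall involved: the guest simply edits its own table in memory.
table = [{"flags": 0, "domid": 0, "frame": 0} for _ in range(8)]
ring_gref = grant_access(table, domid=0, frame=0x1234)
```

It is precisely this table content — invisible to the toolstack today — that the proposed save records would need to carry.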
> > +
> > +Currently no grant table state is migrated, requiring a guest to separately
> > +maintain any state that it wishes to persist elsewhere in its memory image and
> > +then restore it after migration. Thus to avoid the need for such explicit
> > +action by the guest, one or more new save records will be required to migrate
> > +the contents of the grant table.
> > +
> > +# Outline Proposal
> > +
> > +* PV backend drivers will be modified to unilaterally re-establish connection
> > +to a frontend if the backend state node is restored with value 4
> > +(XenbusStateConnected)[6].
> > +
> > +* The toolstack should be modified to allow domid to be randomized on initial
> > +creation or default migration, but make it identical to the source domain on
> > +non-cooperative migration. Non-Cooperative migration will have to be denied if the
> > +domid is unavailable on the target host, but randomization of domid on creation
> > +should hopefully minimize the likelihood of this. Non-Cooperative migration to
> > +localhost will clearly not be possible. Patches have already been sent to
> > +`xen-devel` to make this change[7].
> 
> IIRC, the patch is merged now. You may want to update the last sentence.
> 

It is, since this has been outstanding for such a long time :-/

I'll fix it up.

  Paul

> > +
> > +* `xenstored` should be modified to implement the new mechanisms needed. See
> > +*Other Para-Virtual State* above. A further design document will propose
> > +additional protocol messages.
> > +
> > +* Within the migration stream extra save records will be defined as required.
> > +See *Other Para-Virtual State* above. A further design document will propose
> > +modifications to the libxenlight and libxenctrl Domain Image Formats.
> > +
> > +* An option should be added to the toolstack to initiate a non-cooperative
> > +migration, instead of the (default) potentially co-operative migration.
> > +Essentially this should skip the check to see whether PV drivers are present
> > +and migrate as if there are none, but also enable the extra save records.
> > +Note that at least some of the extra records should only form part of a
> > +non-cooperative migration stream. For example, migrating event channel state
> > +would be counterproductive in a normal migration as this would essentially
> > +leak event channel objects at the receiving end. Others, such as grant table
> > +state, could potentially form a harmless part of a normal migration stream.
> > +
> > +* * *
> > +[1] PV drivers are deemed to be installed if the HVM parameter
> > +*HVM_PARAM_CALLBACK_IRQ* has been set to a non-zero value.
> > +
> > +[2] See https://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=xen/include/public/io/xenbus.h
> > +
> > +[3] See https://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=docs/specs/libxc-migration-stream.pandoc
> > +
> > +[4] See https://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=docs/specs/libxl-migration-stream.pandoc
> > +
> > +[5] See https://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=docs/misc/xenstore.txt
> > +
> > +[6] `xen-blkback` and `xen-netback` have already been modified in Linux to do
> > +this.
> > +
> > +[7] See https://lists.xenproject.org/archives/html/xen-devel/2020-01/msg00632.html
> > +
> >
> 
> Cheers,
> 
> --
> Julien Grall
> 
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xenproject.org
> https://lists.xenproject.org/mailman/listinfo/xen-devel

* Re: [Xen-devel] [PATCH v5 1/2] docs/designs: Add a design document for non-cooperative live migration
  2020-03-04 15:23     ` Durrant, Paul
@ 2020-03-04 15:36       ` Julien Grall
  2020-03-04 16:03         ` Durrant, Paul
  0 siblings, 1 reply; 11+ messages in thread
From: Julien Grall @ 2020-03-04 15:36 UTC (permalink / raw)
  To: Durrant, Paul, xen-devel
  Cc: Stefano Stabellini, Wei Liu, Konrad Rzeszutek Wilk,
	George Dunlap, Andrew Cooper, Ian Jackson, Jan Beulich

Hi Paul,

On 04/03/2020 15:23, Durrant, Paul wrote:
>> -----Original Message-----
>> From: Xen-devel <xen-devel-bounces@lists.xenproject.org> On Behalf Of Julien Grall
>> Sent: 04 March 2020 15:11
>> To: Durrant, Paul <pdurrant@amazon.co.uk>; xen-devel@lists.xenproject.org
>> Cc: Stefano Stabellini <sstabellini@kernel.org>; Wei Liu <wl@xen.org>; Konrad Rzeszutek Wilk
>> <konrad.wilk@oracle.com>; George Dunlap <George.Dunlap@eu.citrix.com>; Andrew Cooper
>> <andrew.cooper3@citrix.com>; Ian Jackson <ian.jackson@eu.citrix.com>; Jan Beulich <jbeulich@suse.com>
>> Subject: Re: [Xen-devel] [PATCH v5 1/2] docs/designs: Add a design document for non-cooperative live
>> migration
>>
>> Hi Paul,
>>
>> The proposal looks sensible to me. Some NITpicking below.
>>
>> On 13/02/2020 10:53, Paul Durrant wrote:
>>> It has become apparent to some large cloud providers that the current
>>> model of cooperative migration of guests under Xen is not usable as it
>>> relies on software running inside the guest, which is likely beyond the
>>> provider's control.
>>> This patch introduces a proposal for non-cooperative live migration,
>>> designed not to rely on any guest-side software.
>>>
>>> Signed-off-by: Paul Durrant <pdurrant@amazon.com>
>>> ---
>>> Cc: Andrew Cooper <andrew.cooper3@citrix.com>
>>> Cc: George Dunlap <George.Dunlap@eu.citrix.com>
>>> Cc: Ian Jackson <ian.jackson@eu.citrix.com>
>>> Cc: Jan Beulich <jbeulich@suse.com>
>>> Cc: Julien Grall <julien@xen.org>
>>> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
>>> Cc: Stefano Stabellini <sstabellini@kernel.org>
>>> Cc: Wei Liu <wl@xen.org>
>>>
>>> v5:
>>>    - Note that PV domain are not just expected to co-operate, they are
>>>      required to
>>>
>>> v4:
>>>    - Fix issues raised by Wei
>>>
>>> v2:
>>>    - Use the term 'non-cooperative' instead of 'transparent'
>>>    - Replace 'trust in' with 'reliance on' when referring to guest-side
>>>      software
>>> ---
>>>    docs/designs/non-cooperative-migration.md | 272 ++++++++++++++++++++++
>>>    1 file changed, 272 insertions(+)
>>>    create mode 100644 docs/designs/non-cooperative-migration.md
>>>
>>> diff --git a/docs/designs/non-cooperative-migration.md b/docs/designs/non-cooperative-migration.md
>>> new file mode 100644
>>> index 0000000000..09f74c8c0d
>>> --- /dev/null
>>> +++ b/docs/designs/non-cooperative-migration.md
>>> @@ -0,0 +1,272 @@
>>> +# Non-Cooperative Migration of Guests on Xen
>>> +
>>> +## Background
>>> +
>>> +The normal model of migration in Xen is driven by the guest because it was
>>> +originally implemented for PV guests, where the guest must be aware it is
>>> +running under Xen and is hence expected to co-operate. This model dates from
>>> +an era when it was assumed that the host administrator had control of at least
>>> +the privileged software running in the guest (i.e. the guest kernel) which may
>>> +still be true in an enterprise deployment but is not generally true in a cloud
>>> +environment. The aim of this design is to provide a model which is purely host
>>> +driven, requiring no co-operation from the software running in the
>>> +guest, and is thus suitable for cloud scenarios.
>>> +
>>> +PV guests are out of scope for this project because, as is outlined above, they
>>> +have a symbiotic relationship with the hypervisor and therefore a certain level
>>> +of co-operation is required.
>>> +
>>> +HVM guests can already be migrated on Xen without guest co-operation but only
>>> +if they don’t have PV drivers installed[1] or are in power state S3. The
>>
>> S3 is very ACPI centric, so I would prefer if we avoid the term. I think
>> the non-ACPI description is "suspend to RAM". I would be OK if you
>> mention S3 in parentheses.
> 
> I'm actually pulling this from the way the code is currently written, which is clearly quite x86 specific:
> 
> xc_hvm_param_get(CTX->xch, domid, HVM_PARAM_ACPI_S_STATE, &hvm_s_state)
> .
> .
> .
> if (dsps->type == LIBXL_DOMAIN_TYPE_HVM && (!hvm_pvdrv || hvm_s_state)) {
>      LOGD(DEBUG, domid, "Calling xc_domain_shutdown on HVM domain");
>      ret = xc_domain_shutdown(CTX->xch, domid, SHUTDOWN_suspend);
>      .
>      .
> }
> 
> So actually I should say 'not in power state S0'.

I understand that the current code is x86 specific. Arm would likely 
have a similar requirement although not based on ACPI.

However, my point here is nothing in the document says it is focusing on 
x86 only. The concept itself is not arch specific, the document is 
mostly x86 free except in a couple of bits. So I would like them to be 
rewritten in an arch-agnostic way.

Note that I am ok with arch-specific example.

Cheers,

-- 
Julien Grall


* Re: [Xen-devel] [PATCH v5 1/2] docs/designs: Add a design document for non-cooperative live migration
  2020-03-04 15:36       ` Julien Grall
@ 2020-03-04 16:03         ` Durrant, Paul
  0 siblings, 0 replies; 11+ messages in thread
From: Durrant, Paul @ 2020-03-04 16:03 UTC (permalink / raw)
  To: 'Julien Grall', xen-devel
  Cc: Stefano Stabellini, Wei Liu, Konrad Rzeszutek Wilk,
	George Dunlap, Andrew Cooper, Ian Jackson, Jan Beulich

> -----Original Message-----
> >>> +HVM guests can already be migrated on Xen without guest co-operation but only
> >>> +if they don’t have PV drivers installed[1] or are in power state S3. The
> >>
> >> S3 is very ACPI centric, so I would prefer if we avoid the term. I think
> >> the non-ACPI description is "suspend to RAM". I would be OK if you
> >> mention S3 in parentheses.
> >
> > I'm actually pulling this from the way the code is currently written, which is clearly quite x86
> specific:
> >
> > xc_hvm_param_get(CTX->xch, domid, HVM_PARAM_ACPI_S_STATE, &hvm_s_state)
> > .
> > .
> > .
> > if (dsps->type == LIBXL_DOMAIN_TYPE_HVM && (!hvm_pvdrv || hvm_s_state)) {
> >      LOGD(DEBUG, domid, "Calling xc_domain_shutdown on HVM domain");
> >      ret = xc_domain_shutdown(CTX->xch, domid, SHUTDOWN_suspend);
> >      .
> >      .
> > }
> >
> > So actually I should say 'not in power state S0'.
> 
> I understand that the current code is x86 specific. Arm would likely
> have a similar requirement although not based on ACPI.
> 
> However, my point here is nothing in the document says it is focusing on
> x86 only. The concept itself is not arch specific, the document is
> mostly x86 free except in a couple of bits. So I would like them to be
> rewritten in an arch-agnostic way.
> 
> Note that I am ok with arch-specific example.
> 

Sure. I'll try not to be x86 specific where it's not necessary.

  Paul

* Re: [Xen-devel] [PATCH v5 2/2] docs/designs: Add a design document for migration of xenstore data
  2020-02-13 10:53 ` [Xen-devel] [PATCH v5 2/2] docs/designs: Add a design document for migration of xenstore data Paul Durrant
@ 2020-03-04 18:31   ` Julien Grall
  2020-03-05 15:03     ` Durrant, Paul
  0 siblings, 1 reply; 11+ messages in thread
From: Julien Grall @ 2020-03-04 18:31 UTC (permalink / raw)
  To: Paul Durrant, xen-devel
  Cc: Stefano Stabellini, Wei Liu, Konrad Rzeszutek Wilk,
	George Dunlap, Andrew Cooper, Ian Jackson, Jan Beulich

Hi Paul,

On 13/02/2020 10:53, Paul Durrant wrote:
> This patch proposes extra migration data and xenstore protocol
> extensions to support non-cooperative live migration of guests.
> 
> Signed-off-by: Paul Durrant <pdurrant@amazon.com>
> ---
> Cc: Andrew Cooper <andrew.cooper3@citrix.com>
> Cc: George Dunlap <George.Dunlap@eu.citrix.com>
> Cc: Ian Jackson <ian.jackson@eu.citrix.com>
> Cc: Jan Beulich <jbeulich@suse.com>
> Cc: Julien Grall <julien@xen.org>
> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
> Cc: Stefano Stabellini <sstabellini@kernel.org>
> Cc: Wei Liu <wl@xen.org>
> 
> v5:
>   - Add QUIESCE
>   - Make semantics of <index> in GET_DOMAIN_WATCHES more clear
> 
> v4:
>   - Drop the restrictions on special paths
> 
> v3:
>   - New in v3
> ---
>   docs/designs/xenstore-migration.md | 136 +++++++++++++++++++++++++++++
>   1 file changed, 136 insertions(+)
>   create mode 100644 docs/designs/xenstore-migration.md
> 
> diff --git a/docs/designs/xenstore-migration.md b/docs/designs/xenstore-migration.md
> new file mode 100644
> index 0000000000..5cfe2d9a7d
> --- /dev/null
> +++ b/docs/designs/xenstore-migration.md
> @@ -0,0 +1,136 @@
> +# Xenstore Migration
> +
> +## Background
> +
> +The design for *Non-Cooperative Migration of Guests*[1] explains that extra
> +save records are required in the migration stream to allow a guest running
> +PV drivers to be migrated without its co-operation. Moreover the save
> +records must include details of registered xenstore watches as well as
> +content; information that cannot currently be recovered from `xenstored`,
> +and hence some extension to the xenstore protocol[2] will also be required.
> +
> +The *libxenlight Domain Image Format* specification[3] already defines a
> +record type `EMULATOR_XENSTORE_DATA` but this is not suitable for
> +transferring xenstore data pertaining to the domain directly as it is
> +specified such that keys are relative to the path
> +`/local/domain/$dm_domid/device-model/$domid`. Thus it is necessary to
> +define at least one new save record type.
> +
> +## Proposal
> +
> +### New Save Record
> +
> +A new mandatory record type should be defined within the libxenlight Domain
> +Image Format:
> +
> +`0x00000007: DOMAIN_XENSTORE_DATA`
> +
> +The format of each of these new records should be as follows:
> +
> +
> +```
> +0     1     2     3     4     5     6     7 octet
> ++------------------------+------------------------+
> +| type                   | record specific data   |
> ++------------------------+                        |
> +...
> ++-------------------------------------------------+
> +```
> +
> +
> +| Field | Description |
> +|---|---|

Did you intend to add more - so | is in the same column as the other lines?

> +| `type` | 0x00000000: invalid |
> +|        | 0x00000001: node data |
> +|        | 0x00000002: watch data |

Shouldn't the last | be in the same column on all the lines?

> +|        | 0x00000003 - 0xFFFFFFFF: reserved for future use |

Looking at the spec, the command TRANSACTION_END *must* be used with an 
existing transaction. As a guest would be migrated to a new domain, the
transaction ID would now be invalid.

I understand that xenstored is able to cope with it, but such behavior 
is not described in the spec. So I am not sure we can expect a guest to 
cope with an error value other than the ones described for the command.

> +
> +
> +where data is always in the form of a NUL separated and terminated tuple
> +as follows
> +
> +
> +**node data**
> +
> +
> +`<path>|<value>|<perm-as-string>|`

I don't think this would work. From the spec, <value> is binary data and
can therefore contain NUL bytes, so you would not be able to tell where
the <perm-as-string> starts.

Regarding the <perm-as-string>, it is only describing the permission for 
one domain. If multiple domains can access the node, then you would have 
multiple <perm-as-string>. Do we want to transfer all the permissions, 
if not how do we define which permissions should be transferred?

> +
> +
> +`<path>` is considered relative to the domain path `/local/domain/$domid`
> +and hence must not begin with `/`.
> +`<path>` and `<value>` should be suitable to formulate a `WRITE` operation
> +to the receiving xenstore and `<perm-as-string>` should be similarly suitable
> +to formulate a subsequent `SET_PERMS` operation.
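For concreteness, here is a sketch of forming and naively parsing the proposed tuple. This is illustrative only: as observed in the review above, a plain split on NUL is ambiguous once `<value>` may contain NUL bytes, and the `n5` permission string is a made-up example.

```python
# Encode/decode the proposed NUL-separated, NUL-terminated node-data tuple:
#   <path>|<value>|<perm-as-string>|
# Assumes a text-only <value>; a binary value containing NUL bytes would
# defeat this naive parse, so the final format needs explicit framing.
def encode_node(path: bytes, value: bytes, perms: bytes) -> bytes:
    return b"\0".join((path, value, perms)) + b"\0"

def decode_node(record: bytes):
    path, value, perms, trailer = record.split(b"\0")
    assert trailer == b"", "record must be NUL-terminated"
    return path, value, perms

rec = encode_node(b"device/vbd/768/state", b"4", b"n5")
```

On the receiving side, the decoded fields would drive a `WRITE` followed by a `SET_PERMS`, as the text above describes.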
> +
> +**watch data**
> +
> +
> +`<path>|<token>|`
> +
> +`<path>` again is considered relative and, together with `<token>`, should
> +be suitable to formulate an `ADD_DOMAIN_WATCHES` operation (see below).

AFAICT, a guest is allowed to watch /. So is it sensible to only
transfer relative watches?

Also, how about special watch (i.e @...)?

> +
> +
> +### Protocol Extension
> +
> +Before xenstore state is migrated it is necessary to wait for any pending
> +reads, writes, watch registrations etc. to complete, and also to make sure
> +that xenstored does not start processing any new requests (so that new
> +requests remain pending on the shared ring for subsequent processing on the
> +new host). Hence the following operation is needed:
> +
> +```
> +QUIESCE                 <domid>|
> +
> +Complete processing of any request issued by the specified domain, and
> +do not process any further requests from the shared ring.
> +```
> +
> +The `WATCH` operation does not allow specification of a `<domid>`; it is
> +assumed that the watch pertains to the domain that owns the shared ring
> +over which the operation is passed. Hence, for the tool-stack to be able
> +to register a watch on behalf of a domain a new operation is needed:
> +
> +```
> +ADD_DOMAIN_WATCHES      <domid>|<watch>|+
> +
> +Adds watches on behalf of the specified domain.
> +
> +<watch> is a NUL separated tuple of <path>|<token>. The semantics of this
> +operation are identical to the domain issuing WATCH <path>|<token>| for
> +each <watch>.
> +```
> +
> +The watch information for a domain also needs to be extracted from the
> +sending xenstored so the following operation is also needed:
> +
> +```
> +GET_DOMAIN_WATCHES      <domid>|<index>   <gencnt>|<watch>|*
> +
> +Gets the list of watches that are currently registered for the domain.
> +
> +<watch> is a NUL separated tuple of <path>|<token>. The sub-list returned
> +will start at <index> items into the overall list of watches and may
> +be truncated (at a <watch> boundary) such that the returned data fits
> +within XENSTORE_PAYLOAD_MAX.
> +
> +If <index> is beyond the end of the overall list then the returned sub-
> +list will be empty. If the value of <gencnt> changes then it indicates
> +that the overall watch list has changed and thus it may be necessary
> +to re-issue the operation for previous values of <index>.
> +```
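A client-side walk of this paged operation might look like the sketch below, where `get_watches` stands in for issuing the proposed `GET_DOMAIN_WATCHES` request; the point illustrated is the restart rule when `<gencnt>` changes mid-walk:

```python
# Sketch of paging through GET_DOMAIN_WATCHES: re-issue with an increasing
# <index>, and restart from the beginning if <gencnt> changes, since that
# means the overall watch list changed underneath us.
def fetch_all_watches(get_watches, domid):
    while True:
        index, gencnt0, result = 0, None, []
        while True:
            gencnt, sublist = get_watches(domid, index)
            if gencnt0 is None:
                gencnt0 = gencnt
            elif gencnt != gencnt0:
                break                 # list changed underneath us: restart
            if not sublist:
                return result         # past the end of the list: done
            result.extend(sublist)
            index += len(sublist)

# A fake xenstored returning at most two watches per call (as if limited by
# XENSTORE_PAYLOAD_MAX), with a stable generation count of 7.
watches = [("device/vif/0/state", "t1"),
           ("device/vbd/768/state", "t2"),
           ("control/shutdown", "t3")]
def fake_get(domid, index):
    return 7, watches[index:index + 2]

collected = fetch_all_watches(fake_get, 0)
```

The watch paths, tokens and page size here are invented; only the index/gencnt mechanics follow the proposal.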
> +
> +It may also be desirable to state in the protocol specification that
> +the `INTRODUCE` operation should not clear the `<mfn>` specified such that

Not directly related to this patch, the '<mfn>' is slightly confusing
because, AFAICT, it will actually hold a GFN. To avoid spreading more
misuse, it would make sense to update the xenstore documentation
accordingly and use the new term here.

> +a `RELEASE` operation followed by an `INTRODUCE` operation form an
> +idempotent pair. The current implementation of *C xenstored* does this
> +(in the `domain_conn_reset()` function) but this could be dropped as this
> +behaviour is not currently specified and the page will always be zeroed
> +for a newly created domain.
> +
> +
> +* * *
> +
> +[1] See https://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=docs/designs/non-cooperative-migration.md
> +[2] See https://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=docs/misc/xenstore.txt
> +[3] See https://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=docs/specs/libxl-migration-stream.pandoc
> 

Cheers,

-- 
Julien Grall


* Re: [Xen-devel] [PATCH v5 2/2] docs/designs: Add a design document for migration of xenstore data
  2020-03-04 18:31   ` Julien Grall
@ 2020-03-05 15:03     ` Durrant, Paul
  0 siblings, 0 replies; 11+ messages in thread
From: Durrant, Paul @ 2020-03-05 15:03 UTC (permalink / raw)
  To: Julien Grall, xen-devel
  Cc: Stefano Stabellini, Wei Liu, Konrad Rzeszutek Wilk,
	George Dunlap, Andrew Cooper, Ian Jackson, Jan Beulich

> -----Original Message-----
> From: Julien Grall <julien@xen.org>
> Sent: 04 March 2020 18:32
> To: Durrant, Paul <pdurrant@amazon.co.uk>; xen-devel@lists.xenproject.org
> Cc: Andrew Cooper <andrew.cooper3@citrix.com>; George Dunlap <George.Dunlap@eu.citrix.com>; Ian
> Jackson <ian.jackson@eu.citrix.com>; Jan Beulich <jbeulich@suse.com>; Konrad Rzeszutek Wilk
> <konrad.wilk@oracle.com>; Stefano Stabellini <sstabellini@kernel.org>; Wei Liu <wl@xen.org>
> Subject: RE: [EXTERNAL][PATCH v5 2/2] docs/designs: Add a design document for migration of xenstore
> data
> 
> CAUTION: This email originated from outside of the organization. Do not click links or open
> attachments unless you can confirm the sender and know the content is safe.
> 
> 
> 
> Hi Paul,
> 
> On 13/02/2020 10:53, Paul Durrant wrote:
> > This patch details proposes extra migration data and xenstore protocol
> > extensions to support non-cooperative live migration of guests.
> >
> > Signed-off-by: Paul Durrant <pdurrant@amazon.com>
> > ---
> > Cc: Andrew Cooper <andrew.cooper3@citrix.com>
> > Cc: George Dunlap <George.Dunlap@eu.citrix.com>
> > Cc: Ian Jackson <ian.jackson@eu.citrix.com>
> > Cc: Jan Beulich <jbeulich@suse.com>
> > Cc: Julien Grall <julien@xen.org>
> > Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
> > Cc: Stefano Stabellini <sstabellini@kernel.org>
> > Cc: Wei Liu <wl@xen.org>
> >
> > v5:
> >   - Add QUIESCE
> >   - Make semantics of <index> in GET_DOMAIN_WATCHES more clear
> >
> > v4:
> >   - Drop the restrictions on special paths
> >
> > v3:
> >   - New in v3
> > ---
> >   docs/designs/xenstore-migration.md | 136 +++++++++++++++++++++++++++++
> >   1 file changed, 136 insertions(+)
> >   create mode 100644 docs/designs/xenstore-migration.md
> >
> > diff --git a/docs/designs/xenstore-migration.md b/docs/designs/xenstore-migration.md
> > new file mode 100644
> > index 0000000000..5cfe2d9a7d
> > --- /dev/null
> > +++ b/docs/designs/xenstore-migration.md
> > @@ -0,0 +1,136 @@
> > [snip]
> > +| Field | Description |
> > +|---|---|
> 
> Did you intend to add more - so | is in the same column as the other lines?
> 

Yep, cut'n'paste error.

> > +| `type` | 0x00000000: invalid |
> > +|        | 0x00000001: node data |
> > +|        | 0x00000002: watch data |
> 
> Shouldn't the last | be in the same column on all the lines?
> 
> > +|        | 0x00000003 - 0xFFFFFFFF: reserved for future use |
> 
> Looking at the spec, the command TRANSACTION_END *must* be used with an
> existing transaction. As a guest would be migrated to a new domain, the
> transaction ID would now be invalid.
> 
> I understand that xenstored is able to cope with it, but such behavior
> is not described in the spec. So I am not sure we can expect a guest to
> cope with an error value other than the ones described for the command.
> 

And (as we discussed offline) there would be an issue if the migrated guest started a new transaction before completing one that was started pre-migration, as the ids may clash. So, we are going to need a record to transfer open transaction ids so that we can reserve them in the receiving xenstored.

> > +
> > +
> > +where data is always in the form of a NUL separated and terminated tuple
> > +as follows
> > +
> > +
> > +**node data**
> > +
> > +
> > +`<path>|<value>|<perm-as-string>|`
> 
> I don't think this would work. From the spec, <value> is binary data and
> can therefore contain NUL bytes, so you would not be able to tell where
> the <perm-as-string> starts.
> 
> Regarding the <perm-as-string>, it is only describing the permission for
> one domain. If multiple domains can access the node, then you would have
> multiple <perm-as-string>. Do we want to transfer all the permissions,
> if not how do we define which permissions should be transferred?

Yes this should cope with multiple perms and binary data, even though I think we don't necessarily need it in the normal case.

> 
> > +
> > +
> > +`<path>` is considered relative to the domain path `/local/domain/$domid`
> > +and hence must not begin with `/`.
> > +`<path>` and `<value>` should be suitable to formulate a `WRITE` operation
> > +to the receiving xenstore and `<perm-as-string>` should be similarly suitable
> > +to formulate a subsequent `SET_PERMS` operation.
> > +
> > +**watch data**
> > +
> > +
> > +`<path>|<token>|`
> > +
> > +`<path>` again is considered relative and, together with `<token>`, should
> > +be suitable to formulate an `ADD_DOMAIN_WATCHES` operation (see below).
> 
> AFAICT, a guest is allowed to watch /. So is it sensible to only
> transfer relative watches?
> 
> Also, how about special watch (i.e @...)?

I guess we need to cope with whatever a guest is allowed to register... which appears to be anything.

> 
> > [snip]
> > +
> > +It may also be desirable to state in the protocol specification that
> > +the `INTRODUCE` operation should not clear the `<mfn>` specified such that
> 
> Not directly related to this patch, the '<mfn>' is slightly confusing
> because, AFAICT, it will actually hold a GFN. To avoid spreading more
> misuse, it would make sense to update the xenstore documentation
> accordingly and use the new term here.
> 

Ok, I can add a small patch to modify the doc.

  Paul


end of thread, other threads:[~2020-03-05 15:04 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-02-13 10:53 [Xen-devel] [PATCH v5 0/2] docs: Migration design documents Paul Durrant
2020-02-13 10:53 ` [Xen-devel] [PATCH v5 1/2] docs/designs: Add a design document for non-cooperative live migration Paul Durrant
2020-03-04 15:10   ` Julien Grall
2020-03-04 15:23     ` Durrant, Paul
2020-03-04 15:36       ` Julien Grall
2020-03-04 16:03         ` Durrant, Paul
2020-02-13 10:53 ` [Xen-devel] [PATCH v5 2/2] docs/designs: Add a design document for migration of xenstore data Paul Durrant
2020-03-04 18:31   ` Julien Grall
2020-03-05 15:03     ` Durrant, Paul
2020-02-20 12:54 ` [Xen-devel] [PATCH v5 0/2] docs: Migration design documents Durrant, Paul
2020-02-28 17:20   ` Durrant, Paul
