* [Xen-devel] Design session report: Live-Updating Xen
From: Foerster, Leonard @ 2019-07-15 18:57 UTC (permalink / raw)
  To: xen-devel


Here is the summary/notes from the Xen Live-Update Design session last week.
I tried to tie together the different topics we talked about into some sections.

https://cryptpad.fr/pad/#/2/pad/edit/fCwXg1GmSXXG8bc4ridHAsnR/

--
Leonard

LIVE UPDATING XEN - DESIGN SESSION

Brief project overview:
	-> We want to build Xen Live-update
	-> early prototyping phase
	IDEA: change running hypervisor to new one without guest disruptions
	-> Reasons:
		* Security - we might need an updated version for vulnerability mitigation
		* Development cycle acceleration - fast switch to a new hypervisor during development
		* Maintainability - reduce version diversity in the fleet
	-> We are currently eyeing a combination of guest transparent live migration
		and kexec into a new xen build
	-> For more details: https://xensummit19.sched.com/event/PFVQ/live-updating-xen-amit-shah-david-woodhouse-amazon

Terminology:
	Running Xen -> The xen running on the host before update (Source)
	Target Xen -> The xen we are updating *to*

Design discussions:

Live-update ties into multiple other projects currently done in the Xen-project:

	* Secret free Xen: reduce the footprint of guest relevant data in Xen
		-> less state we might have to handle in the live update case
	* dom0less: bootstrap domains without the involvement of dom0
		-> this might come in handy to at least set up and continue dom0 on the target xen
		-> If we have this, it might also enable us to de-serialize the state for
			other guest domains in xen and not have to wait for dom0 to do this

We want to just keep domain and hardware state
	-> Xen itself is supposed to be exchanged completely
	-> We have to keep the IOMMU page tables around and must not touch them
		-> this might also come in handy for some newer UEFI boot related issues?
		-> We might have to go and re-inject certain interrupts
	-> do we need to dis-aggregate xenheap and domheap here?
		-> We are currently trying to avoid this

A key cornerstone for Live-update is guest transparent live migration
	-> This means we are using a well defined ABI for saving/restoring domain state
		-> We do only rely on domain state and no internal xen state
	-> The idea is to migrate the guest not from one machine to another (in space)
		but on the same machine from one hypervisor to another (in time)
	-> In addition we want to keep as much as possible in memory unchanged and feed
		this back to the target domain in order to save time
	-> This means we will need additional info on those memory areas and have to
		be super careful not to stomp over them while starting the target xen (see sketch below)
	-> for live migration: domid is a problem in this case
		-> randomize and pray does not work on smaller fleets
		-> this is not a problem for live-update
		-> BUT: as a community we should make this restriction go away
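
	-> Illustrative sketch of the kind of "additional info" the running xen could
		hand over for those preserved memory areas (names and fields are made up
		for illustration, no such ABI exists yet):

		/* Hypothetical table of memory ranges the target xen must not touch. */
		#include <stdint.h>

		enum lu_range_kind {
		    LU_RANGE_DOMAIN_RAM,    /* guest RAM kept in place */
		    LU_RANGE_IOMMU_PT,      /* IOMMU page tables left untouched */
		    LU_RANGE_STATE_STREAM,  /* serialized domain state records */
		};

		struct lu_preserved_range {
		    uint64_t start_mfn;     /* first machine frame of the range */
		    uint64_t nr_frames;     /* length in 4 KiB frames */
		    uint32_t kind;          /* enum lu_range_kind */
		    uint32_t domid;         /* owning domain, if any */
		};

		struct lu_preserved_table {
		    uint32_t version;       /* ABI version of this table */
		    uint32_t nr_ranges;
		    struct lu_preserved_range ranges[];
		};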

Exchanging the Hypervisor using kexec
	-> We have patches merged in upstream kexec-tools that enable multiboot2 for Xen
	-> We can now load the target xen binary into the crashdump region to not stomp
		over any valuable data we might need later (see sketch below)
	-> But using the crashdump region for this has drawbacks when it comes to debugging
		and we might want to think about this later
		-> What happens when live-update goes wrong?
		-> Option: Increase Crashdump region size and partition it or have a separate
			reserved live-update region to load the target xen into 
		-> Separate region or partitioned region is not a priority for V1 but should
			be on the road map for future versions
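
	-> Rough illustration of the loading step: the merged kexec-tools multiboot2
		support does the real work of computing the segments and entry point, but
		the kernel interface it drives looks roughly like this (sketch only):

		/* Hand a target xen image to the reserved crash-kernel region via the
		 * raw kexec_load(2) syscall; segment layout and entry are assumed given. */
		#include <linux/kexec.h>   /* struct kexec_segment, KEXEC_* flags */
		#include <sys/syscall.h>
		#include <unistd.h>

		static long load_target_xen(void *image, size_t size,
		                            unsigned long crash_base, unsigned long entry)
		{
		    struct kexec_segment seg = {
		        .buf   = image,               /* target xen binary in memory */
		        .bufsz = size,
		        .mem   = (void *)crash_base,  /* destination: crash region */
		        .memsz = (size + 4095) & ~4095UL,
		    };

		    /* KEXEC_ON_CRASH keeps the image inside the reserved region, so it
		     * cannot stomp on memory the running xen still needs. */
		    return syscall(SYS_kexec_load, entry, 1UL, &seg,
		                   KEXEC_ON_CRASH | KEXEC_ARCH_DEFAULT);
		}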

Who serializes and deserializes domain state?
	-> dom0: This should work fine, but who does this for dom0 itself?
	-> Xen: This will need some more work, but might be covered mostly by the dom0less effort on the arm side
		-> this will need some work for x86, but Stefano does not consider this a lot of work
	-> This would mean: serialize domain state into multiboot module and set domains
		up after kexecing xen in the dom0less manner
		-> make the multiboot module general enough so we can tag it as boot/resume/create/etc. (sketched below)
			-> this will also enable us to do per-guest feature enablement
			-> finer granularity than specifying on the cmdline
			-> cmdline stuff is mostly broken, needs to be fixed for nested either way
			-> domain create flags are a mess
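
	-> Hypothetical shape of such a tagged module header (illustrative only,
		nothing like this is specified yet):

		#include <stdint.h>

		enum lu_module_kind {
		    LU_MOD_BOOT,    /* plain boot module (kernel/initrd), dom0less style */
		    LU_MOD_CREATE,  /* create a fresh domain from this module */
		    LU_MOD_RESUME,  /* resume a domain from serialized state */
		};

		struct lu_module_header {
		    uint32_t magic;          /* marks a tagged live-update module */
		    uint32_t kind;           /* enum lu_module_kind */
		    uint32_t domid;          /* domain the module belongs to, if any */
		    uint64_t feature_flags;  /* per-guest feature enablement */
		    uint64_t payload_len;    /* bytes of payload after this header */
		};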

Live update instead of crashdump?
	-> Can we use such capabilities to recover from a crash by "restarting" xen on a crash?
		-> live updating into (the same) xen on crash
	-> crashing is a good mechanism because it happens if something is really broken and
		most likely not recoverable
	-> Live update should be a conscious process and not something you do as reaction to a crash
		-> something is really broken if we crash
		-> we should not proactively restart xen on crash
			-> we might run into crash loops
	-> maybe this can be done in the future, but it is not changing anything for the design
		-> if anybody wants to wire this up once live update is there, that should not be too hard
		-> then you want to think about: scattering the domains to multiple other hosts to not keep
			them on broken machines

We should use this opportunity to clean up certain parts of the code base:
	-> interface for domain information is a mess
		-> HVM and PV have some shared data but completely different ways of accessing it

Volume of patches:
	-> Live update: still developing, we do not know yet
	-> guest transparent live migration:
		-> We have roughly 100 patches over time
		-> we believe most of this just has to be cleaned up/squashed and
			will land us at a much lower, more reasonable number
		-> this also needs 2-3 dom0 kernel patches

Summary of action items:
	-> coordinate with dom0less effort on what we can use and contribute there
	-> fix the domid clash problem
	-> Decision on usage of crash kernel area
	-> fix live migration patch set to include yet unsupported backends
		-> clean up the patch set
		-> upstream it

Longer term vision:

* Have a tiny hypervisor between Guest and Xen that handles the common cases
	-> this enables (almost) zero downtime for the guest
	-> the tiny hypervisor will maintain the guest while the underlying xen is kexecing into the new build

* Somebody someday will want to get rid of the long tail of old xen versions in a fleet
	-> live patch old running versions with live update capability?
	-> crashdumping into a new hypervisor?
		-> "crazy idea" but this will likely come up at some point


* Re: [Xen-devel] Design session report: Live-Updating Xen
From: Sarah Newman @ 2019-07-15 19:31 UTC (permalink / raw)
  To: Foerster, Leonard, xen-devel

On 7/15/19 11:57 AM, Foerster, Leonard wrote:
...
> A key cornerstone for Live-update is guest transparent live migration
...
> 	-> for live migration: domid is a problem in this case
> 		-> randomize and pray does not work on smaller fleets
> 		-> this is not a problem for live-update
> 		-> BUT: as a community we shoudl make this restriction go away

Andrew Cooper pointed out to me that manually assigning domain IDs is supported in much of the code already. If guest transparent live migration gets 
merged, we'll look at passing in a domain ID to xl, which would be good enough for us. I don't know about the other toolstacks.

--Sarah


* Re: [Xen-devel] Design session report: Live-Updating Xen
From: Juergen Gross @ 2019-07-16  3:48 UTC (permalink / raw)
  To: Sarah Newman, Foerster, Leonard, xen-devel

On 15.07.19 21:31, Sarah Newman wrote:
> On 7/15/19 11:57 AM, Foerster, Leonard wrote:
> ...
>> A key cornerstone for Live-update is guest transparent live migration
> ...
>>     -> for live migration: domid is a problem in this case
>>         -> randomize and pray does not work on smaller fleets
>>         -> this is not a problem for live-update
>>         -> BUT: as a community we shoudl make this restriction go away
> 
> Andrew Cooper pointed out to me that manually assigning domain IDs is 
> supported in much of the code already. If guest transparent live 
> migration gets merged, we'll look at passing in a domain ID to xl, which 
> would be good enough for us. I don't know about the other toolstacks.

The main problem is the case where on the target host the domid of the
migrated domain is already in use by another domain. So you either need
a domid allocator spanning all hosts or the change of domid during
migration must be hidden from the guest for guest transparent migration.


Juergen


* Re: [Xen-devel] Design session report: Live-Updating Xen
From: Sarah Newman @ 2019-07-16  4:20 UTC (permalink / raw)
  To: Juergen Gross, Foerster, Leonard, xen-devel

On 7/15/19 8:48 PM, Juergen Gross wrote:
> On 15.07.19 21:31, Sarah Newman wrote:
>> On 7/15/19 11:57 AM, Foerster, Leonard wrote:
>> ...
>>> A key cornerstone for Live-update is guest transparent live migration
>> ...
>>>     -> for live migration: domid is a problem in this case
>>>         -> randomize and pray does not work on smaller fleets
>>>         -> this is not a problem for live-update
>>>         -> BUT: as a community we shoudl make this restriction go away
>>
>> Andrew Cooper pointed out to me that manually assigning domain IDs is supported in much of the code already. If guest transparent live migration 
>> gets merged, we'll look at passing in a domain ID to xl, which would be good enough for us. I don't know about the other toolstacks.
> 
> The main problem is the case where on the target host the domid of the
> migrated domain is already in use by another domain. So you either need
> a domid allocator spanning all hosts or the change of domid during
> migration must be hidden from the guest for guest transparent migration.

Yes. There are some cluster management systems which use xl rather than xapi.
They could be extended to manage domain IDs if it's too difficult to allow
the domain ID to change during migration.
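
A minimal sketch of what that could look like, assuming each host is simply
handed a disjoint slice of the usable domid space (everything below
DOMID_FIRST_RESERVED); purely illustrative, not an existing interface:

    /* Illustrative only: carve the usable domid space into per-host slices
     * so no two hosts can ever hand out the same domid. */
    #include <stdint.h>

    #define DOMID_FIRST_RESERVED 0x7FF0u  /* from xen/include/public/xen.h */

    struct domid_slice {
        uint16_t next;  /* next candidate domid on this host */
        uint16_t end;   /* exclusive upper bound of this host's slice */
    };

    static struct domid_slice slice_for_host(unsigned host, unsigned nr_hosts)
    {
        uint16_t span = DOMID_FIRST_RESERVED / nr_hosts;
        struct domid_slice s = {
            .next = (uint16_t)(host * span),
            .end  = (uint16_t)((host + 1) * span),
        };
        if (s.next == 0)
            s.next = 1;  /* domid 0 is dom0 */
        return s;
    }

    static int alloc_domid(struct domid_slice *s, uint16_t *out)
    {
        if (s->next >= s->end)
            return -1;   /* slice exhausted */
        *out = s->next++;
        return 0;
    }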

--Sarah


* Re: [Xen-devel] Design session report: Live-Updating Xen
From: Andrew Cooper @ 2019-07-16 22:27 UTC (permalink / raw)
  To: Sarah Newman, Juergen Gross, Foerster,  Leonard, xen-devel

On 16/07/2019 05:20, Sarah Newman wrote:
> On 7/15/19 8:48 PM, Juergen Gross wrote:
>> On 15.07.19 21:31, Sarah Newman wrote:
>>> On 7/15/19 11:57 AM, Foerster, Leonard wrote:
>>> ...
>>>> A key cornerstone for Live-update is guest transparent live migration
>>> ...
>>>>     -> for live migration: domid is a problem in this case
>>>>         -> randomize and pray does not work on smaller fleets
>>>>         -> this is not a problem for live-update
>>>>         -> BUT: as a community we shoudl make this restriction go away
>>>
>>> Andrew Cooper pointed out to me that manually assigning domain IDs
>>> is supported in much of the code already. If guest transparent live
>>> migration gets merged, we'll look at passing in a domain ID to xl,
>>> which would be good enough for us. I don't know about the other
>>> toolstacks.
>>
>> The main problem is the case where on the target host the domid of the
>> migrated domain is already in use by another domain. So you either need
>> a domid allocator spanning all hosts or the change of domid during
>> migration must be hidden from the guest for guest transparent migration.
>
> Yes. There are some cluster management systems which use xl rather
> than xapi.
> They could be extended to manage domain IDs if it's too difficult to
> allow
> the domain ID to change during migration.

For a v1 feature, having a restriction of "you must manage domids across
the cluster" is fine.  Guest-transparent migration is a very important
feature, and one where we are lacking in relation to other hypervisors.

Longer term, we as the Xen community need to figure out a way to remove
the dependency on domids, at which point the cluster-wide management
restriction can be dropped.  This isn't going to be a trivial task, but
it will be a worthwhile one.

~Andrew


* Re: [Xen-devel] Design session report: Live-Updating Xen
From: Andrew Cooper @ 2019-07-16 23:51 UTC (permalink / raw)
  To: Foerster, Leonard, xen-devel

On 15/07/2019 19:57, Foerster, Leonard wrote:
> Here is the summary/notes from the Xen Live-Update Design session last week.
> I tried to tie together the different topics we talked about into some sections.
>
> https://cryptpad.fr/pad/#/2/pad/edit/fCwXg1GmSXXG8bc4ridHAsnR/
>
> --
> Leonard
>
> LIVE UPDATING XEN - DESING SESSION
>
> Brief project overview:
> 	-> We want to build Xen Live-update
> 	-> early prototyping phase
> 	IDEA: change running hypervisor to new one without guest disruptions
> 	-> Reasons:
> 		* Security - we might need an updated versions for vulnerability mitigation

I know I'm going to regret saying this, but livepatches are probably a
better bet in most cases for targeted security fixes.

> 		* Development cycle acceleration - fast switch to hypervisor during development
> 		* Maintainability - reduce version diversity in the fleet

:) I don't expect you to admit anything concrete on xen-devel, but I do
hope the divergence is at least a little better under control than last
time I got given an answer to this question.

> 	-> We are currently eyeing a combination of guest transparent live migration
> 		and kexec into a new xen build
> 	-> For more details: https://xensummit19.sched.com/event/PFVQ/live-updating-xen-amit-shah-david-woodhouse-amazon
>
> Terminology:
> 	Running Xen -> The xen running on the host before update (Source)
> 	Target Xen -> The xen we are updating *to*
>
> Design discussions:
>
> Live-update ties into multiple other projects currently done in the Xen-project:
>
> 	* Secret free Xen: reduce the footprint of guest relevant data in Xen
> 		-> less state we might have to handle in the live update case

I don't immediately see how this is related.  Secret-free Xen is to do
with having fewer things mapped by default.  It doesn't fundamentally
change the data that Xen needs to hold about guests, nor how this gets
arranged in memory.

> 	* dom0less: bootstrap domains without the involvement of dom0
> 		-> this might come in handy to at least setup and continue dom0 on target xen
> 		-> If we have this this might also enable us to de-serialize the state for
> 			other guest-domains in xen and not have to wait for dom0 to do this

Reconstruction of dom0 is something which Xen will definitely need to
do.  With the memory still in place, it's just a fairly small amount of register
state which needs restoring.

That said, reconstruction of the typerefs will be an issue.  Walking
over a fully populated L4 tree can (in theory) take minutes, and it's not
safe to just start executing without reconstruction.

Depending on how bad it is in practice, one option might be to do a
demand validate of %rip and %rsp, along with a hybrid shadow mode which
turns faults into typerefs, which would allow the gross cost of
revalidation to be amortised while the vcpus were executing.  We would
definitely want some kind of logic to aggressively typeref outstanding
pagetables so the shadow mode could be turned off.

> We want to just keep domain and hardware state
> 	-> Xen is supposedly completely to be exchanged
> 	-> We have to keep around the IOMMU page tables and do not touch them
> 		-> this might also come in handy for some newer UEFI boot related issues?

This is for Pre-DXE DMA protection, which IIRC is part of the UEFI 2.7
spec.  It basically means that the IOMMU is set up and inhibiting DMA
before any firmware starts using RAM.

In both cases, it involves Xen's IOMMU driver being capable of
initialising with the IOMMU already active, and in a way which keeps DMA
and interrupt remapping safe.

This is a chunk of work which should probably be split out into an
independent prerequisite.

> 		-> We might have to go and re-inject certain interrupts

What hardware are you targeting here?  IvyBridge and later has a posted
interrupt descriptor which can accumulate pending interrupts (at least
manually), and newer versions (Broadwell?) can accumulate interrupts
directly from hardware.

> 	-> do we need to dis-aggregate xenheap and domheap here?
> 		-> We are currently trying to avoid this

I don't think this will be necessary, or indeed a useful thing to try
considering.  There should be an absolute minimal amount of dependency
between the two versions of Xen, to allow for the maximum flexibility in
upgradeable scenarios.

>
> A key cornerstone for Live-update is guest transparent live migration
> 	-> This means we are using a well defined ABI for saving/restoring domain state
> 		-> We do only rely on domain state and no internal xen state

Absolutely.  One issue I discussed with David a while ago is that even
across an upgrade of Xen, the format of the EPT/NPT pagetables might
change, at least in terms of the layout of software bits.  (Especially
for EPT where we slowly lose software bits to new hardware features we
wish to use.)

> 	-> The idea is to migrate the guest not from one machine to another (in space)
> 		but on the same machine from one hypervisor to another (in time)
> 	-> In addition we want to keep as much as possible in memory unchanged and feed
> 		this back to the target domain in order to save time
> 	-> This means we will need additional info on those memory areas and have to
> 		be super careful not to stomp over them while starting the target xen
> 	-> for live migration: domid is a problem in this case
> 		-> randomize and pray does not work on smaller fleets
> 		-> this is not a problem for live-update
> 		-> BUT: as a community we shoudl make this restriction go away
>
> Exchanging the Hypervisor using kexec
> 	-> We have patches on upstream kexec-tools merged that enable multiboot2 for Xen
> 	-> We can now load the target xen binary to the crashdump region to not stomp
> 		over any valuable date we might need later
> 	-> But using the crashdump region for this has drawbacks when it comes to debugging
> 		and we might want to think about this later
> 		-> What happens when live-update goes wrong?
> 		-> Option: Increase Crashdump region size and partition it or have a separate
> 			reserved live-update region to load the target xen into 
> 		-> Separate region or partitioned region is not a priority for V1 but should
> 			be on the road map for future versions

In terms of things needing physical contiguity, there is the Xen image
itself (a few MB), various driver datastructures (the IOMMU interrupt
remapping tables in particular, but I think we can probably scale the
size by the number of vectors behind them in practice, rather than
always making an order 7(or 8?) allocation to cover all 64k possible
handles.)  I think some of the directmap setup also expects to be able
to find free 2M superpages.
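
For the VT-d table, assuming the usual 128-bit IRTE, the arithmetic is:

    /* Back-of-envelope, assuming VT-d's 128-bit (16 byte) IRTEs:
     *   65536 handles * 16 bytes = 1 MiB of contiguous memory
     *   1 MiB / 4 KiB pages      = 256 pages = 2^8, i.e. an order-8 allocation
     * Scaling by the number of vectors actually in use shrinks this. */
    #define IRTE_BYTES       16UL
    #define MAX_IRT_HANDLES  65536UL
    #define IRT_TABLE_BYTES  (IRTE_BYTES * MAX_IRT_HANDLES)  /* 1 MiB */
    #define IRT_TABLE_ORDER  8                               /* 4 KiB pages */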

>
> Who serializes and deserializes domain state?
> 	-> dom0: This should work fine, but who does this for dom0 itself?
> 	-> Xen: This will need some more work, but might covered mostly by the dom0less effort on the arm side
> 		-> this will need some work for x86, but Stefano does not consider this a lot of work
> 	-> This would mean: serialize domain state into multiboot module and set domains
> 		up after kexecing xen in the dom0less manner
> 		-> make multiboot module general enough so we can tag it as boot/resume/create/etc.
> 			-> this will also enable us to do per-guest feature enablement

What is the intent here?

> 			-> finer granular than specifying on cmdline
> 			-> cmdline stuff is mostly broken, needs to be fixed for nested either way
> 			-> domain create flags is a mess

There is going to have to be some kind of translation from old state to
new settings.  In the past, lots of Xen was based on global settings, and
this is slowly being fixed into concrete per-domain settings.

>
> Live update instead of crashdump?
> 	-> Can we use such capabilities to recover from a crash be "restarting" xen on a crash?
> 		-> live updating into (the same) xen on crash
> 	-> crashing is a good mechanism because it happens if something is really broken and
> 		most likely not recoverable
> 	-> Live update should be a conscious process and not something you do as reaction to a crash
> 		-> something is really broken if we crash
> 		-> we should not proactively restart xen on crash
> 			-> we might run into crash loops
> 	-> maybe this can be done in the future, but it is not changing anything for the design
> 		-> if anybody wants to wire this up once live update is there, that should not be too hard
> 		-> then you want to think about: scattering the domains to multiple other hosts to not keep
> 			them on broken machines
>
> We should use this opportunity to clean up certain parts of the code base:
> 	-> interface for domain information is a mess
> 		-> HVM and PV have some shared data but completely different ways of accessing it
>
> Volume of patches:
> 	-> Live update: still developing, we do not know yet
> 	-> guest transparent live migration:
> 		-> We have roughly 100 patches over time
> 		-> we believe most of this has just to be cleaned up/squashed and
> 			will land us at a reasonable much lower number
> 		-> this also needs 2-3 dom0 kernel patches
>
> Summary of action items:
> 	-> coordinate with dom0less effort on what we can use and contribute there
> 	-> fix the domid clash problem
> 	-> Decision on usage of crash kernel area
> 	-> fix live migration patch set to include yet unsupported backends
> 		-> clean up the patch set
> 		-> upstream it
>
> Longer term vision:
>
> * Have a tiny hypervisor between Guest and Xen that handles the common cases
> 	-> this enables (almost) zero downtime for the guest
> 	-> the tiny hypervisor will maintain the guest while the underlying xen is kexecing into new build
>
> * Somebody someday will want to get rid of the long tail of old xen versions in a fleet
> 	-> live patch old running versions with live update capability?
> 	-> crashdumping into a new hypervisor?
> 		-> "crazy idea" but this will likely come up at some point

How much do you need to patch an old Xen to have kexec take over
cleanly?  Almost all of the complexity is on the destination side
AFAICT, which is good from a development point of view.

~Andrew


* Re: [Xen-devel] Design session report: Live-Updating Xen
From: Jan Beulich @ 2019-07-17  7:09 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: xen-devel, Leonard Foerster

On 17.07.2019 01:51, Andrew Cooper wrote:
> On 15/07/2019 19:57, Foerster, Leonard wrote:
>> 	* dom0less: bootstrap domains without the involvement of dom0
>> 		-> this might come in handy to at least setup and continue dom0 on target xen
>> 		-> If we have this this might also enable us to de-serialize the state for
>> 			other guest-domains in xen and not have to wait for dom0 to do this
> 
> Reconstruction of dom0 is something which Xen will definitely need to
> do.  With the memory still in place, its just a fairly small of register
> state which needs restoring.
> 
> That said, reconstruction of the typerefs will be an issue.  Walking
> over a fully populated L4 tree can (in theory) take minutes, and its not
> safe to just start executing without reconstruction.
> 
> Depending on how bad it is in practice, one option might be to do a
> demand validate of %rip and %rsp, along with a hybrid shadow mode which
> turns faults into typerefs, which would allow the gross cost of
> revalidation to be amortised while the vcpus were executing.  We would
> definitely want some kind of logic to aggressively typeref outstanding
> pagetables so the shadow mode could be turned off.

Neither walking the page table trees nor an on-demand re-creation can
possibly work, as pointed out during (partly informal) discussion: At
the very least the allocated and pinned states of pages can only be
transferred. Hence we seem to have come to agreement that struct
page_info instances have to be transformed (in place if possible, i.e.
when the sizes match, otherwise by copying).
>> 		-> We might have to go and re-inject certain interrupts
> 
> What hardware are you targeting here?  IvyBridge and later has a posted
> interrupt descriptor which can accumulate pending interrupts (at least
> manually), and newer versions (Broadwell?) can accumulate interrupts
> directly from hardware.

For HVM/PVH perhaps that's good enough. What about PV though?

>> A key cornerstone for Live-update is guest transparent live migration
>> 	-> This means we are using a well defined ABI for saving/restoring domain state
>> 		-> We do only rely on domain state and no internal xen state
> 
> Absolutely.  One issue I discussed with David a while ago is that even
> across an upgrade of Xen, the format of the EPT/NPT pagetables might
> change, at least in terms of the layout of software bits.  (Especially
> for EPT where we slowly lose software bits to new hardware features we
> wish to use.)

Right, and therefore a similar transformation like for struct page_info
may be unavoidable here too.

Re-using large data structures (or arrays thereof) may also turn out
useful in terms of latency until the new Xen actually becomes ready to
resume.

Jan

* Re: [Xen-devel] Design session report: Live-Updating Xen
From: Andrew Cooper @ 2019-07-17 11:26 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Leonard Foerster

On 17/07/2019 08:09, Jan Beulich wrote:
> On 17.07.2019 01:51, Andrew Cooper wrote:
>> On 15/07/2019 19:57, Foerster, Leonard wrote:
>>> 	* dom0less: bootstrap domains without the involvement of dom0
>>> 		-> this might come in handy to at least setup and continue dom0 on target xen
>>> 		-> If we have this this might also enable us to de-serialize the state for
>>> 			other guest-domains in xen and not have to wait for dom0 to do this
>> Reconstruction of dom0 is something which Xen will definitely need to
>> do.  With the memory still in place, its just a fairly small of register
>> state which needs restoring.
>>
>> That said, reconstruction of the typerefs will be an issue.  Walking
>> over a fully populated L4 tree can (in theory) take minutes, and its not
>> safe to just start executing without reconstruction.
>>
>> Depending on how bad it is in practice, one option might be to do a
>> demand validate of %rip and %rsp, along with a hybrid shadow mode which
>> turns faults into typerefs, which would allow the gross cost of
>> revalidation to be amortised while the vcpus were executing.  We would
>> definitely want some kind of logic to aggressively typeref outstanding
>> pagetables so the shadow mode could be turned off.
> Neither walking the page table trees nor and on-demand re-creation can
> possibly work, as pointed out during (partly informal) discussion: At
> the very least the allocated and pinned states of pages can only be
> transferred.

Pinned state exists in the current migrate stream.  Allocated does not -
it is an internal detail of how Xen handles the memory.

But yes - this observation means that we can't simply walk the guest
pagetables.

> Hence we seem to have come to agreement that struct
> page_info instances have to be transformed (in place if possible, i.e.
> when the sizes match, otherwise by copying).

-10 to this idea, if it can possibly be avoided.  In this case, it
definitely can be avoided.

We do not want to be grovelling around in the old Xen's datastructures,
because that adds a binary A=>B translation which is
per-old-version-of-xen, meaning that you need a custom build of each
target Xen which depends on the currently-running Xen, or have to
maintain a matrix of old versions which will be dependent on the local
changes, and therefore not suitable for upstream.

>>> 		-> We might have to go and re-inject certain interrupts
>> What hardware are you targeting here?  IvyBridge and later has a posted
>> interrupt descriptor which can accumulate pending interrupts (at least
>> manually), and newer versions (Broadwell?) can accumulate interrupts
>> directly from hardware.
> For HVM/PVH perhaps that's good enough. What about PV though?

What about PV?

The in-guest evtchn data structure will accumulate events just like a
posted interrupt descriptor.  Real interrupts will queue in the LAPIC
during the transition period.

We obviously can't let interrupts be dropped, but there also shouldn't
be any need to re-inject any.

>>> A key cornerstone for Live-update is guest transparent live migration
>>> 	-> This means we are using a well defined ABI for saving/restoring domain state
>>> 		-> We do only rely on domain state and no internal xen state
>> Absolutely.  One issue I discussed with David a while ago is that even
>> across an upgrade of Xen, the format of the EPT/NPT pagetables might
>> change, at least in terms of the layout of software bits.  (Especially
>> for EPT where we slowly lose software bits to new hardware features we
>> wish to use.)
> Right, and therefore a similar transformation like for struct page_info
> may be unavoidable here too.

None of that lives in the current migrate stream.  Again - it is
internal details, so is not something which is appropriate to be
inspected by the target Xen.

> Re-using large data structures (or arrays thereof) may also turn out
> useful in terms of latency until the new Xen actually becomes ready to
> resume.

When it comes to optimising the latency, there is a fair amount we might
be able to do ahead of the critical region, but I still think this would
be better done in terms of a "clean start" in the new Xen to reduce
binary dependences.

~Andrew


* Re: [Xen-devel] Design session report: Live-Updating Xen
From: Jan Beulich @ 2019-07-17 13:02 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: xen-devel, Leonard Foerster

On 17.07.2019 13:26, Andrew Cooper wrote:
> On 17/07/2019 08:09, Jan Beulich wrote:
>> On 17.07.2019 01:51, Andrew Cooper wrote:
>>> On 15/07/2019 19:57, Foerster, Leonard wrote:
>>>> 	* dom0less: bootstrap domains without the involvement of dom0
>>>> 		-> this might come in handy to at least setup and continue dom0 on target xen
>>>> 		-> If we have this this might also enable us to de-serialize the state for
>>>> 			other guest-domains in xen and not have to wait for dom0 to do this
>>> Reconstruction of dom0 is something which Xen will definitely need to
>>> do.  With the memory still in place, its just a fairly small of register
>>> state which needs restoring.
>>>
>>> That said, reconstruction of the typerefs will be an issue.  Walking
>>> over a fully populated L4 tree can (in theory) take minutes, and its not
>>> safe to just start executing without reconstruction.
>>>
>>> Depending on how bad it is in practice, one option might be to do a
>>> demand validate of %rip and %rsp, along with a hybrid shadow mode which
>>> turns faults into typerefs, which would allow the gross cost of
>>> revalidation to be amortised while the vcpus were executing.  We would
>>> definitely want some kind of logic to aggressively typeref outstanding
>>> pagetables so the shadow mode could be turned off.
>> Neither walking the page table trees nor and on-demand re-creation can
>> possibly work, as pointed out during (partly informal) discussion: At
>> the very least the allocated and pinned states of pages can only be
>> transferred.
> 
> Pinned state exists in the current migrate stream.  Allocated does not -
> it is an internal detail of how Xen handles the memory.
> 
> But yes - this observation means that we can't simply walk the guest
> pagetables.
> 
>> Hence we seem to have come to agreement that struct
>> page_info instances have to be transformed (in place if possible, i.e.
>> when the sizes match, otherwise by copying).
> 
> -10 to this idea, if it can possibly be avoided.  In this case, it
> definitely can be avoided.
> 
> We do not want to be grovelling around in the old Xen's datastructures,
> because that adds a binary A=>B translation which is
> per-old-version-of-xen, meaning that you need a custom build of each
> target Xen which depends on the currently-running Xen, or have to
> maintain a matrix of old versions which will be dependent on the local
> changes, and therefore not suitable for upstream.

Now the question is what alternative you would suggest. By you
saying "the pinned state lives in the migration stream", I assume
you mean to imply that Dom0 state should be handed from old to
new Xen via such a stream (minus raw data page contents)?

>>>> 		-> We might have to go and re-inject certain interrupts
>>> What hardware are you targeting here?  IvyBridge and later has a posted
>>> interrupt descriptor which can accumulate pending interrupts (at least
>>> manually), and newer versions (Broadwell?) can accumulate interrupts
>>> directly from hardware.
>> For HVM/PVH perhaps that's good enough. What about PV though?
> 
> What about PV?
> 
> The in-guest evtchn data structure will accumulate events just like a
> posted interrupt descriptor.  Real interrupts will queue in the LAPIC
> during the transition period.

Yes, that'll work as long as interrupts remain active from Xen's POV.
But if there's concern about a blackout period for HVM/PVH, then
surely there would also be such for PV.

>>>> A key cornerstone for Live-update is guest transparent live migration
>>>> 	-> This means we are using a well defined ABI for saving/restoring domain state
>>>> 		-> We do only rely on domain state and no internal xen state
>>> Absolutely.  One issue I discussed with David a while ago is that even
>>> across an upgrade of Xen, the format of the EPT/NPT pagetables might
>>> change, at least in terms of the layout of software bits.  (Especially
>>> for EPT where we slowly lose software bits to new hardware features we
>>> wish to use.)
>> Right, and therefore a similar transformation like for struct page_info
>> may be unavoidable here too.
> 
> None of that lives in the current migrate stream.  Again - it is
> internal details, so is not something which is appropriate to be
> inspected by the target Xen.
> 
>> Re-using large data structures (or arrays thereof) may also turn out
>> useful in terms of latency until the new Xen actually becomes ready to
>> resume.
> 
> When it comes to optimising the latency, there is a fair amount we might
> be able to do ahead of the critical region, but I still think this would
> be better done in terms of a "clean start" in the new Xen to reduce
> binary dependences.

Latency actually is only one aspect (albeit the larger the host, the more
relevant it is). Sufficient memory to have both old and new copies of the
data structures in place, plus the migration stream, is another. This
would especially become relevant when even DomU-s were to remain in
memory, rather than getting saved/restored.

Jan

* Re: [Xen-devel] Design session report: Live-Updating Xen
From: Andrew Cooper @ 2019-07-17 18:40 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Leonard Foerster

On 17/07/2019 14:02, Jan Beulich wrote:
> On 17.07.2019 13:26, Andrew Cooper wrote:
>> On 17/07/2019 08:09, Jan Beulich wrote:
>>> On 17.07.2019 01:51, Andrew Cooper wrote:
>>>> On 15/07/2019 19:57, Foerster, Leonard wrote:
>>>>> 	* dom0less: bootstrap domains without the involvement of dom0
>>>>> 		-> this might come in handy to at least setup and continue dom0 on target xen
>>>>> 		-> If we have this this might also enable us to de-serialize the state for
>>>>> 			other guest-domains in xen and not have to wait for dom0 to do this
>>>> Reconstruction of dom0 is something which Xen will definitely need to
>>>> do.  With the memory still in place, its just a fairly small of register
>>>> state which needs restoring.
>>>>
>>>> That said, reconstruction of the typerefs will be an issue.  Walking
>>>> over a fully populated L4 tree can (in theory) take minutes, and its not
>>>> safe to just start executing without reconstruction.
>>>>
>>>> Depending on how bad it is in practice, one option might be to do a
>>>> demand validate of %rip and %rsp, along with a hybrid shadow mode which
>>>> turns faults into typerefs, which would allow the gross cost of
>>>> revalidation to be amortised while the vcpus were executing.  We would
>>>> definitely want some kind of logic to aggressively typeref outstanding
>>>> pagetables so the shadow mode could be turned off.
>>> Neither walking the page table trees nor and on-demand re-creation can
>>> possibly work, as pointed out during (partly informal) discussion: At
>>> the very least the allocated and pinned states of pages can only be
>>> transferred.
>> Pinned state exists in the current migrate stream.  Allocated does not -
>> it is an internal detail of how Xen handles the memory.
>>
>> But yes - this observation means that we can't simply walk the guest
>> pagetables.
>>
>>> Hence we seem to have come to agreement that struct
>>> page_info instances have to be transformed (in place if possible, i.e.
>>> when the sizes match, otherwise by copying).
>> -10 to this idea, if it can possibly be avoided.  In this case, it
>> definitely can be avoided.
>>
>> We do not want to be grovelling around in the old Xen's datastructures,
>> because that adds a binary A=>B translation which is
>> per-old-version-of-xen, meaning that you need a custom build of each
>> target Xen which depends on the currently-running Xen, or have to
>> maintain a matrix of old versions which will be dependent on the local
>> changes, and therefore not suitable for upstream.
> Now the question is what alternative you would suggest. By you
> saying "the pinned state lives in the migration stream", I assume
> you mean to imply that Dom0 state should be handed from old to
> new Xen via such a stream (minus raw data page contents)?

Yes, and this is explicitly identified in the bullet point saying "We do
only rely on domain state and no internal xen state".

In practice, it is going to be far more efficient to have Xen
serialise/deserialise the domain register state etc, than to bounce it
via hypercalls.  By the time you're doing that in Xen, adding dom0 as
well is trivial.
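
For reference, the existing libxc migration stream already frames everything
as simple typed records (docs/specs/libxc-migration-stream.pandoc); dom0 and
live-update state could reuse the same framing, roughly:

    /* Rough shape of a record header in the libxc migration stream (the
     * real definitions live under tools/libxc); the body follows the
     * header, padded to a multiple of 8 bytes. */
    #include <stdint.h>

    struct rec_hdr {
        uint32_t type;    /* e.g. X86_PV_INFO, HVM_CONTEXT, ... */
        uint32_t length;  /* body length in bytes, excluding padding */
    };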

>
>>>>> 		-> We might have to go and re-inject certain interrupts
>>>> What hardware are you targeting here?  IvyBridge and later has a posted
>>>> interrupt descriptor which can accumulate pending interrupts (at least
>>>> manually), and newer versions (Broadwell?) can accumulate interrupts
>>>> directly from hardware.
>>> For HVM/PVH perhaps that's good enough. What about PV though?
>> What about PV?
>>
>> The in-guest evtchn data structure will accumulate events just like a
>> posted interrupt descriptor.  Real interrupts will queue in the LAPIC
>> during the transition period.
> Yes, that'll work as long as interrupts remain active from Xen's POV.
> But if there's concern about a blackout period for HVM/PVH, then
> surely there would also be such for PV.

The only fix for that is to reduce the length of the blackout period. 
We can't magically inject interrupts half way through the xen-to-xen
transition, because we can't run vcpus at that point in time.

>
>>>>> A key cornerstone for Live-update is guest transparent live migration
>>>>> 	-> This means we are using a well defined ABI for saving/restoring domain state
>>>>> 		-> We do only rely on domain state and no internal xen state
>>>> Absolutely.  One issue I discussed with David a while ago is that even
>>>> across an upgrade of Xen, the format of the EPT/NPT pagetables might
>>>> change, at least in terms of the layout of software bits.  (Especially
>>>> for EPT where we slowly lose software bits to new hardware features we
>>>> wish to use.)
>>> Right, and therefore a similar transformation like for struct page_info
>>> may be unavoidable here too.
>> None of that lives in the current migrate stream.  Again - it is
>> internal details, so is not something which is appropriate to be
>> inspected by the target Xen.
>>
>>> Re-using large data structures (or arrays thereof) may also turn out
>>> useful in terms of latency until the new Xen actually becomes ready to
>>> resume.
>> When it comes to optimising the latency, there is a fair amount we might
>> be able to do ahead of the critical region, but I still think this would
>> be better done in terms of a "clean start" in the new Xen to reduce
>> binary dependences.
> Latency actually is only one aspect (albeit the larger the host, the more
> relevant it is). Sufficient memory to have both old and new copies of the
> data structures in place, plus the migration stream, is another. This
> would especially become relevant when even DomU-s were to remain in
> memory, rather than getting saved/restored.

But we're still talking about something which is on a multi-MB scale,
rather than multi-GB scale.

Xen itself is tiny.  Sure there are overheads from the heap management
and pagetables etc, but the overwhelming majority of used memory is
guest RAM which is staying in place.

~Andrew


* Re: [Xen-devel] Design session report: Live-Updating Xen
From: Juergen Gross @ 2019-07-18  9:00 UTC (permalink / raw)
  To: Andrew Cooper, Sarah Newman, Foerster, Leonard, xen-devel

On 17.07.19 00:27, Andrew Cooper wrote:
> On 16/07/2019 05:20, Sarah Newman wrote:
>> On 7/15/19 8:48 PM, Juergen Gross wrote:
>>> On 15.07.19 21:31, Sarah Newman wrote:
>>>> On 7/15/19 11:57 AM, Foerster, Leonard wrote:
>>>> ...
>>>>> A key cornerstone for Live-update is guest transparent live migration
>>>> ...
>>>>>      -> for live migration: domid is a problem in this case
>>>>>          -> randomize and pray does not work on smaller fleets
>>>>>          -> this is not a problem for live-update
>>>>>          -> BUT: as a community we shoudl make this restriction go away
>>>>
>>>> Andrew Cooper pointed out to me that manually assigning domain IDs
>>>> is supported in much of the code already. If guest transparent live
>>>> migration gets merged, we'll look at passing in a domain ID to xl,
>>>> which would be good enough for us. I don't know about the other
>>>> toolstacks.
>>>
>>> The main problem is the case where on the target host the domid of the
>>> migrated domain is already in use by another domain. So you either need
>>> a domid allocator spanning all hosts or the change of domid during
>>> migration must be hidden from the guest for guest transparent migration.
>>
>> Yes. There are some cluster management systems which use xl rather
>> than xapi.
>> They could be extended to manage domain IDs if it's too difficult to
>> allow
>> the domain ID to change during migration.
> 
> For a v1 feature, having a restriction of "you must manage domids across
> the cluster" is a fine.  Guest-transparent migration is a very important
> feature, and one where we are lacking in relation to other hypervisors.
> 
> Longer term, we as the Xen community need to figure out a way to remove
> the dependency on domids, at which point the cluster-wide management
> restriction can be dropped.  This isn't going to be a trivial task, but
> it will be a worthwhile one.

Another problem is Xenstore watches. With guest transparent LM they are
lost today as there is currently no way to migrate them to the target 
Xenstore.

Live-Update could work around this issue via Xenstore-stubdom.
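
(Illustrative only: carrying watches across would mean serialising at least
the per-connection (path, token) pairs xenstored keeps, e.g. something like
the sketch below; nothing of the sort exists today.)

    #include <stdint.h>

    /* Hypothetical serialized form of one registered watch. */
    struct xs_watch_record {
        uint32_t conn_id;    /* connection/domain that registered the watch */
        uint16_t path_len;   /* length of the watched path, excluding NUL */
        uint16_t token_len;  /* length of the client token, excluding NUL */
        /* path bytes, then token bytes, follow this header */
    };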


Juergen


* Re: [Xen-devel] Design session report: Live-Updating Xen
From: Jan Beulich @ 2019-07-18  9:15 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: xen-devel, Leonard Foerster

On 17.07.2019 20:40, Andrew Cooper wrote:
> On 17/07/2019 14:02, Jan Beulich wrote:
>> On 17.07.2019 13:26, Andrew Cooper wrote:
>>> We do not want to be grovelling around in the old Xen's datastructures,
>>> because that adds a binary A=>B translation which is
>>> per-old-version-of-xen, meaning that you need a custom build of each
>>> target Xen which depends on the currently-running Xen, or have to
>>> maintain a matrix of old versions which will be dependent on the local
>>> changes, and therefore not suitable for upstream.
>> Now the question is what alternative you would suggest. By you
>> saying "the pinned state lives in the migration stream", I assume
>> you mean to imply that Dom0 state should be handed from old to
>> new Xen via such a stream (minus raw data page contents)?
> 
> Yes, and this in explicitly identified in the bullet point saying "We do
> only rely on domain state and no internal xen state".
> 
> In practice, it is going to be far more efficient to have Xen
> serialise/deserialise the domain register state etc, than to bounce it
> via hypercalls.  By the time you're doing that in Xen, adding dom0 as
> well is trivial.

So I must be missing some context here: How could hypercalls come into
the picture at all when it comes to "migrating" Dom0?

>>> The in-guest evtchn data structure will accumulate events just like a
>>> posted interrupt descriptor.  Real interrupts will queue in the LAPIC
>>> during the transition period.
>> Yes, that'll work as long as interrupts remain active from Xen's POV.
>> But if there's concern about a blackout period for HVM/PVH, then
>> surely there would also be such for PV.
> 
> The only fix for that is to reduce the length of the blackout period.
> We can't magically inject interrupts half way through the xen-to-xen
> transition, because we can't run vcpus at that point in time.

Hence David's proposal to "re-inject". We'd have to record them during
the blackout period, and inject once Dom0 is all set up again.

>>>> Re-using large data structures (or arrays thereof) may also turn out
>>>> useful in terms of latency until the new Xen actually becomes ready to
>>>> resume.
>>> When it comes to optimising the latency, there is a fair amount we might
>>> be able to do ahead of the critical region, but I still think this would
>>> be better done in terms of a "clean start" in the new Xen to reduce
>>> binary dependences.
>> Latency actually is only one aspect (albeit the larger the host, the more
>> relevant it is). Sufficient memory to have both old and new copies of the
>> data structures in place, plus the migration stream, is another. This
>> would especially become relevant when even DomU-s were to remain in
>> memory, rather than getting saved/restored.
> 
> But we're still talking about something which is on a multi-MB scale,
> rather than multi-GB scale.

On multi-TB systems frame_table[] is a multi-GB table. And with boot times
often scaling (roughly) with system size, live updating is (I guess) all
the more interesting on bigger systems.

Jan

* Re: [Xen-devel] Design session report: Live-Updating Xen
From: Paul Durrant @ 2019-07-18  9:16 UTC (permalink / raw)
  To: 'Juergen Gross',
	Andrew Cooper, Sarah Newman, Foerster, Leonard, xen-devel

> -----Original Message-----
> From: Xen-devel <xen-devel-bounces@lists.xenproject.org> On Behalf Of Juergen Gross
> Sent: 18 July 2019 10:00
> To: Andrew Cooper <Andrew.Cooper3@citrix.com>; Sarah Newman <srn@prgmr.com>; Foerster, Leonard
> <foersleo@amazon.com>; xen-devel@lists.xenproject.org
> Subject: Re: [Xen-devel] Design session report: Live-Updating Xen
> 
> On 17.07.19 00:27, Andrew Cooper wrote:
> > On 16/07/2019 05:20, Sarah Newman wrote:
> >> On 7/15/19 8:48 PM, Juergen Gross wrote:
> >>> On 15.07.19 21:31, Sarah Newman wrote:
> >>>> On 7/15/19 11:57 AM, Foerster, Leonard wrote:
> >>>> ...
> >>>>> A key cornerstone for Live-update is guest transparent live migration
> >>>> ...
> >>>>>      -> for live migration: domid is a problem in this case
> >>>>>          -> randomize and pray does not work on smaller fleets
> >>>>>          -> this is not a problem for live-update
> >>>>>          -> BUT: as a community we shoudl make this restriction go away
> >>>>
> >>>> Andrew Cooper pointed out to me that manually assigning domain IDs
> >>>> is supported in much of the code already. If guest transparent live
> >>>> migration gets merged, we'll look at passing in a domain ID to xl,
> >>>> which would be good enough for us. I don't know about the other
> >>>> toolstacks.
> >>>
> >>> The main problem is the case where on the target host the domid of the
> >>> migrated domain is already in use by another domain. So you either need
> >>> a domid allocator spanning all hosts or the change of domid during
> >>> migration must be hidden from the guest for guest transparent migration.
> >>
> >> Yes. There are some cluster management systems which use xl rather
> >> than xapi.
> >> They could be extended to manage domain IDs if it's too difficult to
> >> allow
> >> the domain ID to change during migration.
> >
> > For a v1 feature, having a restriction of "you must manage domids across
> > the cluster" is a fine.  Guest-transparent migration is a very important
> > feature, and one where we are lacking in relation to other hypervisors.
> >
> > Longer term, we as the Xen community need to figure out a way to remove
> > the dependency on domids, at which point the cluster-wide management
> > restriction can be dropped.  This isn't going to be a trivial task, but
> > it will be a worthwhile one.
> 
> Another problem are Xenstore watches. With guest transparent LM they are
> lost today as there is currently no way to migrate them to the target
> Xenstore.
> 
> Live-Update could work around this issue via Xenstore-stubdom.

Watches are one problem. There's also the problem of pending transactions.

  Paul

> 
> 
> Juergen
> 

* Re: [Xen-devel] Design session report: Live-Updating Xen
From: Paul Durrant @ 2019-07-18  9:29 UTC (permalink / raw)
  To: 'Foerster, Leonard'; +Cc: xen-devel

> -----Original Message-----
[snip]
> 
> Longer term vision:
> 
> * Have a tiny hypervisor between Guest and Xen that handles the common cases
> 	-> this enables (almost) zero downtime for the guest
> 	-> the tiny hypervisor will maintain the guest while the underlying xen is kexecing into new
> build
> 

This sounds very much more like a KVM system... The majority of Xen becomes the 'kernel and QEMU' part (i.e. it's the part that boot-straps the system, deals with IOMMU, APICs, etc. and incorporates the scheduler) and the tiny hypervisor is the 'kvm.ko' (deals with basic I/O and instruction emulation traps). Is that the general split you envisage?

  Paul


* Re: [Xen-devel] Design session report: Live-Updating Xen
From: Roger Pau Monné @ 2019-07-18  9:40 UTC (permalink / raw)
  To: Juergen Gross; +Cc: Andrew Cooper, xen-devel, Foerster, Leonard, Sarah Newman

On Thu, Jul 18, 2019 at 11:00:23AM +0200, Juergen Gross wrote:
> On 17.07.19 00:27, Andrew Cooper wrote:
> > On 16/07/2019 05:20, Sarah Newman wrote:
> > > On 7/15/19 8:48 PM, Juergen Gross wrote:
> > > > On 15.07.19 21:31, Sarah Newman wrote:
> > > > > On 7/15/19 11:57 AM, Foerster, Leonard wrote:
> > > > > ...
> > > > > > A key cornerstone for Live-update is guest transparent live migration
> > > > > ...
> > > > > >      -> for live migration: domid is a problem in this case
> > > > > >          -> randomize and pray does not work on smaller fleets
> > > > > >          -> this is not a problem for live-update
> > > > > >          -> BUT: as a community we should make this restriction go away
> > > > > 
> > > > > Andrew Cooper pointed out to me that manually assigning domain IDs
> > > > > is supported in much of the code already. If guest transparent live
> > > > > migration gets merged, we'll look at passing in a domain ID to xl,
> > > > > which would be good enough for us. I don't know about the other
> > > > > toolstacks.
> > > > 
> > > > The main problem is the case where on the target host the domid of the
> > > > migrated domain is already in use by another domain. So you either need
> > > > a domid allocator spanning all hosts or the change of domid during
> > > > migration must be hidden from the guest for guest transparent migration.
> > > 
> > > Yes. There are some cluster management systems which use xl rather
> > > than xapi.
> > > They could be extended to manage domain IDs if it's too difficult to
> > > allow
> > > the domain ID to change during migration.
> > 
> > For a v1 feature, having a restriction of "you must manage domids across
> > > the cluster" is fine.  Guest-transparent migration is a very important
> > feature, and one where we are lacking in relation to other hypervisors.
> > 
> > Longer term, we as the Xen community need to figure out a way to remove
> > the dependency on domids, at which point the cluster-wide management
> > restriction can be dropped.  This isn't going to be a trivial task, but
> > it will be a worthwhile one.
> 
> > Another problem is Xenstore watches. With guest transparent LM they are
> lost today as there is currently no way to migrate them to the target
> Xenstore.

Hm, I guess I'm missing something, but xenstored running either in
dom0 or in a stubdomain should be completely unaware of the hypervisor
being updated under its feet. The hypervisor itself doesn't have any
knowledge of xenstore state.

Roger.


* Re: [Xen-devel] Design session report: Live-Updating Xen
  2019-07-18  9:40           ` Roger Pau Monné
@ 2019-07-18  9:43             ` Juergen Gross
  0 siblings, 0 replies; 17+ messages in thread
From: Juergen Gross @ 2019-07-18  9:43 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: Andrew Cooper, Sarah Newman, Leonard Foerster, xen-devel

On 18.07.19 11:40, Roger Pau Monné  wrote:
> On Thu, Jul 18, 2019 at 11:00:23AM +0200, Juergen Gross wrote:
>> On 17.07.19 00:27, Andrew Cooper wrote:
>>> On 16/07/2019 05:20, Sarah Newman wrote:
>>>> On 7/15/19 8:48 PM, Juergen Gross wrote:
>>>>> On 15.07.19 21:31, Sarah Newman wrote:
>>>>>> On 7/15/19 11:57 AM, Foerster, Leonard wrote:
>>>>>> ...
>>>>>>> A key cornerstone for Live-update is guest transparent live migration
>>>>>> ...
>>>>>>>       -> for live migration: domid is a problem in this case
>>>>>>>           -> randomize and pray does not work on smaller fleets
>>>>>>>           -> this is not a problem for live-update
>>>>>>>           -> BUT: as a community we should make this restriction go away
>>>>>>
>>>>>> Andrew Cooper pointed out to me that manually assigning domain IDs
>>>>>> is supported in much of the code already. If guest transparent live
>>>>>> migration gets merged, we'll look at passing in a domain ID to xl,
>>>>>> which would be good enough for us. I don't know about the other
>>>>>> toolstacks.
>>>>>
>>>>> The main problem is the case where on the target host the domid of the
>>>>> migrated domain is already in use by another domain. So you either need
>>>>> a domid allocator spanning all hosts or the change of domid during
>>>>> migration must be hidden from the guest for guest transparent migration.
>>>>
>>>> Yes. There are some cluster management systems which use xl rather
>>>> than xapi.
>>>> They could be extended to manage domain IDs if it's too difficult to
>>>> allow
>>>> the domain ID to change during migration.
>>>
>>> For a v1 feature, having a restriction of "you must manage domids across
>>> the cluster" is fine.  Guest-transparent migration is a very important
>>> feature, and one where we are lacking in relation to other hypervisors.
>>>
>>> Longer term, we as the Xen community need to figure out a way to remove
>>> the dependency on domids, at which point the cluster-wide management
>>> restriction can be dropped.  This isn't going to be a trivial task, but
>>> it will be a worthwhile one.
>>
>> Another problem is Xenstore watches. With guest transparent LM they are
>> lost today as there is currently no way to migrate them to the target
>> Xenstore.
> 
> Hm, I guess I'm missing something, but xenstored running either in
> dom0 or in a stubdomain should be completely unaware of the hypervisor
> being updated under its feet. The hypervisor itself doesn't have any
> knowledge of xenstore state.

Oh, right, I was thinking about guest transparent LM first and then
widened the scope to Live Update.

So this is a problem for guest transparent LM only.


Juergen



* Re: [Xen-devel] Design session report: Live-Updating Xen
  2019-07-18  9:15           ` Jan Beulich
@ 2019-07-18 12:09             ` Amit Shah
  0 siblings, 0 replies; 17+ messages in thread
From: Amit Shah @ 2019-07-18 12:09 UTC (permalink / raw)
  To: Jan Beulich, Andrew Cooper; +Cc: xen-devel, dwmw2, Leonard Foerster

On Thu, 2019-07-18 at 09:15 +0000, Jan Beulich wrote:
> On 17.07.2019 20:40, Andrew Cooper wrote:
> > On 17/07/2019 14:02, Jan Beulich wrote:
> > > On 17.07.2019 13:26, Andrew Cooper wrote:
> > > > We do not want to be grovelling around in the old Xen's
> > > > datastructures, because that adds a binary A=>B translation which
> > > > is per-old-version-of-xen, meaning that you need a custom build of
> > > > each target Xen which depends on the currently-running Xen, or
> > > > have to maintain a matrix of old versions which will be dependent
> > > > on the local changes, and therefore not suitable for upstream.
> > > 
> > > Now the question is what alternative you would suggest. By you
> > > saying "the pinned state lives in the migration stream", I assume
> > > you mean to imply that Dom0 state should be handed from old to
> > > new Xen via such a stream (minus raw data page contents)?
> > 
> > Yes, and this is explicitly identified in the bullet point saying
> > "We do only rely on domain state and no internal xen state".
> > 
> > In practice, it is going to be far more efficient to have Xen
> > serialise/deserialise the domain register state etc, than to bounce
> > it via hypercalls.  By the time you're doing that in Xen, adding
> > dom0 as well is trivial.
> 
> So I must be missing some context here: How could hypercalls come into
> the picture at all when it comes to "migrating" Dom0?

Xen will have to orchestrate the "save/restore" aspects of the domains
here.  The flow roughly will be:

1. One hypercall to load the new Xen binary in memory
2. Another hypercall to:
  a. Pause domains (including dom0),
  b. Mask interrupts,
  c. Serialize state,
  d. kexec into new Xen binary, and deserialize state

We had briefly considered Dom0 (or another stub domain) orchestrating
the whole serializing aspect here, but that's just too slow and will
create more problems in practice, so the idea was quickly dumped.
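
Purely as an illustration of that two-step flow, a toolstack-side sketch (the op numbers, the struct and the hypercall wrapper below are all invented for the example; none of them is an existing Xen interface):

    /*
     * Toolstack-side sketch of the two steps above.  Everything here is
     * invented for illustration only.
     */
    #include <stdint.h>

    #define LIVEUPDATE_OP_LOAD  1   /* step 1: stage the new Xen image */
    #define LIVEUPDATE_OP_EXEC  2   /* step 2: pause, serialise, kexec */

    struct liveupdate_load {
        uint64_t image_addr;        /* address of the staged mb2 image */
        uint64_t image_size;        /* size of the image in bytes      */
    };

    /* Stub standing in for whatever hypercall mechanism would carry this. */
    static int liveupdate_hypercall(unsigned int op, void *arg)
    {
        (void)op; (void)arg;
        return 0;
    }

    int live_update(uint64_t addr, uint64_t size)
    {
        struct liveupdate_load load = {
            .image_addr = addr,
            .image_size = size,
        };
        int rc = liveupdate_hypercall(LIVEUPDATE_OP_LOAD, &load);

        if ( rc )
            return rc;

        /* Point of no return: domains pause, state is serialised, kexec. */
        return liveupdate_hypercall(LIVEUPDATE_OP_EXEC, NULL);
    }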

> 
> > > > The in-guest evtchn data structure will accumulate events just
> > > > like a posted interrupt descriptor.  Real interrupts will queue
> > > > in the LAPIC during the transition period.
> > > 
> > > Yes, that'll work as long as interrupts remain active from Xen's
> > > POV. But if there's concern about a blackout period for HVM/PVH,
> > > then surely there would also be such for PV.
> > 
> > The only fix for that is to reduce the length of the blackout period.
> > We can't magically inject interrupts half way through the xen-to-xen
> > transition, because we can't run vcpus at that point in time.
> 
> Hence David's proposal to "re-inject". We'd have to record them during
> the blackout period, and inject once Dom0 is all set up again.

We'll need both: as little downtime as possible, and to later re-inject
interrupts when domains continue execution.  The fewer reinjections we
have to do the better; but overall, the less visible this maintenance
activity is, the better as well.
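
One way to picture that bookkeeping, as a sketch only (the structure and helpers below are invented; how vectors are actually captured and re-injected is exactly the open design question):

    /*
     * Sketch of per-vCPU blackout bookkeeping: one bit per vector, filled
     * while the domain is paused, drained once the new Xen has restored
     * the domain.  Invented for illustration, not a proposed interface.
     */
    #include <stdint.h>

    #define NR_VECTORS 256

    struct blackout_irqs {
        uint64_t pending[NR_VECTORS / 64];    /* one bit per vector */
    };

    static inline void record_vector(struct blackout_irqs *b, uint8_t vec)
    {
        b->pending[vec / 64] |= 1ULL << (vec % 64);
    }

    /* After resume: walk the bitmap and hand each vector to an injector. */
    static inline void replay_vectors(const struct blackout_irqs *b,
                                      void (*inject)(uint8_t vec))
    {
        unsigned int v;

        for ( v = 0; v < NR_VECTORS; v++ )
            if ( b->pending[v / 64] & (1ULL << (v % 64)) )
                inject((uint8_t)v);
    }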

> 
> > > > > Re-using large data structures (or arrays thereof) may also turn
> > > > > out useful in terms of latency until the new Xen actually
> > > > > becomes ready to resume.
> > > > 
> > > > When it comes to optimising the latency, there is a fair amount we
> > > > might be able to do ahead of the critical region, but I still
> > > > think this would be better done in terms of a "clean start" in the
> > > > new Xen to reduce binary dependences.
> > > 
> > > Latency actually is only one aspect (albeit the larger the host, the
> > > more relevant it is). Sufficient memory to have both old and new
> > > copies of the data structures in place, plus the migration stream,
> > > is another. This would especially become relevant when even DomU-s
> > > were to remain in memory, rather than getting saved/restored.
> > 
> > But we're still talking about something which is on a multi-MB scale,
> > rather than multi-GB scale.
> 
> On multi-TB systems frame_table[] is a multi-GB table. And with boot
> times often scaling (roughly) with system size, live updating is (I
> guess) all the more interesting on bigger systems.

We've not had to look closely at all these things yet - but this is
also perhaps the only point David and I keep quibbling about - will
there be any Xen state, and will we need to do anything about it?  The
ideal thing would be to have the new Xen start from scratch.  There's
an alternative idea here, though: *if* this is only during the setup
phase of the new Xen binary, we can perhaps get the allocations done
before pausing domains (i.e. in step 1 above).  That saves us time.
How this works for memory, and how much free memory we can expect to
have, is a question that can only be answered at runtime.  Ideally we
don't want to leave such systems behind.  So, getting creative with
serializing/deserializing such state is something I totally anticipate
having to do.  But don't tell David I said it once again...
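
For a rough sense of scale on the frame_table point (back-of-the-envelope only; 32 bytes per struct page_info is an assumption, the real size depends on architecture and Xen version):

    /*
     * 1 TiB of RAM at 4 KiB per frame is 2^28 pages; at an assumed
     * 32 bytes per struct page_info that is 2^33 bytes, i.e. 8 GiB of
     * frame_table on a single host.
     */
    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        uint64_t host_bytes   = 1ULL << 40;   /* 1 TiB host           */
        uint64_t pages        = host_bytes >> 12;
        uint64_t page_info_sz = 32;           /* assumed, per x86 Xen */

        printf("frame_table ~= %llu GiB\n",
               (unsigned long long)((pages * page_info_sz) >> 30));
        return 0;
    }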



Thread overview: 17+ messages
2019-07-15 18:57 [Xen-devel] Design session report: Live-Updating Xen Foerster, Leonard
2019-07-15 19:31 ` Sarah Newman
2019-07-16  3:48   ` Juergen Gross
2019-07-16  4:20     ` Sarah Newman
2019-07-16 22:27       ` Andrew Cooper
2019-07-18  9:00         ` Juergen Gross
2019-07-18  9:16           ` Paul Durrant
2019-07-18  9:40           ` Roger Pau Monné
2019-07-18  9:43             ` Juergen Gross
2019-07-16 23:51 ` Andrew Cooper
2019-07-17  7:09   ` Jan Beulich
2019-07-17 11:26     ` Andrew Cooper
2019-07-17 13:02       ` Jan Beulich
2019-07-17 18:40         ` Andrew Cooper
2019-07-18  9:15           ` Jan Beulich
2019-07-18 12:09             ` Amit Shah
2019-07-18  9:29 ` Paul Durrant
