xen-devel.lists.xenproject.org archive mirror
* [Xen-devel] Xen 4.14 and future work
@ 2019-12-02 19:51 Andrew Cooper
  2019-12-03  9:03 ` Durrant, Paul
                   ` (2 more replies)
  0 siblings, 3 replies; 7+ messages in thread
From: Andrew Cooper @ 2019-12-02 19:51 UTC (permalink / raw)
  To: Xen-devel List

Hello,

Now that 4.13 is on its way out of the door, it is time to look to
ongoing work.

We have a large backlog of speculation-related work.  For one, we still
don't virtualise MSR_ARCH_CAPS for guests, or use eIBRS ourselves in
Xen.  Therefore, while Xen does function on Cascade Lake, support is
distinctly suboptimal.

Similarly, AMD systems frequently fill /var/log with:

(XEN) emul-priv-op.c:1113:d0v13 Domain attempted WRMSR c0011020 from
0x0006404000000000 to 0x0006404000000400

which is an interaction between Linux's prctl() interface to disable
memory disambiguation on a per-process basis, Xen's write/discard
behaviour for MSRs, and the long-overdue series to properly virtualise
SSBD support on AMD hardware.  AMD Rome hardware, like Cascade Lake, has
certain hardware speculative mitigation features which need virtualising
for guests to make use of.


Similarly, there is plenty more work to do with core-aware scheduling,
and from my side of things, sane guest topology.  This will eventually
unblock one of the factors on the hard 128 vcpu limit for HVM guests.


Another big area is the stability of toolstack hypercalls.  This is a
crippling pain point for distros and upgradeability of systems, and
there is frankly no justifiable reason for the way we currently do
things.  The real reason is inertia from back in the days when Xen.git
(BitKeeper, as it was back then) contained a fork of every relevant
piece of software; that model is long-since obsolete, but is still
causing us pain.  I will follow up with a proposal in due course, but as
a one-liner, it will build on the dm_op() API model.

Likely included within this is making the domain/vcpu destroy paths
idempotent so we can fix a load of NULL pointer dereferences in Xen
caused by XEN_DOMCTL_max_vcpus not being part of XEN_DOMCTL_createdomain.

Other work in this area involves adding X86_EMUL_{VIRIDIAN,NESTED_VIRT}
to replace their existing problematic enablement interfaces.


A start needs to be made on a total rethink of the HVM ABI.  This has
come up repeatedly at previous dev summits, and is in desperate need of
having some work started on it.


Other areas in need of work are the boot time directmap at 0 (which
hides NULL pointer dereferences during boot), and the correct handling
of %dr6 for all kinds of guests.


Anyway, that's probably a good enough summary for now. 
Thoughts/comments welcome, especially if something on this list happens
to be a priority elsewhere and engineering effort can be put towards it.

~Andrew

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel


* Re: [Xen-devel] Xen 4.14 and future work
  2019-12-02 19:51 [Xen-devel] Xen 4.14 and future work Andrew Cooper
@ 2019-12-03  9:03 ` Durrant, Paul
  2019-12-03 17:37   ` Andrew Cooper
  2019-12-05 15:30 ` Andrew Cooper
  2020-01-09 15:36 ` Xia, Hongyan
  2 siblings, 1 reply; 7+ messages in thread
From: Durrant, Paul @ 2019-12-03  9:03 UTC (permalink / raw)
  To: Andrew Cooper, Xen-devel List

> -----Original Message-----
> From: Xen-devel <xen-devel-bounces@lists.xenproject.org> On Behalf Of
> Andrew Cooper
> Sent: 02 December 2019 19:52
> To: Xen-devel List <xen-devel@lists.xen.org>
> Subject: [Xen-devel] Xen 4.14 and future work
> 
> Hello,
> 
> Now that 4.13 is on its way out of the door, it is time to look to
> ongoing work.
> 
> We have a large backlog of speculation-related work.  For one, we still
> don't virtualise MSR_ARCH_CAPS for guests, or use eIBRS ourselves in
> Xen.  Therefore, while Xen does function on Cascade Lake, support is
> distinctly suboptimal.
> 
> Similarly, AMD systems frequently fill /var/log with:
> 
> (XEN) emul-priv-op.c:1113:d0v13 Domain attempted WRMSR c0011020 from
> 0x0006404000000000 to 0x0006404000000400
> 
> which is an interaction between Linux's prctl() interface to disable
> memory disambiguation on a per-process basis, Xen's write/discard
> behaviour for MSRs, and the long-overdue series to properly virtualise
> SSBD support on AMD hardware.  AMD Rome hardware, like Cascade Lake,
> has certain hardware speculative mitigation features which need
> virtualising for guests to make use of.
> 

I assume this would be addressed by the proposed cpuid/msr policy work? I think it is quite vital for Xen that we are able to migrate guests across pools of heterogeneous h/w, and therefore I'd like to see this done in 4.14 if possible.

> 
> Similarly, there is plenty more work to do with core-aware scheduling,
> and from my side of things, sane guest topology.  This will eventually
> unblock one of the factors on the hard 128 vcpu limit for HVM guests.
> 
> 
> Another big area is the stability of toolstack hypercalls.  This is a
> crippling pain point for distros and upgradeability of systems, and
> there is frankly no justifiable reason for the way we currently do
> things.  The real reason is inertia from back in the days when Xen.git
> (BitKeeper, as it was back then) contained a fork of every relevant
> piece of software; that model is long-since obsolete, but is still
> causing us pain.  I will follow up with a proposal in due course, but as
> a one-liner, it will build on the dm_op() API model.

This is also fairly vital for the work on live update of Xen (as discussed at the last dev summit). Any instability in the tools ABI will compromise hypervisor update, and fixing such issues on an ad-hoc basis as they arise is not really a desirable prospect.

> 
> Likely included within this is making the domain/vcpu destroy paths
> idempotent so we can fix a load of NULL pointer dereferences in Xen
> caused by XEN_DOMCTL_max_vcpus not being part of XEN_DOMCTL_createdomain.
> 
> Other work in this area involves adding X86_EMUL_{VIRIDIAN,NESTED_VIRT}
> to replace their existing problematic enablement interfaces.
> 

I think this should include deprecation of HVMOP_get/set_param as far as is possible (i.e. tools use)...

> 
> A start needs to be made on a total rethink of the HVM ABI.  This has
> come up repeatedly at previous dev summits, and is in desperate need of
> having some work started on it.
> 

...and completely in any new ABI.

I wonder to what extent we can provide a guest-side compat layer here, otherwise it would be hard to get traction I think.
There was an interesting talk at KVM Forum (https://sched.co/Tmuy) on dealing with emulation inside guest context by essentially re-injecting the VMEXITs back into the guest for pseudo-SMM code (loaded as part of the firmware blob) to deal with. I could imagine potentially using such a mechanism to have a 'legacy' hypercall translated to the new ABI, which would allow older guests to be supported unmodified (albeit with a performance penalty). Such a mechanism may also be useful as an alternative way of dealing with some of the emulation dealt with directly in Xen at the moment, to reduce the hypervisor attack surface e.g. stdvga caching, hpet, rtc... perhaps.

Cheers,

  Paul

* Re: [Xen-devel] Xen 4.14 and future work
  2019-12-03  9:03 ` Durrant, Paul
@ 2019-12-03 17:37   ` Andrew Cooper
  0 siblings, 0 replies; 7+ messages in thread
From: Andrew Cooper @ 2019-12-03 17:37 UTC (permalink / raw)
  To: Durrant, Paul, Xen-devel List

On 03/12/2019 09:03, Durrant, Paul wrote:
>> -----Original Message-----
>> From: Xen-devel <xen-devel-bounces@lists.xenproject.org> On Behalf Of
>> Andrew Cooper
>> Sent: 02 December 2019 19:52
>> To: Xen-devel List <xen-devel@lists.xen.org>
>> Subject: [Xen-devel] Xen 4.14 and future work
>>
>> Hello,
>>
>> Now that 4.13 is on its way out of the door, it is time to look to
>> ongoing work.
>>
>> We have a large backlog of speculation-related work.  For one, we still
>> don't virtualise MSR_ARCH_CAPS for guests, or use eIBRS ourselves in
>> Xen.  Therefore, while Xen does function on Cascade Lake, support is
>> distinctly suboptimal.
>>
>> Similarly, AMD systems frequently fill /var/log with:
>>
>> (XEN) emul-priv-op.c:1113:d0v13 Domain attempted WRMSR c0011020 from
>> 0x0006404000000000 to 0x0006404000000400
>>
>> which is an interaction between Linux's prctl() interface to disable
>> memory disambiguation on a per-process basis, Xen's write/discard
>> behaviour for MSRs, and the long-overdue series to properly virtualise
>> SSBD support on AMD hardware.  AMD Rome hardware, like Cascade Lake,
>> has certain hardware speculative mitigation features which need
>> virtualising for guests to make use of.
>>
> I assume this would be addressed by the proposed cpuid/msr policy work?

Yes.  The next task there is to plumb the CPUID policy through the libxc
migrate stream, coping with its absence from older sources.  This
(purposefully) breaks the dual purpose of the CPUID code in libxc for
both domain start and domain restore, and allows us to rewrite the
domain start logic without impacting migrating-in VMs.

Then, and only then, is it safe to add MSR_ARCH_CAPS into the guest
policies and start setting it up.

> I think it is quite vital for Xen that we are able to migrate guests across pools of heterogeneous h/w and therefore I'd like to see this done in 4.14 if possible.

Why do you think it was top of my list :)

>
>> Similarly, there is plenty more work to do with core-aware scheduling,
>> and from my side of things, sane guest topology.  This will eventually
>> unblock one of the factors on the hard 128 vcpu limit for HVM guests.
>>
>>
>> Another big area is the stability of toolstack hypercalls.  This is a
>> crippling pain point for distros and upgradeability of systems, and
>> there is frankly no justifiable reason for the way we currently do
>> things.  The real reason is inertia from back in the days when Xen.git
>> (BitKeeper, as it was back then) contained a fork of every relevant
>> piece of software; that model is long-since obsolete, but is still
>> causing us pain.  I will follow up with a proposal in due course, but as
>> a one-liner, it will build on the dm_op() API model.
> This is also fairly vital for the work on live update of Xen (as discussed at the last dev summit). Any instability in the tools ABI will compromise hypervisor update and fixing such issues on an ad-hoc basis as they arise is not really a desirable prospect.
>
>> Likely included within this is making the domain/vcpu destroy paths
>> idempotent so we can fix a load of NULL pointer dereferences in Xen
>> caused by XEN_DOMCTL_max_vcpus not being part of XEN_DOMCTL_createdomain.
>>
>> Other work in this area involves adding X86_EMUL_{VIRIDIAN,NESTED_VIRT}
>> to replace their existing problematic enablement interfaces.
>>
> I think this should include deprecation of HVMOP_get/set_param as far as is possible (i.e. tools use)...
>
>> A start needs to be made on a total rethink of the HVM ABI.  This has
>> come up repeatedly at previous dev summits, and is in desperate need of
>> having some work started on it.
>>
> ...and completely in any new ABI.

Both already in the plan(s).

> I wonder to what extent we can provide a guest-side compat layer here, otherwise it would be hard to get traction I think.

Step 1 of the design (deliberately) won't be concerned with guest
compatibility.  The single most important aspect is to come up with a
clean design which is not crippled by retaining compatibility for PV
guests, and without x86-isms leaking into other architectures.

Once a sensible design exists, we can go about figuring out how best to
enact it.  Most areas will be able to fit compatibility into existing
HVM guests, but some are going to have a very hard time.

> There was an interesting talk at KVM Forum (https://sched.co/Tmuy) on dealing with emulation inside guest context by essentially re-injecting the VMEXITs back into the guest for pseudo-SMM code (loaded as part of the firmware blob) to deal with. I could imagine potentially using such a mechanism to have a 'legacy' hypercall translated to the new ABI, which would allow older guests to be supported unmodified (albeit with a performance penalty). Such a mechanism may also be useful as an alternative way of dealing with some of the emulation dealt with directly in Xen at the moment, to reduce the hypervisor attack surface e.g. stdvga caching, hpet, rtc... perhaps.

I don't think this is relevant to the ABI discussion - it's not changing
anything in guest view.  I'm sure people will want it for other reasons,
and I don't see any issue with implementing it for existing HVM guests.

~Andrew


* Re: [Xen-devel] Xen 4.14 and future work
  2019-12-02 19:51 [Xen-devel] Xen 4.14 and future work Andrew Cooper
  2019-12-03  9:03 ` Durrant, Paul
@ 2019-12-05 15:30 ` Andrew Cooper
  2019-12-06 10:37   ` Durrant, Paul
  2020-01-09 15:36 ` Xia, Hongyan
  2 siblings, 1 reply; 7+ messages in thread
From: Andrew Cooper @ 2019-12-05 15:30 UTC (permalink / raw)
  To: Xen-devel List

On 02/12/2019 19:51, Andrew Cooper wrote:
> Hello,
>
> Now that 4.13 is on its way out of the door, it is time to look to
> ongoing work.

(and some more...)

* Shim: Removal of an M2P.

Within the shim, two M2Ps are constructed, and one of them is entirely
unused.  Both take up a decent chunk of memory, and contribute to
reduced packing density.

By inspecting the kernel width earlier during boot, we can avoid
creating the unneeded M2P.  This would require teaching Xen to operate
with only a compat p2m when running a 32bit guest, but this will be fine
to run with.

* CONFIG_PV32

Despite being deprecated in 64bit processors, we still use ring 1 for
32bit PV kernels.  A relic of a bygone era, this causes problems with
newer pagetable based features (SMEP and SMAP in particular), resulting
in complexity in the entry paths and a substantial performance penalty
(a CR4 write on the way in and out of 32bit guests).  There are also
some speculative mitigation actions we take unilaterally, just because
32bit PV guest kernels run on supervisor mappings rather than user
mappings (RSB-stuffing being the most obvious one).

From an attack surface point of view, being able to remove all ring 1
facilities would be great for deployments not intending to run 32bit PV
guests (and/or relegate them to PV-shim), and there will be some
(probably minor) performance gains for 64bit PV guests as a consequence
of the simplified entry paths.

(And who knows...  A combination of this and Paul's idea for emulation
thunk-ing in a pseudo SMM-like mode inside HVM guests could remove the
need for CONFIG_COMPAT entirely at L0.)

* CONFIG_PDX

Here is one I prepared earlier. 
https://andrewcoop-xen.readthedocs.io/en/docs-devel/misc/tech-debt.html#config-pdx

* CONFIG_$VENDOR

For restricted usecases, dropping one of AMD or INTEL would drop a
substantial chunk of code, and a sufficiently capable LTO compiler with
devirtualisation support could even remove the function pointers themselves.

More useful to the general project, however, would be RANDCONFIG's
ability to check and keep our vendor-specific interfaces clean.  They
most definitely are not right now.

~Andrew


* Re: [Xen-devel] Xen 4.14 and future work
  2019-12-05 15:30 ` Andrew Cooper
@ 2019-12-06 10:37   ` Durrant, Paul
  2019-12-06 10:39     ` Andrew Cooper
  0 siblings, 1 reply; 7+ messages in thread
From: Durrant, Paul @ 2019-12-06 10:37 UTC (permalink / raw)
  To: Andrew Cooper, Xen-devel List

> -----Original Message-----
> From: Xen-devel <xen-devel-bounces@lists.xenproject.org> On Behalf Of
> Andrew Cooper
> Sent: 05 December 2019 15:31
> To: Xen-devel List <xen-devel@lists.xen.org>
> Subject: Re: [Xen-devel] Xen 4.14 and future work
> 
> On 02/12/2019 19:51, Andrew Cooper wrote:
> > Hello,
> >
> > Now that 4.13 is on its way out of the door, it is time to look to
> > ongoing work.
> 
[snip]

/me remembers something else...

ISTR work was being done to replace minios stubdoms with something more modern. Is this continuing? AFAIK we are really only keeping qemu trad alive for stubdoms and it would be nice if we could finally retire it.

  Paul

* Re: [Xen-devel] Xen 4.14 and future work
  2019-12-06 10:37   ` Durrant, Paul
@ 2019-12-06 10:39     ` Andrew Cooper
  0 siblings, 0 replies; 7+ messages in thread
From: Andrew Cooper @ 2019-12-06 10:39 UTC (permalink / raw)
  To: Durrant, Paul, Xen-devel List

On 06/12/2019 10:37, Durrant, Paul wrote:
>> -----Original Message-----
>> From: Xen-devel <xen-devel-bounces@lists.xenproject.org> On Behalf Of
>> Andrew Cooper
>> Sent: 05 December 2019 15:31
>> To: Xen-devel List <xen-devel@lists.xen.org>
>> Subject: Re: [Xen-devel] Xen 4.14 and future work
>>
>> On 02/12/2019 19:51, Andrew Cooper wrote:
>>> Hello,
>>>
>>> Now that 4.13 is on its way out of the door, it is time to look to
>>> ongoing work.
> [snip]
>
> /me remembers something else...
>
> ISTR work was being done to replace minios stubdoms with something more modern. Is this continuing? AFAIK we are really only keeping qemu trad alive for stubdoms and it would be nice if we could finally retire it.

That will be down to Unikraft, and whether it is sufficiently capable to
build a replacement for the qemu stubdom.

But yes - being able to kill qemu-trad for good would be a great step
forwards.

~Andrew


* Re: [Xen-devel] Xen 4.14 and future work
  2019-12-02 19:51 [Xen-devel] Xen 4.14 and future work Andrew Cooper
  2019-12-03  9:03 ` Durrant, Paul
  2019-12-05 15:30 ` Andrew Cooper
@ 2020-01-09 15:36 ` Xia, Hongyan
  2 siblings, 0 replies; 7+ messages in thread
From: Xia, Hongyan @ 2020-01-09 15:36 UTC (permalink / raw)
  To: andrew.cooper3, xen-devel

On Mon, 2019-12-02 at 19:51 +0000, Andrew Cooper wrote:
> ...
> 
> Other areas in need of work are the boot time directmap at 0 (which
> hides NULL pointer dereferences during boot), and the correct handling
> of %dr6 for all kinds of guests.
> 

Sorry for the late reply to this thread. Talking about the directmap,
will we have time and engineering power to look at the direct map
removal series? I have been maintaining, testing and benchmarking it
since late 4.13 and it would be nice to have (part of) it in 4.14.

The bulk of the series is moving things from xenheap to domheap site by
site, which should not be difficult to review. The direct map is only
removed at the final stage of the series, so before that comes in,
domheap is still mapped via the direct map and there will not even be
performance differences. I am thinking that we could at least merge
most of the transitions to domheap in 4.14.

Hongyan
