* XenProject/XenServer QEMU working group, Friday 8th July, 2016, 15:00.
@ 2016-07-28 17:01 Jennifer Herbert
  2016-08-01 11:32 ` Device model operation hypercall (DMOP, re qemu depriv) Ian Jackson
  2016-09-09 16:18 ` XenProject/XenServer QEMU working group minutes, 30th August 2016 Jennifer Herbert
  0 siblings, 2 replies; 74+ messages in thread
From: Jennifer Herbert @ 2016-07-28 17:01 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, Ian Jackson, David Vrabel

[-- Attachment #2: qemu-col-notes.txt --]
[-- Type: text/plain, Size: 6135 bytes --]

XenProject/XenServer QEMU working group, Friday 8th July, 2016, 15:00.


Date and Attendees
==================

XenProject/XenServer QEMU working group, held on Friday the 8th of July,
2016 at 15:00.

The following were present:

Ian Jackson [platform team]
David Vrabel [ring0]
Andrew Cooper [ring0]
Jennifer Herbert [ring0]


Purpose of meeting
==================

Both XenServer (currently using qemu-trad with de-privileging) and
XenProject (using upstream QEMU without de-privileging) would like to
move to using upstream QEMU de-privileged.

This meeting was intended to restart XenServer/ring0 and XenProject's
collaboration.

Agenda includes:

* Discuss requirements
* Review our status and strategy for achieving these requirements.
* Agree next steps.

Meeting Actions
===============

Ian: Find all the Xenstore keys used by QEMU, and evaluate how much work
     it would be to either read them before privileges are dropped, or
     otherwise stop reading/writing them when de-privileged.

Ian: Write up the DMOP design (discussed during the meeting), and talk
     to Jan about this.

XenServer: Go through all the privcmd ops, and check how they would fit
     within the DMOP design.

David: Post the event channel / privcmd patch upstream.  (Now done)


Meeting Transcript - abridged
=============================


Discussed Goals
---------------

Everyone agrees we want to stop anything gaining control of QEMU from
also gaining access to the platform.

XenProject would like to use depriv by default.  This should not
preclude other configurations such as PCI pass-through.

Ian: There is a project in progress to use stub domains with QEMU –
     would expect a user to run depriv or stubdoms, but not both at the
     same time.
XS: States that it does not want to go down the stub domains route, as
    it is not considered scalable (boot storms, etc.).

Ian: In upstream we expect to pursue both depriv qemu, and stub qemu, as
   implementation strategies.  Depriv qemu is probably deliverable
   sooner.  Stub qemu is more secure but not suitable for all users
   (eg, maybe not for XenServer as David suggested).


Going through the technology areas listed in the document shared beforehand.

Disk / Network / Host IO
------------------------

Can be addressed by running as an unprivileged user.  Andrew: Rather
Linux-centric - not a POSIX solution for all these problems.  Ian: There
are similar interfaces on other platforms.  In upstream we expect to at
least be able to run as an unprivileged user, which is indeed portable.

Mem map
-------

XenServer has a solution.  Ian: Would like to see this shared
immediately.  Andrew: Consider rest of design first.

HyperCalls
----------

Three options:
    1: XSM
    2: Fix ABI: Re-arrange privcmd to make it easier to parse and
       restrict.
    3: Loadable pattern matching solution.  All agree this would be
       horrible to maintain.

Ian: There is a difference between not having a stable API, and the
     instability of the ABI preventing determination of the target domain.
Ian: Version compatibility out of scope.  Andrew:  Disagree, should do
     both together.

Andrew: Requirement: – no interdependence between QEMU and the kernel,
        such that you need to update both together.

All: Agree the Xen interface should not be embedded in the kernel.

Jen: Asked why Ian was against using XSM.
Ian: XSM wholesale replaces existing checks, and current XSM policies
     are unproven.
Andrew: XS is experimenting, and there are problems.
Ian & Andrew: Agree would need auditing.
Ian: Reluctant to link xsm to this work.
David: Can do both – XSM and 'Fix ABI'.
Ian: With both, wouldn't need the nesting.
David: Would still need the nesting to wrap domains.
Ian: Doesn't have to do all ….
David: <Unconvinced>
David: (wearing a Linux hat): Sceptical of a stable API – don't want to
need a new kernel when new API features are added.
Ian: Suggests the idea of DM-space – a D-ctrl space.
David: Would need to be a strong component.  Both ends would want
       commitment ~ no snowflakes.
Ian: Shouldn't be a problem.
Andrew: PVH would replace QEMU.   Ian: Not relevant.

The idea of DM-ops is discussed more:
    DM-op restricted to a domain.
    No need to version this.
    Important fields (domain) fixed, such that DM-ops can be added or
    changed without affecting the kernel interface.

Andrew brings up XenGT.   Is this considered part of QEMU?  QEMU
shouldn't set PCI BARs.
Question whether DM-op would just be HVM-op.
David: Should enumerate hypercalls, and make sure they are in DM-ops.
Andrew & Ian: Would need to discuss with Jan.
Some things may need to be in both DM-op and elsewhere, i.e. the guest
interface. - This is OK.

DM-ops are tentatively agreed on as a way forward.

XenStore Restrict.
------------------

Jen: Upstream is discussing maybe dropping sockets.  (Necessary for
    restrict.)
David: Problem is it gives the same permissions as the guest – does
    this restrict too much?

Should look through all uses, and see if they can be done before QEMU
drops privileges.

Ian: Already looked at the QEMU PV backend/DM split work.   - This was
     awkward, disjoint work.  I think this approach would work.

Jen: Brought up physmap keys in xenstore…  All agree this isn't nice.
Andrew: Mentions the memory accounting swamp he wants to sort out.
Others: Don't want to tie this work to that.

Ian: Guest can blow off its own head – physmap not great, but OK.

Need to get the Xenstore keys, to see the issues.

Ian: Happy to say everyone should move to xenstored, but need to find a
     way to multiplex this, such that it can be used from stub domains.

Need some sort of plan.
A new ring – not a good idea.  Alternative plan: a prefix command.
Andrew: A protocol command?
David: Does not like that idea.

Ian: A huge patch is needed to separate PV-QEMU from the DM part,
     running in separate QEMUs.
David: XenServer has no PV part, so XS dodges this issue.
Ian: Might be xenstore path issues, if not using the switch.

David: If we can remove all use of xenstore while de-privileged, we
       could avoid the need for xenstore changes.

Agreed someone would need to look through the remaining uses, to see how
much work this would be.






_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


* Device model operation hypercall (DMOP, re qemu depriv)
  2016-07-28 17:01 XenProject/XenServer QEMU working group, Friday 8th July, 2016, 15:00 Jennifer Herbert
@ 2016-08-01 11:32 ` Ian Jackson
  2016-08-01 12:41   ` Jan Beulich
                     ` (2 more replies)
  2016-09-09 16:18 ` XenProject/XenServer QEMU working group minutes, 30th August 2016 Jennifer Herbert
  1 sibling, 3 replies; 74+ messages in thread
From: Ian Jackson @ 2016-08-01 11:32 UTC (permalink / raw)
  To: Jan Beulich, Andrew Cooper, dgdegra, Wei Liu, George Dunlap,
	Stefano Stabellini, Tim Deegan, David Vrabel, Anthony Perard,
	Konrad Rzeszutek Wilk
  Cc: xen-devel

Introducing HVMCTL, Jan wrote:
> A long while back separating out all control kind operations (intended
> for use by only the control domain or device model) from the currect
> hvmop hypercall has been discussed. This series aims at finally making
> this reality (at once allowing to streamline the associated XSM checking).

I think we need to introduce a new hypercall (which I will call DMOP
for now) which may augment or replace some of HVMCTL.  Let me explain:


We would like to be able to deprivilege qemu-in-dom0.  This is
because qemu has a large attack surface and has a history of security
bugs.  If we get this right we can easily reduce the impact of `guest
can take over qemu' bugs to DoS; and perhaps with a bit of effort we
can eliminate the DoS too.  (qemu stubdom are another way to do this
but they have their own difficulties.)

A part of this plan has to be a way for qemu to make hypercalls
related to the guest it is servicing.  But qemu needs to be _unable_
to make _other_ hypercalls.

I see four possible approaches.  In IMO increasing order of
desirability:

1. We could simply patch the dom0 privcmd driver to know exactly which
   hypercalls are permitted.  This is obviously never going to work
   because there would have to be a massive table in the kernel, kept
   in step with Xen.  We could have a kind of pattern matching engine
   instead, and load the tables from userspace, but that's a daft
   edifice to be building (even if we reuse BPF or something) and a
   total pain to maintain.

2. We could have some kind of privileged proxy or helper process,
   which makes the hypercalls on instruction from qemu.  This would be
   quite complicated and involve a lot of back-and-forth parameter
   passing.  Like option 1, this arrangement would end up embedding
   detailed knowledge about which hypercalls are appropriate, and have
   to understand all of their parameters.

3. We could have the dom0 privcmd driver wrap each of qemu's
   hypercalls in a special "wrap up with different XSM tag" hypercall.
   Then, we could specify the set of allowable hypercalls with XSM.
   If we want qemu deprivileged by default, this depends on turning
   XSM on by default.  But we want qemu depriv ASAP and there are
   difficulties with XSM by default.  This approach also involves
   writing a large and hard-to-verify hypercall permission table, in
   the form of an XSM policy.

4. We could invent a new hypercall `DMOP' for hypercalls which device
   models should be able to use, which always has the target domain in
   a fixed location in the arguments.  We have the dom0 privcmd driver
   know about this one hypercall number and the location of the target
   domid.

Option 4 has the following advantages:

* The specification of which hypercalls are authorised to qemu is
  integrated with the specification of the hypercalls themselves:
  There is no need to maintain a separate table which can get out of
  step (or contain security bugs).

* The changes required to the rest of the system are fairly small.
  In particular:

* We need only one small, non-varying, patch to the dom0 kernel.


Let me flesh out option 4 in more detail:


We define a new hypercall DMOP.

Its first argument is always a target domid.  The DMOP hypercall
number and position of the target domid in the arguments are fixed.

A DMOP is defined to never put at risk the stability or security of
the whole system, nor of the domain which calls DMOP.  However, a DMOP
may have arbitrary effects on the target domid.

In the privcmd driver, we provide a new restriction ioctl, which takes
a domid parameter.  After that restriction ioctl is called, the
privcmd driver will permit only DMOP hypercalls, and only with the
specified target domid.

Since the hypercall number and the target domid are stable, this is a
simple check which will not need to be updated as new DMOPs are
defined (and old ones retired).

DMOPs are not available to guests (other than stub device model
domains) and do not form part of the guest-stable ABI.  Where the set
of operations provided through DMOPs overlaps with guest-stable
hypercalls, identical functionality must be provided through both
parts of the hypercall namespace.

Privileged toolstack software is permitted to use DMOPs as well as
other hypercalls, of course.  So there is no need to duplicate
functionality between DMOPs and non-stable privileged toolstack
hypercalls.


On ABI/API stability:

For this scheme to work, it is not essential that the DMOPs themselves
should have a stable ABI.

However, we do want to be able to decouple qemu versions from Xen
versions.  This could be done by having the relevant bit of libxc (let
us suppose libdevicemodel) be capable of driving multiple versions of
Xen.  Or by having different libdevicemodel versions, one for each
version of Xen, and some kind of ad-hoc select-the-right-library
arrangement to cope with dual booting.

Alternatively, old DMOP interfaces (ie, old DMOPs) could simply be
retained for a few Xen releases and then retired, providing a
semi-stable ABI to device model software.

In any case, probably the DMOP opcode needs to be a wide field so that
when new DMOPs, or new versions of old DMOPs, arise, we can assign
them new numbers.  (Alternatively we could have a version field in
every DMOP which is checked for equality, but that makes some
compatibility strategies more painful.)


What do people think ?

Thanks,
Ian.


* Re: Device model operation hypercall (DMOP, re qemu depriv)
  2016-08-01 11:32 ` Device model operation hypercall (DMOP, re qemu depriv) Ian Jackson
@ 2016-08-01 12:41   ` Jan Beulich
  2016-08-02 11:38     ` Wei Liu
  2016-08-02 11:37   ` Wei Liu
  2016-09-09 15:16   ` Jennifer Herbert
  2 siblings, 1 reply; 74+ messages in thread
From: Jan Beulich @ 2016-08-01 12:41 UTC (permalink / raw)
  To: Andrew Cooper, Ian Jackson
  Cc: Stefano Stabellini, Wei Liu, George Dunlap, Tim Deegan,
	David Vrabel, Anthony Perard, xen-devel, dgdegra

>>> On 01.08.16 at 13:32, <ian.jackson@eu.citrix.com> wrote:
> 4. We could invent a new hypercall `DMOP' for hypercalls which device
>    models should be able to use, which always has the target domain in
>    a fixed location in the arguments.  We have the dom0 privcmd driver
>    know about this one hypercall number and the location of the target
>    domid.
> 
> Option 4 has the following advantages:
> 
> * The specification of which hypercalls are authorised to qemu is
>   integrated with the specification of the hypercalls themselves:
>   There is no need to maintain a separate table which can get out of
>   step (or contain security bugs).
> 
> * The changes required to the rest of the system are fairly small.
>   In particular:
> 
> * We need only one small, non-varying, patch to the dom0 kernel.
> 
> 
> Let me flesh out option 4 in more detail:
> 
> 
> We define a new hypercall DMOP.
> 
> Its first argument is always a target domid.  The DMOP hypercall
> number and position of the target domid in the arguments are fixed.
> 
> A DMOP is defined to never put at risk the stability or security of
> the whole system, nor of the domain which calls DMOP.  However, a DMOP
> may have arbitrary effects on the target domid.

With the exception of this and the privcmd layer described below,
DMOP == HVMCTL afaics. The privcmd layer is independent anyway.
And the security aspect mentioned above won't disappear if we
use DMOP instead of HVMCTL. So I don't see why the hvmctl
series as is can't be the starting point of this, with the stability/
security concerns addressed subsequently, for being orthogonal.

Jan



* Re: Device model operation hypercall (DMOP, re qemu depriv)
  2016-08-01 11:32 ` Device model operation hypercall (DMOP, re qemu depriv) Ian Jackson
  2016-08-01 12:41   ` Jan Beulich
@ 2016-08-02 11:37   ` Wei Liu
  2016-08-02 11:42     ` George Dunlap
  2016-09-09 15:16   ` Jennifer Herbert
  2 siblings, 1 reply; 74+ messages in thread
From: Wei Liu @ 2016-08-02 11:37 UTC (permalink / raw)
  To: Ian Jackson
  Cc: Stefano Stabellini, Wei Liu, George Dunlap, Andrew Cooper,
	Tim Deegan, David Vrabel, Jan Beulich, Anthony Perard, xen-devel,
	dgdegra

On Mon, Aug 01, 2016 at 12:32:54PM +0100, Ian Jackson wrote:
> Introducing HVMCTL, Jan wrote:
> > A long while back separating out all control kind operations (intended
> > for use by only the control domain or device model) from the currect
> > hvmop hypercall has been discussed. This series aims at finally making
> > this reality (at once allowing to streamline the associated XSM checking).
> 
> I think we need to introduce a new hypercall (which I will call DMOP
> for now) which may augment or replace some of HVMCTL.  Let me explain:
> 
> 
> We would like to be able to deprivilege qemu-in-dom0.  This is
> because qemu has a large attack surface and has a history of security
> bugs.  If we get this right we can easily reduce the impact of `guest
> can take over qemu' bugs to DoS; and perhaps with a bit of effort we
> can eliminate the DoS too.  (qemu stubdom are another way to do this
> but they have their own difficulties.)
> 
> A part of this plan has to be a way for qemu to make hypercalls
> related to the guest it is servicing.  But qemu needs to be _unable_
> to make _other_ hypercalls.
> 
> I see four possible approaches.  In IMO increasing order of
> desirability:
> 
> 1. We could simply patch the dom0 privcmd driver to know exactly which
>    hypercalls are permitted.  This is obviously never going to work
>    because there would have to be a massive table in the kernel, kept
>    in step with Xen.  We could have a kind of pattern matching engine
>    instead, and load the tables from userspace, but that's a daft
>    edifice to be building (even if we reuse BPF or something) and a
>    total pain to maintain.
> 
> 2. We could have some kind of privileged proxy or helper process,
>    which makes the hypercalls on instruction from qemu.  This would be
>    quite complicated and involve a lot of back-and-forth parameter
>    passing.  Like option 1, this arrangement would end up embedding
>    detailed knowledge about which hypercalls are appropriate, and have
>    to understand all of their parameters.
> 
> 3. We could have the dom0 privcmd driver wrap each of qemu's
>    hypercalls in a special "wrap up with different XSM tag" hypercall.
>    Then, we could specify the set of allowable hypercalls with XSM.
>    If we want qemu deprivileged by default, this depends on turning
>    XSM on by default.  But we want qemu depriv ASAP and there are
>    difficulties with XSM by default.  This approach also involves
>    writing a large and hard-to-verify hypercall permission table, in
>    the form of an XSM policy.
> 
> 4. We could invent a new hypercall `DMOP' for hypercalls which device
>    models should be able to use, which always has the target domain in
>    a fixed location in the arguments.  We have the dom0 privcmd driver
>    know about this one hypercall number and the location of the target
>    domid.
> 
> Option 4 has the following advantages:
> 
> * The specification of which hypercalls are authorised to qemu is
>   integrated with the specification of the hypercalls themselves:
>   There is no need to maintain a separate table which can get out of
>   step (or contain security bugs).
> 
> * The changes required to the rest of the system are fairly small.
>   In particular:
> 
> * We need only one small, non-varying, patch to the dom0 kernel.
> 

I think your analysis makes sense.

> 
> Let me flesh out option 4 in more detail:
> 
> 
> We define a new hypercall DMOP.
> 
> Its first argument is always a target domid.  The DMOP hypercall
> number and position of the target domid in the arguments are fixed.
> 
> A DMOP is defined to never put at risk the stability or security of
> the whole system, nor of the domain which calls DMOP.  However, a DMOP
> may have arbitrary effects on the target domid.
> 

I would like to point out that this is non-trivial since we would need
to audit a lot of stuff.

But the requirement to audit interface is not unique to DMOP -- I expect
this is needed for any other approach.

> In the privcmd driver, we provide a new restriction ioctl, which takes
> a domid parameter.  After that restriction ioctl is called, the
> privcmd driver will permit only DMOP hypercalls, and only with the
> specified target domid.
> 

It is phrased as if the guest kernel is supposed to enforce the
policy?  Would it be possible to make Xen do it?  I don't think we
should trust the DM domain kernel here.

Wei.


* Re: Device model operation hypercall (DMOP, re qemu depriv)
  2016-08-01 12:41   ` Jan Beulich
@ 2016-08-02 11:38     ` Wei Liu
  2016-08-02 11:58       ` Jan Beulich
  2016-08-03 10:29       ` Ian Jackson
  0 siblings, 2 replies; 74+ messages in thread
From: Wei Liu @ 2016-08-02 11:38 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Stefano Stabellini, Wei Liu, George Dunlap, Andrew Cooper,
	Ian Jackson, Tim Deegan, David Vrabel, Anthony Perard, xen-devel,
	dgdegra

On Mon, Aug 01, 2016 at 06:41:20AM -0600, Jan Beulich wrote:
> >>> On 01.08.16 at 13:32, <ian.jackson@eu.citrix.com> wrote:
> > 4. We could invent a new hypercall `DMOP' for hypercalls which device
> >    models should be able to use, which always has the target domain in
> >    a fixed location in the arguments.  We have the dom0 privcmd driver
> >    know about this one hypercall number and the location of the target
> >    domid.
> > 
> > Option 4 has the following advantages:
> > 
> > * The specification of which hypercalls are authorised to qemu is
> >   integrated with the specification of the hypercalls themselves:
> >   There is no need to maintain a separate table which can get out of
> >   step (or contain security bugs).
> > 
> > * The changes required to the rest of the system are fairly small.
> >   In particular:
> > 
> > * We need only one small, non-varying, patch to the dom0 kernel.
> > 
> > 
> > Let me flesh out option 4 in more detail:
> > 
> > 
> > We define a new hypercall DMOP.
> > 
> > Its first argument is always a target domid.  The DMOP hypercall
> > number and position of the target domid in the arguments are fixed.
> > 
> > A DMOP is defined to never put at risk the stability or security of
> > the whole system, nor of the domain which calls DMOP.  However, a DMOP
> > may have arbitrary effects on the target domid.
> 
> With the exception of this and the privcmd layer described below,
> DMOP == HVMCTL afaics. The privcmd layer is independent anyway.
> And the security aspect mentioned above won't disappear if we
> use DMOP instead of HVMCTL. So I don't see why the hvmctl
> series as is can't be the starting point of this, with the stability/
> security concerns addressed subsequently, for being orthogonal.
> 

Yeah, to turn HVMCTL to DMOP:

1. s/HVMCTL/DMOP/
2. maybe s/interface_version//

I think we could at least do #1 and merge the series.

Wei.

> Jan
> 


* Re: Device model operation hypercall (DMOP, re qemu depriv)
  2016-08-02 11:37   ` Wei Liu
@ 2016-08-02 11:42     ` George Dunlap
  2016-08-02 12:34       ` Wei Liu
  0 siblings, 1 reply; 74+ messages in thread
From: George Dunlap @ 2016-08-02 11:42 UTC (permalink / raw)
  To: Wei Liu, Ian Jackson
  Cc: Stefano Stabellini, George Dunlap, Andrew Cooper, Tim Deegan,
	David Vrabel, Jan Beulich, Anthony Perard, xen-devel, dgdegra

On 02/08/16 12:37, Wei Liu wrote:
> On Mon, Aug 01, 2016 at 12:32:54PM +0100, Ian Jackson wrote:
>> Introducing HVMCTL, Jan wrote:
>>> A long while back separating out all control kind operations (intended
>>> for use by only the control domain or device model) from the currect
>>> hvmop hypercall has been discussed. This series aims at finally making
>>> this reality (at once allowing to streamline the associated XSM checking).
>>
>> I think we need to introduce a new hypercall (which I will call DMOP
>> for now) which may augment or replace some of HVMCTL.  Let me explain:
>>
>>
>> We would like to be able to deprivilege qemu-in-dom0.  This is
>> because qemu has a large attack surface and has a history of security
>> bugs.  If we get this right we can easily reduce the impact of `guest
>> can take over qemu' bugs to DoS; and perhaps with a bit of effort we
>> can eliminate the DoS too.  (qemu stubdom are another way to do this
>> but they have their own difficulties.)
>>
>> A part of this plan has to be a way for qemu to make hypercalls
>> related to the guest it is servicing.  But qemu needs to be _unable_
>> to make _other_ hypercalls.
>>
>> I see four possible approaches.  In IMO increasing order of
>> desirability:
>>
>> 1. We could simply patch the dom0 privcmd driver to know exactly which
>>    hypercalls are permitted.  This is obviously never going to work
>>    because there would have to be a massive table in the kernel, kept
>>    in step with Xen.  We could have a kind of pattern matching engine
>>    instead, and load the tables from userspace, but that's a daft
>>    edifice to be building (even if we reuse BPF or something) and a
>>    total pain to maintain.
>>
>> 2. We could have some kind of privileged proxy or helper process,
>>    which makes the hypercalls on instruction from qemu.  This would be
>>    quite complicated and involve a lot of back-and-forth parameter
>>    passing.  Like option 1, this arrangement would end up embedding
>>    detailed knowledge about which hypercalls are appropriate, and have
>>    to understand all of their parameters.
>>
>> 3. We could have the dom0 privcmd driver wrap each of qemu's
>>    hypercalls in a special "wrap up with different XSM tag" hypercall.
>>    Then, we could specify the set of allowable hypercalls with XSM.
>>    If we want qemu deprivileged by default, this depends on turning
>>    XSM on by default.  But we want qemu depriv ASAP and there are
>>    difficulties with XSM by default.  This approach also involves
>>    writing a large and hard-to-verify hypercall permission table, in
>>    the form of an XSM policy.
>>
>> 4. We could invent a new hypercall `DMOP' for hypercalls which device
>>    models should be able to use, which always has the target domain in
>>    a fixed location in the arguments.  We have the dom0 privcmd driver
>>    know about this one hypercall number and the location of the target
>>    domid.
>>
>> Option 4 has the following advantages:
>>
>> * The specification of which hypercalls are authorised to qemu is
>>   integrated with the specification of the hypercalls themselves:
>>   There is no need to maintain a separate table which can get out of
>>   step (or contain security bugs).
>>
>> * The changes required to the rest of the system are fairly small.
>>   In particular:
>>
>> * We need only one small, non-varying, patch to the dom0 kernel.
>>
> 
> I think your analysis makes sense.
> 
>>
>> Let me flesh out option 4 in more detail:
>>
>>
>> We define a new hypercall DMOP.
>>
>> Its first argument is always a target domid.  The DMOP hypercall
>> number and position of the target domid in the arguments are fixed.
>>
>> A DMOP is defined to never put at risk the stability or security of
>> the whole system, nor of the domain which calls DMOP.  However, a DMOP
>> may have arbitrary effects on the target domid.
>>
> 
> I would like to point out that this is non-trivial since we would need
> to audit a lot of stuff.
> 
> But the requirement to audit interface is not unique to DMOP -- I expect
> this is needed for any other approach.
> 
>> In the privcmd driver, we provide a new restriction ioctl, which takes
>> a domid parameter.  After that restriction ioctl is called, the
>> privcmd driver will permit only DMOP hypercalls, and only with the
>> specified target domid.
>>
> 
> It is phrased as if the guest kernel is supposed to enforce the
> policy?  Would it be possible to make Xen do it?  I don't think we
> should trust the DM domain kernel here.

The problem is that Xen doesn't know what process is running, and so
can't tell whether qemuA is accessing domainA's memory, or whether qemuB
is accessing domainA's memory.

The two options that have been proposed are:

1. Have a way for dom0 to give Xen an XSM tag for the current process
(so Xen can do the enforcing)

2. Have dom0 filter out the calls based on the fact that all the
hypercalls have the same template (i.e., domid in the same position).

Either way you are relying on dom0 ("trusting" dom0) to DTRT -- either
to do the filtering properly, or to give you the right XSM tag.

 -George





* Re: Device model operation hypercall (DMOP, re qemu depriv)
  2016-08-02 11:38     ` Wei Liu
@ 2016-08-02 11:58       ` Jan Beulich
  2016-08-02 13:02         ` David Vrabel
  2016-08-03 10:29       ` Ian Jackson
  1 sibling, 1 reply; 74+ messages in thread
From: Jan Beulich @ 2016-08-02 11:58 UTC (permalink / raw)
  To: Wei Liu
  Cc: Stefano Stabellini, George Dunlap, Andrew Cooper, Ian Jackson,
	Tim Deegan, David Vrabel, Anthony Perard, xen-devel, dgdegra

>>> On 02.08.16 at 13:38, <wei.liu2@citrix.com> wrote:
> On Mon, Aug 01, 2016 at 06:41:20AM -0600, Jan Beulich wrote:
>> >>> On 01.08.16 at 13:32, <ian.jackson@eu.citrix.com> wrote:
>> > 4. We could invent a new hypercall `DMOP' for hypercalls which device
>> >    models should be able to use, which always has the target domain in
>> >    a fixed location in the arguments.  We have the dom0 privcmd driver
>> >    know about this one hypercall number and the location of the target
>> >    domid.
>> > 
>> > Option 4 has the following advantages:
>> > 
>> > * The specification of which hypercalls are authorised to qemu is
>> >   integrated with the specification of the hypercalls themselves:
>> >   There is no need to maintain a separate table which can get out of
>> >   step (or contain security bugs).
>> > 
>> > * The changes required to the rest of the system are fairly small.
>> >   In particular:
>> > 
>> > * We need only one small, non-varying, patch to the dom0 kernel.
>> > 
>> > 
>> > Let me flesh out option 4 in more detail:
>> > 
>> > 
>> > We define a new hypercall DMOP.
>> > 
>> > Its first argument is always a target domid.  The DMOP hypercall
>> > number and position of the target domid in the arguments are fixed.
>> > 
>> > A DMOP is defined to never put at risk the stability or security of
>> > the whole system, nor of the domain which calls DMOP.  However, a DMOP
>> > may have arbitrary effects on the target domid.
>> 
>> With the exception of this and the privcmd layer described below,
>> DMOP == HVMCTL afaics. The privcmd layer is independent anyway.
>> And the security aspect mentioned above won't disappear if we
>> use DMOP instead of HVMCTL. So I don't see why the hvmctl
>> series as is can't be the starting point of this, with the stability/
>> security concerns addressed subsequently, for being orthogonal.
>> 
> 
> Yeah, to turn HVMCTL to DMOP:
> 
> 1. s/HVMCTL/DMOP/
> 2. maybe s/interface_version//

Andrew had brought up 2 too, but I'm really not sure that'd be a
good idea. I rather think we should keep it but maybe (other than
domctl/sysctl) recognize older versions. In any event I consider
having it better for an unstable interface (as Ian said, libxc is
supposed to provide the stable one).

Jan



* Re: Device model operation hypercall (DMOP, re qemu depriv)
  2016-08-02 11:42     ` George Dunlap
@ 2016-08-02 12:34       ` Wei Liu
  0 siblings, 0 replies; 74+ messages in thread
From: Wei Liu @ 2016-08-02 12:34 UTC (permalink / raw)
  To: George Dunlap
  Cc: Stefano Stabellini, Wei Liu, George Dunlap, Andrew Cooper,
	Ian Jackson, Tim Deegan, David Vrabel, Jan Beulich,
	Anthony Perard, xen-devel, dgdegra

On Tue, Aug 02, 2016 at 12:42:36PM +0100, George Dunlap wrote:
> On 02/08/16 12:37, Wei Liu wrote:
> > On Mon, Aug 01, 2016 at 12:32:54PM +0100, Ian Jackson wrote:
> >> Introducing HVMCTL, Jan wrote:
> >>> A long while back separating out all control kind operations (intended
> >>> for use by only the control domain or device model) from the current
> >>> hvmop hypercall has been discussed. This series aims at finally making
> >>> this reality (at once allowing to streamline the associated XSM checking).
> >>
> >> I think we need to introduce a new hypercall (which I will call DMOP
> >> for now) which may augment or replace some of HVMCTL.  Let me explain:
> >>
> >>
> >> We would like to be able to deprivilege qemu-in-dom0.  This is
> >> because qemu has a large attack surface and has a history of security
> >> bugs.  If we get this right we can easily reduce the impact of `guest
> >> can take over qemu' bugs to DoS; and perhaps with a bit of effort we
> >> can eliminate the DoS too.  (qemu stubdom are another way to do this
> >> but they have their own difficulties.)
> >>
> >> A part of this plan has to be a way for qemu to make hypercalls
> >> related to the guest it is servicing.  But qemu needs to be _unable_
> >> to make _other_ hypercalls.
> >>
> >> I see four possible approaches.  In IMO increasing order of
> >> desirability:
> >>
> >> 1. We could simply patch the dom0 privcmd driver to know exactly which
> >>    hypercalls are permitted.  This is obviously never going to work
> >>    because there would have to be a massive table in the kernel, kept
> >>    in step with Xen.  We could have a kind of pattern matching engine
> >>    instead, and load the tables from userspace, but that's a daft
> >>    edifice to be building (even if we reuse BPF or something) and a
> >>    total pain to maintain.
> >>
> >> 2. We could have some kind of privileged proxy or helper process,
> >>    which makes the hypercalls on instruction from qemu.  This would be
> >>    quite complicated and involve a lot of back-and-forth parameter
> >>    passing.  Like option 1, this arrangement would end up embedding
> >>    detailed knowledge about which hypercalls are appropriate, and have
> >>    to understand all of their parameters.
> >>
> >> 3. We could have the dom0 privcmd driver wrap each of qemu's
> >>    hypercalls in a special "wrap up with different XSM tag" hypercall.
> >>    Then, we could specify the set of allowable hypercalls with XSM.
> >>    If we want qemu deprivileged by default, this depends on turning
> >>    XSM on by default.  But we want qemu depriv ASAP and there are
> >>    difficulties with XSM by default.  This approach also involves
> >>    writing a large and hard-to-verify hypercall permission table, in
> >>    the form of an XSM policy.
> >>
> >> 4. We could invent a new hypercall `DMOP' for hypercalls which device
> >>    models should be able to use, which always has the target domain in
> >>    a fixed location in the arguments.  We have the dom0 privcmd driver
> >>    know about this one hypercall number and the location of the target
> >>    domid.
> >>
> >> Option 4 has the following advantages:
> >>
> >> * The specification of which hypercalls are authorised to qemu is
> >>   integrated with the specification of the hypercalls themselves:
> >>   There is no need to maintain a separate table which can get out of
> >>   step (or contain security bugs).
> >>
> >> * The changes required to the rest of the system are fairly small.
> >>   In particular:
> >>
> >> * We need only one small, non-varying, patch to the dom0 kernel.
> >>
> > 
> > I think your analysis makes sense.
> > 
> >>
> >> Let me flesh out option 4 in more detail:
> >>
> >>
> >> We define a new hypercall DMOP.
> >>
> >> Its first argument is always a target domid.  The DMOP hypercall
> >> number and position of the target domid in the arguments are fixed.
> >>
> >> A DMOP is defined to never put at risk the stability or security of
> >> the whole system, nor of the domain which calls DMOP.  However, a DMOP
> >> may have arbitrary effects on the target domid.
> >>
> > 
> > I would like to point out that this is non-trivial since we would need
> > to audit a lot of stuff.
> > 
> > But the requirement to audit the interface is not unique to DMOP -- I expect
> > this is needed for any other approach.
> > 
> >> In the privcmd driver, we provide a new restriction ioctl, which takes
> >> a domid parameter.  After that restriction ioctl is called, the
> >> privcmd driver will permit only DMOP hypercalls, and only with the
> >> specified target domid.
> >>
> > 
> > It is phrased as though the guest kernel is supposed to enforce the
> > policy?  Would it be possible to make Xen do it instead? I don't think
> > we should trust the DM domain kernel here.
> 
> The problem is that Xen doesn't know what process is running, and so
> can't tell whether qemuA is accessing domainA's memory, or whether qemuB
> is accessing domainA's memory.
> 
> The two options that have been proposed are:
> 
> 1. Have a way for dom0 to give Xen an XSM tag for the current process
> (so Xen can do the enforcing)
> 
> 2. Have dom0 filter out the calls based on the fact that all the
> hypercalls have the same template (i.e., domid in the same position).
> 
> Either way you are relying on dom0 ("trusting" dom0) to DTRT -- either
> to do the filtering properly, or to give you the right XSM tag.
> 

Right. I think that's what slipped my mind. Thanks for explaining!

Wei.

>  -George


* Re: Device model operation hypercall (DMOP, re qemu depriv)
  2016-08-02 11:58       ` Jan Beulich
@ 2016-08-02 13:02         ` David Vrabel
  2016-08-02 13:29           ` Jan Beulich
  0 siblings, 1 reply; 74+ messages in thread
From: David Vrabel @ 2016-08-02 13:02 UTC (permalink / raw)
  To: Jan Beulich, Wei Liu
  Cc: Stefano Stabellini, George Dunlap, Andrew Cooper, Ian Jackson,
	Tim Deegan, Anthony Perard, xen-devel, dgdegra

On 02/08/16 12:58, Jan Beulich wrote:
>>>> On 02.08.16 at 13:38, <wei.liu2@citrix.com> wrote:
>> On Mon, Aug 01, 2016 at 06:41:20AM -0600, Jan Beulich wrote:
>>>>>> On 01.08.16 at 13:32, <ian.jackson@eu.citrix.com> wrote:
>>>> 4. We could invent a new hypercall `DMOP' for hypercalls which device
>>>>    models should be able to use, which always has the target domain in
>>>>    a fixed location in the arguments.  We have the dom0 privcmd driver
>>>>    know about this one hypercall number and the location of the target
>>>>    domid.
>>>>
>>>> Option 4 has the following advantages:
>>>>
>>>> * The specification of which hypercalls are authorised to qemu is
>>>>   integrated with the specification of the hypercalls themselves:
>>>>   There is no need to maintain a separate table which can get out of
>>>>   step (or contain security bugs).
>>>>
>>>> * The changes required to the rest of the system are fairly small.
>>>>   In particular:
>>>>
>>>> * We need only one small, non-varying, patch to the dom0 kernel.
>>>>
>>>>
>>>> Let me flesh out option 4 in more detail:
>>>>
>>>>
>>>> We define a new hypercall DMOP.
>>>>
>>>> Its first argument is always a target domid.  The DMOP hypercall
>>>> number and position of the target domid in the arguments are fixed.
>>>>
>>>> A DMOP is defined to never put at risk the stability or security of
>>>> the whole system, nor of the domain which calls DMOP.  However, a DMOP
>>>> may have arbitrary effects on the target domid.
>>>
>>> With the exception of this and the privcmd layer described below,
>>> DMOP == HVMCTL afaics. The privcmd layer is independent anyway.
>>> And the security aspect mentioned above won't disappear if we
>>> use DMOP instead of HVMCTL. So I don't see why the hvmctl
>>> series as is can't be the starting point of this, with the stability/
>>> security concerns addressed subsequently, for being orthogonal.
>>>
>>
>> Yeah, to turn HVMCTL to DMOP:
>>
>> 1. s/HVMCTL/DMOP/
>> 2. maybe s/interface_version//
> 
> Andrew had brought up 2 too, but I'm really not sure that'd be a
> good idea. I rather think we should keep it but maybe (other than
> domctl/sysctl) recognize older versions. In any event I consider
> keeping it the better option for an unstable interface (as Ian said,
> libxc is supposed to provide the stable one).

A stable user space library API is no good for an in-kernel emulator,
like that needed for Intel GVT-g -- the hypercall ABI needs to be stable.

David


* Re: Device model operation hypercall (DMOP, re qemu depriv)
  2016-08-02 13:02         ` David Vrabel
@ 2016-08-02 13:29           ` Jan Beulich
  0 siblings, 0 replies; 74+ messages in thread
From: Jan Beulich @ 2016-08-02 13:29 UTC (permalink / raw)
  To: David Vrabel
  Cc: Stefano Stabellini, Wei Liu, George Dunlap, Andrew Cooper,
	Ian Jackson, Tim Deegan, Anthony Perard, xen-devel, dgdegra

>>> On 02.08.16 at 15:02, <david.vrabel@citrix.com> wrote:
> On 02/08/16 12:58, Jan Beulich wrote:
>>>>> On 02.08.16 at 13:38, <wei.liu2@citrix.com> wrote:
>>> On Mon, Aug 01, 2016 at 06:41:20AM -0600, Jan Beulich wrote:
>>>>>>> On 01.08.16 at 13:32, <ian.jackson@eu.citrix.com> wrote:
>>>>> 4. We could invent a new hypercall `DMOP' for hypercalls which device
>>>>>    models should be able to use, which always has the target domain in
>>>>>    a fixed location in the arguments.  We have the dom0 privcmd driver
>>>>>    know about this one hypercall number and the location of the target
>>>>>    domid.
>>>>>
>>>>> Option 4 has the following advantages:
>>>>>
>>>>> * The specification of which hypercalls are authorised to qemu is
>>>>>   integrated with the specification of the hypercalls themselves:
>>>>>   There is no need to maintain a separate table which can get out of
>>>>>   step (or contain security bugs).
>>>>>
>>>>> * The changes required to the rest of the system are fairly small.
>>>>>   In particular:
>>>>>
>>>>> * We need only one small, non-varying, patch to the dom0 kernel.
>>>>>
>>>>>
>>>>> Let me flesh out option 4 in more detail:
>>>>>
>>>>>
>>>>> We define a new hypercall DMOP.
>>>>>
>>>>> Its first argument is always a target domid.  The DMOP hypercall
>>>>> number and position of the target domid in the arguments are fixed.
>>>>>
>>>>> A DMOP is defined to never put at risk the stability or security of
>>>>> the whole system, nor of the domain which calls DMOP.  However, a DMOP
>>>>> may have arbitrary effects on the target domid.
>>>>
>>>> With the exception of this and the privcmd layer described below,
>>>> DMOP == HVMCTL afaics. The privcmd layer is independent anyway.
>>>> And the security aspect mentioned above won't disappear if we
>>>> use DMOP instead of HVMCTL. So I don't see why the hvmctl
>>>> series as is can't be the starting point of this, with the stability/
>>>> security concerns addressed subsequently, for being orthogonal.
>>>>
>>>
>>> Yeah, to turn HVMCTL to DMOP:
>>>
>>> 1. s/HVMCTL/DMOP/
>>> 2. maybe s/interface_version//
>> 
>> Andrew had brought up 2 too, but I'm really not sure that'd be a
>> good idea. I rather think we should keep it but maybe (other than
>> domctl/sysctl) recognize older versions. In any event I consider
>> keeping it the better option for an unstable interface (as Ian said,
>> libxc is supposed to provide the stable one).
> 
> A stable user space library API is no good for an in-kernel emulator,
> like that needed for Intel GVT-g -- the hypercall ABI needs to be stable.

I'm pretty certain only a (perhaps small) subset of the proposed new
operations would be needed by them, which we could then consider
marking stable.

Jan



* Re: Device model operation hypercall (DMOP, re qemu depriv)
  2016-08-02 11:38     ` Wei Liu
  2016-08-02 11:58       ` Jan Beulich
@ 2016-08-03 10:29       ` Ian Jackson
  2016-08-03 12:03         ` Jan Beulich
  1 sibling, 1 reply; 74+ messages in thread
From: Ian Jackson @ 2016-08-03 10:29 UTC (permalink / raw)
  To: Wei Liu
  Cc: Stefano Stabellini, George Dunlap, Andrew Cooper, Tim Deegan,
	David Vrabel, Jan Beulich, Anthony Perard, xen-devel, dgdegra

Wei Liu writes ("Re: Device model operation hypercall (DMOP, re qemu depriv)"):
> On Mon, Aug 01, 2016 at 06:41:20AM -0600, Jan Beulich wrote:
> > > A DMOP is defined to never put at risk the stability or security of
> > > the whole system, nor of the domain which calls DMOP.  However, a DMOP
> > > may have arbitrary effects on the target domid.
> > 
> > With the exception of this and the privcmd layer described below,
> > DMOP == HVMCTL afaics. The privcmd layer is independent anyway.
> > And the security aspect mentioned above won't disappear if we
> > use DMOP instead of HVMCTL. So I don't see why the hvmctl
> > series as is can't be the starting point of this, with the stability/
> > security concerns addressed subsequently, for being orthogonal.

I don't (currently) have a clear understanding of how my proposed DMOP
relates to HVMCTL.

I thought it useful to set out the DMOP proposal from first
principles, with clear motivation, discussion of not-chosen
alternatives, and of course with a clear statement of the principles
of operation and of the security design.

The security property I have quoted above is absolutely critical to
the DMOP proposal.  I'm a bit concerned by comments like the above
`with the exception of this' (which seems to refer to the security
property).

Earlier during one of the HVMCTL threads I asked

    This is a slight digression, but is it intended that all of these
    hvmctl's are safe to expose to a deprivileged device model process in
    dom0, or to a device model stub domain ?

Jan replied:

    Yes, afaict (they've been exposed the same way before).

Does that mean that functionality exposed by all the proposed HVMCTLs
is currently available to stubdoms ?

> Yeah, to turn HVMCTL to DMOP:
> 
> 1. s/HVMCTL/DMOP/
> 2. maybe s/interface_version//

Well, that would certainly be nice.  But there are some caveats I
would like sorting out.

> >  So I don't see why the hvmctl series as is can't be the starting
> > point of this, with the stability/ security concerns addressed
> > subsequently, for being orthogonal.

Please don't misunderstand me as trying to compete with or block
your HVMCTL work.  It may well be that HVMCTL is what I want, but:

If we adopt the design principles I describe in my DMOP proposal, I
don't think the security concerns are separable.

ISTM that a patch series introducing DMOP should start with a patch
which introduces the DMOP hypercall, with no sub-operations.

Such a patch would have code content very like that in
  [PATCH 01/11] public / x86: introduce hvmctl hypercall

But, such a patch should also explain the semantics.  The Xen public
headers ought to contain explanations of the promises that the
hypervisor makes about DMOP.  Importantly:
 - the promise that a DMOP cannot harm anyone except the target domid
 - the ABI stability of the target domid field
 - what the ABI stability policy is wrt the actual DMOPs themselves

If the 01/ patch contains such promises, then logically the 02/ patch
which introduces the first DMOP is extending that promise to that
operation.  It is at that point that the security decision should be
made.

Now, there may be other ways to represent/record the security status.
But it will be necessary to either (i) avoid violating the DMOP
security promise, by making questionable calls not available via DMOP
or (ii) trying to retrofit the security promise to DMOP later.

I think (ii) is not a good approach.  It would amount to introducing a
whole new set of interfaces, and then later trying to redefine them to
have a particular security property which was not originally there.

Ian.


* Re: Device model operation hypercall (DMOP, re qemu depriv)
  2016-08-03 10:29       ` Ian Jackson
@ 2016-08-03 12:03         ` Jan Beulich
  2016-08-03 13:37           ` Ian Jackson
  0 siblings, 1 reply; 74+ messages in thread
From: Jan Beulich @ 2016-08-03 12:03 UTC (permalink / raw)
  To: Ian Jackson
  Cc: Stefano Stabellini, Wei Liu, George Dunlap, Andrew Cooper,
	Tim Deegan, David Vrabel, Anthony Perard, xen-devel, dgdegra

>>> On 03.08.16 at 12:29, <ian.jackson@eu.citrix.com> wrote:
> Wei Liu writes ("Re: Device model operation hypercall (DMOP, re qemu 
> depriv)"):
>> On Mon, Aug 01, 2016 at 06:41:20AM -0600, Jan Beulich wrote:
>> > > A DMOP is defined to never put at risk the stability or security of
>> > > the whole system, nor of the domain which calls DMOP.  However, a DMOP
>> > > may have arbitrary effects on the target domid.
>> > 
>> > With the exception of this and the privcmd layer described below,
>> > DMOP == HVMCTL afaics. The privcmd layer is independent anyway.
>> > And the security aspect mentioned above won't disappear if we
>> > use DMOP instead of HVMCTL. So I don't see why the hvmctl
>> > series as is can't be the starting point of this, with the stability/
>> > security concerns addressed subsequently, for being orthogonal.
> 
> I don't (currently) have a clear understanding of how my proposed DMOP
> relates to HVMCTL.
> 
> I thought it useful to set out the DMOP proposal from first
> principles, with clear motivation, discussion of not-chosen
> alternatives, and of course with a clear statement of the principles
> of operation and of the security design.

Okay; nevertheless I did get the feeling that some of this was
prompted by the hvmctl series posting.

> The security property I have quoted above is absolutely critical to
> the DMOP proposal.  I'm a bit concerned by comments like the above
> `with the exception of this' (which seems to refer to the security
> property).

Indeed it does.

> Earlier during one of the HVMCTL threads I asked
> 
>     This is a slight digression, but is it intended that all of these
>     hvmctl's are safe to expose to a deprivileged device model process in
>     dom0, or to a device model stub domain ?
> 
> Jan replied:
> 
>     Yes, afaict (they've been exposed the same way before).
> 
> Does that mean that functionality exposed by all the proposed HVMCTLs
> is currently available to stubdoms ?

That series only moves code from one hypercall to another (new) one,
without any security implications at all. What has been available to
stubdoms prior to that series will be available the same way once it
got applied.

>> >  So I don't see why the hvmctl series as is can't be the starting
>> > point of this, with the stability/ security concerns addressed
>> > subsequently, for being orthogonal.
> 
> Please don't misunderstand me as trying to compete with or block
> your HVMCTL work.  It may well be that HVMCTL is what I want, but:
> 
> If we adopt the design principles I describe in my DMOP proposal, I
> don't think the security concerns are separable.
> 
> ISTM that a patch series introducing DMOP should start with a patch
> which introduces the DMOP hypercall, with no sub-operations.
> 
> Such a patch would have code content very like that in
>   [PATCH 01/11] public / x86: introduce hvmctl hypercall
> 
> But, such a patch should also explain the semantics.  The Xen public
> headers ought to contain explanations of the promises that the
> hypervisor makes about DMOP.  Importantly:
>  - the promise that a DMOP cannot harm anyone except the target domid
>  - the ABI stability of the target domid field
>  - what the ABI stability policy is wrt the actual DMOPs themselves

Well, none of that was the original goal of the series; some of this
could be merged in.

> If the 01/ patch contains such promises, then logically the 02/ patch
> which introduces the first DMOP is extending that promise to that
> operation.  It is at that point that the security decision should be
> made.

Correct. Yet again the original goal of the series was just proper
separation of two groups of operations that should never have
been all thrown under the same hypercall.

> Now, there may be other ways to represent/record the security status.
> But it will be necessary to either (i) avoid violating the DMOP
> security promise, by making questionable calls not available via DMOP
> or (ii) trying to retrofit the security promise to DMOP later.
> 
> I think (ii) is not a good approach.  It would amount to introducing a
> whole new set of interfaces, and then later trying to redefine them to
> have a particular security property which was not originally there.

I agree that (i) would be the better approach, but I don't think I
can assess when I would find the time to do the required auditing
of all involved code. Plus I don't see the difference between going
the (ii) route with the presented hvmctl series vs keeping things as
they are (under hvmop) - in both cases the security promise will
need to be retrofitted onto existing code.

Jan


* Re: Device model operation hypercall (DMOP, re qemu depriv)
  2016-08-03 12:03         ` Jan Beulich
@ 2016-08-03 13:37           ` Ian Jackson
  2016-08-03 14:16             ` Jan Beulich
  0 siblings, 1 reply; 74+ messages in thread
From: Ian Jackson @ 2016-08-03 13:37 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Stefano Stabellini, Wei Liu, George Dunlap, Andrew Cooper,
	Tim Deegan, David Vrabel, Anthony Perard, xen-devel, dgdegra

Jan Beulich writes ("Re: Device model operation hypercall (DMOP, re qemu depriv)"):
> On 03.08.16 at 12:29, <ian.jackson@eu.citrix.com> wrote:
> > I thought it useful to set out the DMOP proposal from first
> > principles, with clear motivation, discussion of not-chosen
> > alternatives, and of course with a clear statement of the principles
> > of operation and of the security design.
> 
> Okay; nevertheless I did get the feeling that some of this was
> prompted by the hvmctl series posting.

Well, I think the HVMCTL series is an important part of this
conversation - even though this DMOP idea is motivated by the qemu
depriv proposals (initiated mostly by Stefano in the last release
cycle).

> > Does that mean that functionality exposed by all the proposed HVMCTLs
> > is currently available to stubdoms ?
> 
> That series only moves code from one hypercall to another (new) one,
> without any security implications at all. What has been available to
> stubdoms prior to that series will be available the same way once it
> got applied.

So what you mean is that in HVMCTL, the privilege check is retained in
the individual HVMCTL sub-operation ?  (Ie what used to be IS_PRIV or
IS_PRIV_FOR - and implicitly, some sub-ops would be IS_PRIV and some
IS_PRIV_FOR.)

But looking at your v2 01/11, I see this in do_hvmctl:

   +    rc = xsm_hvm_control(XSM_DM_PRIV, d, op.cmd);
   +    if ( rc )
   +    {
   +        rcu_unlock_domain(d);
   +        return rc;
   +    }

And looking at v2 02/11, I didn't see any privilege check in the
specific hypercall.

With DMOP it would make sense for the privilege check to be
IS_PRIV_FOR, in the DMOP dispatcher.  But it seems that this privilege
check already exists in HVMCTL in the right form.

So AFAICT HVMCTLs already guarantee not to have an adverse impact on
the whole system.  If that weren't the case, then a stub dm could
exploit the situation.

Is the audit that is required to check that the DMOP doesn't have an
adverse effect on the _calling domain_?  AFAICT most HVMCTLs/DMOPs
have /no/ effect on the calling domain, other than as part of argument
passing.  So that audit should be easy.

So I am confused.  What am I missing ?

> > If the 01/ patch contains such promises, then logically the 02/ patch
> > which introduces the first DMOP is extending that promise to that
> > operation.  It is at that point that the security decision should be
> > made.
> 
> Correct. Yet again the original goal of the series was just proper
> separation of two groups of operations that should never have
> been all thrown under the same hypercall.

What I am doing is presenting another, additional goal.  I don't
dispute the usefulness of the HVMCTL cleanup.  But I want to
understand our future intentions.

In particular, as a proponent of the DMOP idea, I want to avoid the
situation that my idea is blocked somehow by a conflicting series.

> > Now, there may be other ways to represent/record the security status.
> > But it will be necessary to either (i) avoid violating the DMOP
> > security promise, by making questionable calls not available via DMOP
> > or (ii) trying to retrofit the security promise to DMOP later.
> > 
> > I think (ii) is not a good approach.  It would amount to introducing a
> > whole new set of interfaces, and then later trying to redefine them to
> > have a particular security property which was not originally there.
> 
> I agree that (i) would be the better approach, but I don't think I
> can assess when I would find the time to do the required auditing
> of all involved code. Plus I don't see the difference between going
> the (ii) route with the presented hvmctl series vs keeping things as
> they are (under hvmop) - in both cases the security promise will
> need to be retrofitted onto existing code.

If we don't apply HVMCTL, we can introduce DMOP and then individually
move hypercalls from hvmop to DMOP as they are audited.

Would a similar approach be acceptable even after HVMCTL ?

That is, the following plan:

1. Apply HVMCTL right away.  This solves the cleanup problem,
   but leaves the qemu depriv problem unsolved.

2. After the necessary discussion to understand and refine the DMOP
   design, create a new DMOP hypercall with appropriate security
   promises and whatever stability promises are agreed, but with no
   sub-ops.

3. Move each HVMCTL to DMOP, one by one, as it is audited.

4. Perhaps some HVMCTLs will remain which are inherently unsuitable
   for use with qemu depriv.  If not, perhaps eventually delete HVMCTL
   entirely, or leave it empty (with no subops).

This has the downside that we end up moving all the DMOPs twice.  But
it does allow us to separate the audit work from the cleanup/reorg.

Regards,
Ian.


* Re: Device model operation hypercall (DMOP, re qemu depriv)
  2016-08-03 13:37           ` Ian Jackson
@ 2016-08-03 14:16             ` Jan Beulich
  2016-08-03 14:21               ` George Dunlap
  0 siblings, 1 reply; 74+ messages in thread
From: Jan Beulich @ 2016-08-03 14:16 UTC (permalink / raw)
  To: Ian Jackson
  Cc: Stefano Stabellini, Wei Liu, George Dunlap, Andrew Cooper,
	Tim Deegan, David Vrabel, Anthony Perard, xen-devel, dgdegra

>>> On 03.08.16 at 15:37, <ian.jackson@eu.citrix.com> wrote:
> Jan Beulich writes ("Re: Device model operation hypercall (DMOP, re qemu 
> depriv)"):
>> On 03.08.16 at 12:29, <ian.jackson@eu.citrix.com> wrote:
>> > Does that mean that functionality exposed by all the proposed HVMCTLs
>> > is currently available to stubdoms ?
>> 
>> That series only moves code from one hypercall to another (new) one,
>> without any security implications at all. What has been available to
>> stubdoms prior to that series will be available the same way once it
>> got applied.
> 
> So what you mean is that in HVMCTL, the privilege check is retained in
> the individual HVMCTL sub-operation ?  (Ie what used to be IS_PRIV or
> IS_PRIV_FOR - and implicitly, some sub-ops would be IS_PRIV and some
> IS_PRIV_FOR.)
> 
> But looking at your v2 01/11, I see this in do_hvmctl:
> 
>    +    rc = xsm_hvm_control(XSM_DM_PRIV, d, op.cmd);
>    +    if ( rc )
>    +    {
>    +        rcu_unlock_domain(d);
>    +        return rc;
>    +    }
> 
> And looking at v2 02/11, I didn't see any privilege check in the
> specific hypercall.

Of course - there were no explicit IS_PRIV{,_FOR}() uses before
(i.e. the patch also doesn't remove any); these all sit behind the
XSM hook. And no, there are no per-sub-op privilege checks,
they're being consolidated to just the one you quote above.

> With DMOP it would make sense for the privilege check to be
> IS_PRIV_FOR, in the DMOP dispatcher.  But it seems that this privilege
> check already exists in HVMCTL in the right form.
> 
> So AFAICT HVMCTLs already guarantee not to have an adverse impact on
> the whole system.  If that weren't the case, then a stub dm could
> exploit the situation.

And to tell you the truth, I'm not entirely convinced that all the
code implementing those operations (be it in their current hvmop
form or the new hvmctl one - again, the series only moves code
around) is really safe in this regard. But with there even being
at least one domctl not on xsm-flask.txt's safe-for-disaggregation
list, but reportedly used by qemu (I don't recall right now which
exact one it is), stubdoms can't be considered fully secure right
now anyway.

> Is the audit that is required to check that the DMOP doesn't have an
> adverse effect on the _calling domain_?  AFAICT most HVMCTLs/DMOPs
> have /no/ effect on the calling domain, other than as part of argument
> passing.  So that audit should be easy.
> 
> So I am confused.  What am I missing ?

The main adverse effect on the whole system that I can see
would be long latency operations, but I can't exclude others: Just
look at the final patch of the series, which fixes a serialization bug
which I noticed while doing the code movement. I don't think lack
of proper serialization is guaranteed to only affect the target
domain.

>> > Now, there may be other ways to represent/record the security status.
>> > But it will be necessary to either (i) avoid violating the DMOP
>> > security promise, by making questionable calls not available via DMOP
>> > or (ii) trying to retrofit the security promise to DMOP later.
>> > 
>> > I think (ii) is not a good approach.  It would amount to introducing a
>> > whole new set of interfaces, and then later trying to redefine them to
>> > have a particular security property which was not originally there.
>> 
>> I agree that (i) would be the better approach, but I don't think I
>> can assess when I would find the time to do the required auditing
>> of all involved code. Plus I don't see the difference between going
>> the (ii) route with the presented hvmctl series vs keeping things as
>> they are (under hvmop) - in both cases the security promise will
>> need to be retrofitted onto existing code.
> 
> If we don't apply HVMCTL, we can introduce DMOP and then individually
> move hypercalls from hvmop to DMOP as they are audited.
> 
> Would a similar approach be acceptable even after HVMCTL ?
> 
> That is, the following plan:
> 
> 1. Apply HVMCTL right away.  This solves the cleanup problem,
>    but leaves the qemu depriv problem unsolved.
> 
> 2. After the necessary discussion to understand and refine the DMOP
>    design, create a new DMOP hypercall with appropriate security
>    promises and whatever stability promises are agreed, but with no
>    sub-ops.
> 
> 3. Move each HVMCTL to DMOP, one by one, as it is audited.
> 
> 4. Perhaps some HVMCTLs will remain which are inherently unsuitable
>    for use with qemu depriv.  If not, perhaps eventually delete HVMCTL
>    entirely, or leave it empty (with no subops).
> 
> This has the downside that we end up moving all the DMOPs twice.  But
> it does allow us to separate the audit work from the cleanup/reorg.

Well, moving things twice doesn't sound attractive. I was rather
thinking of declaring hvmctl (or dmop, if that's the intended name)
sub-ops secure one by one as they get audited. Part of my reason
to be hesitant to do such an audit myself (apart from the time
constraint) is me remembering my failure to do so properly as a
follow-up to XSA-77 (which needed to be corrected a couple of
months back), plus not really having got any help with that back
then (which arguably might end up better here, as this appears to
be an area of wider interest).

Jan

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


* Re: Device model operation hypercall (DMOP, re qemu depriv)
  2016-08-03 14:16             ` Jan Beulich
@ 2016-08-03 14:21               ` George Dunlap
  2016-08-03 16:10                 ` Ian Jackson
  0 siblings, 1 reply; 74+ messages in thread
From: George Dunlap @ 2016-08-03 14:21 UTC (permalink / raw)
  To: Jan Beulich, Ian Jackson
  Cc: Stefano Stabellini, Wei Liu, George Dunlap, Andrew Cooper,
	Tim Deegan, David Vrabel, Anthony Perard, xen-devel, dgdegra

On 03/08/16 15:16, Jan Beulich wrote:
>>>> On 03.08.16 at 15:37, <ian.jackson@eu.citrix.com> wrote:
>> Jan Beulich writes ("Re: Device model operation hypercall (DMOP, re qemu 
>> depriv)"):
>>> On 03.08.16 at 12:29, <ian.jackson@eu.citrix.com> wrote:
>>>> Does that mean that functionality exposed by all the proposed HVMCTLs
>>>> is currently available to stubdoms ?
>>>
>>> That series only moves code from one hypercall to another (new) one,
>>> without any security implications at all. What has been available to
>>> stubdoms prior to that series will be available the same way once it
>>> got applied.
>>
>> So what you mean is that in HVMCTL, the privilege check is retained in
>> the individual HVMCTL sub-operation ?  (Ie what used to be IS_PRIV or
>> IS_PRIV_FOR - and implicitly, some sub-ops would be IS_PRIV and some
>> IS_PRIV_FOR.)
>>
>> But looking at your v2 01/11, I see this in do_hvmctl:
>>
>>    +    rc = xsm_hvm_control(XSM_DM_PRIV, d, op.cmd);
>>    +    if ( rc )
>>    +    {
>>    +        rcu_unlock_domain(d);
>>    +        return rc;
>>    +    }
>>
>> And looking at v2 02/11, I didn't see any privilege check in the
>> specific hypercall.
> 
> Of course - there were no explicit IS_PRIV{,_FOR}() uses before
> (i.e. the patch also doesn't remove any), these all sit behind the
> XSM hook. And no, there are no per-sub-op privilege checks,
> they're being consolidated to just the one you quote above.
> 
>> With DMOP it would make sense for the privilege check to be
>> IS_PRIV_FOR, in the DMOP dispatcher.  But it seems that this privilege
>> check already exists in HVMCTL in the right form.
>>
>> So AFAICT HVMCTLs already guarantee not to have an adverse impact on
>> the whole system.  If that weren't the case, then a stub dm could
>> exploit the situation.
> 
> And to tell you the truth, I'm not entirely convinced that all the
> code implementing those operations (be it in their current hvmop
> form or the new hvmctl one - again, the series only moves code
> around) is really safe in this regard. But with there even being
> at least one domctl not on xsm-flask.txt's safe-for-disaggregation
> list, but reportedly used by qemu (I don't recall right now which
> exact one it is), stubdom-s can't be considered fully secure right
> now anyway.
> 
>> Is the audit that is required, to check that the DMOP doesn't have an
>> adverse effect on the _calling domain_ ?  AFAICT most HVMCTLs/DMOPs
>> have /no/ effect on the calling domain, other than as part of argument
>> passing.  So that audit should be easy.
>>
>> So I am confused.  What am I missing ?
> 
> The main adverse effect on the whole system that I can see
> would be long latency operations, but I can't exclude others: Just
> look at the final patch of the series, which fixes a serialization bug
> which I noticed while doing the code movement. I don't think lack
> of proper serialization is guaranteed to only affect the target
> domain.
> 
>>>> Now, there may be other ways to represent/record the security status.
>>>> But it will be necessary to either (i) avoid violating the DMOP
>>>> security promise, by making questionable calls not available via DMOP
>>>> or (ii) trying to retrofit the security promise to DMOP later.
>>>>
>>>> I think (ii) is not a good approach.  It would amount to introducing a
>>>> whole new set of interfaces, and then later trying to redefine them to
>>>> have a particular security property which was not originally there.
>>>
>>> I agree that (i) would be the better approach, but I don't think I
>>> can assess when I would find the time to do the required auditing
>>> of all involved code. Plus I don't see the difference between going
>>> the (ii) route with the presented hvmctl series vs keeping things as
>>> they are (under hvmop) - in both cases the security promise will
>>> need to be retrofit onto existing code.
>>
>> If we don't apply HVMCTL, we can introduce DMOP and then individually
>> move hypercalls from hvmop to DMOP as they are audited.
>>
>> Would a similar approach be acceptable even after HVMCTL ?
>>
>> That is, the following plan:
>>
>> 1. Apply HVMCTL right away.  This solves the cleanup problem,
>>    but leaves the qemu depriv problem unsolved.
>>
>> 2. After the necessary discussion to understand and refine the DMOP
>>    design, create a new DMOP hypercall with appropriate security
>>    promises and whatever stability promises are agreed, but with no
>>    sub-ops.
>>
>> 3. Move each HVMCTL to DMOP, one by one, as it is audited.
>>
>> 4. Perhaps some HVMCTLs will remain which are inherently unsuitable
>>    for use with qemu depriv.  If not, perhaps eventually delete HVMCTL
>>    entirely, or leave it empty (with no subops).
>>
>> This has the downside that we end up moving all the DMOPs twice.  But
>> it does allow us to separate the audit work from the cleanup/reorg.
> 
> Well, moving things twice doesn't sound attractive. I was rather
> thinking of declaring hvmctl (or dmop, if that's the intended name)
> sub-ops secure one by one as they get audited. Part of my reason
> to be hesitant to do such an audit myself (apart from the time
> constraint) is me remembering my failure to do so properly as a
> follow-up to XSA-77 (which needed to be corrected a couple of
> months back), plus not really having got any help with that back
> then (which arguably might end up better here, as this appears to
> be an area of wider interest).

There is certainly a wider interest, and I don't think anyone was
suggesting that you would be doing any of the work we're discussing to
allow qemu depriv.

But for the qemu depriv perspective, we want filtering to happen at
dom0, and we want dom0 to have a very simple rule for knowing whether to
allow a call or not: "Is the hypercall equal to the DMOP hypercall?"

So before qemu depriv can be usable, *all* the HVMCTL operations would
need to be audited, and those that were deemed insecure would need to be
either fixed or removed.

 -George
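George's filtering rule can be illustrated with a minimal sketch. Everything here is invented for illustration — the hypercall number, the `privcmd_ctx` structure, and the `privcmd_allow` helper are stand-ins, not the real Linux privcmd interface:

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed hypercall number for the proposed DMOP; purely illustrative. */
#define DMOP_HYPERCALL 41

/* Per-process state a dom0 privcmd driver might keep after a device
 * model drops privileges to serve exactly one guest. */
struct privcmd_ctx {
    bool restricted;    /* has this process been deprivileged? */
    int  target_domid;  /* the only domain it may operate on */
};

/* The "very simple rule": a restricted caller may only issue the DMOP
 * hypercall, and only against its own target domain. */
bool privcmd_allow(const struct privcmd_ctx *ctx, int hypercall, int domid)
{
    if (!ctx->restricted)
        return true;                    /* fully privileged caller */
    if (hypercall != DMOP_HYPERCALL)
        return false;                   /* only the audited hypercall */
    return domid == ctx->target_domid;  /* and only for its own guest */
}
```

The point of the rule is that dom0's kernel never needs to understand individual sub-ops: once a process is restricted, a single hypercall-number comparison plus a domid match decides everything.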


* Re: Device model operation hypercall (DMOP, re qemu depriv)
  2016-08-03 14:21               ` George Dunlap
@ 2016-08-03 16:10                 ` Ian Jackson
  2016-08-03 16:18                   ` Jan Beulich
  0 siblings, 1 reply; 74+ messages in thread
From: Ian Jackson @ 2016-08-03 16:10 UTC (permalink / raw)
  To: George Dunlap
  Cc: Stefano Stabellini, Wei Liu, George Dunlap, Andrew Cooper,
	Tim Deegan, David Vrabel, Jan Beulich, Anthony Perard, xen-devel,
	dgdegra

George Dunlap writes ("Re: Device model operation hypercall (DMOP, re qemu depriv)"):
> So before qemu depriv can be usable, *all* the HVMCTL operations would
> need to be audited, and those that were deemed insecure would need to be
> either fixed or removed.

Even worse, the bad HVMCTLs would be retrospectively turned into
security-bugs-in-old-hypervisors.  I don't think this is tenable.

Ian.


* Re: Device model operation hypercall (DMOP, re qemu depriv)
  2016-08-03 16:10                 ` Ian Jackson
@ 2016-08-03 16:18                   ` Jan Beulich
  2016-08-04 11:21                     ` Ian Jackson
  0 siblings, 1 reply; 74+ messages in thread
From: Jan Beulich @ 2016-08-03 16:18 UTC (permalink / raw)
  To: Ian Jackson
  Cc: Stefano Stabellini, Wei Liu, George Dunlap, Andrew Cooper,
	Tim Deegan, George Dunlap, David Vrabel, Anthony Perard,
	xen-devel, dgdegra

>>> On 03.08.16 at 18:10, <ian.jackson@eu.citrix.com> wrote:
> George Dunlap writes ("Re: Device model operation hypercall (DMOP, re qemu 
> depriv)"):
>> So before qemu depriv can be usable, *all* the HVMCTL operations would
>> need to be audited, and those that were deemed insecure would need to be
>> either fixed or removed.
> 
> Even worse, the bad HVMCTLs would be retrospectively turned into
> security-bugs-in-old-hypervisors.  I don't think this is tenable.

How would a bug in the respective current hvmop then not be a
security issue as well?

Jan


* Re: Device model operation hypercall (DMOP, re qemu depriv)
  2016-08-03 16:18                   ` Jan Beulich
@ 2016-08-04 11:21                     ` Ian Jackson
  2016-08-04 13:24                       ` Jan Beulich
  0 siblings, 1 reply; 74+ messages in thread
From: Ian Jackson @ 2016-08-04 11:21 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Stefano Stabellini, Wei Liu, Andrew Cooper, Tim Deegan,
	George Dunlap, David Vrabel, Anthony Perard, xen-devel, dgdegra

Jan Beulich writes ("Re: Device model operation hypercall (DMOP, re qemu depriv)"):
> On 03.08.16 at 18:10, <ian.jackson@eu.citrix.com> wrote:
> > George Dunlap writes ("Re: Device model operation hypercall (DMOP, re qemu 
> > depriv)"):
> >> So before qemu depriv can be usable, *all* the HVMCTL operations would
> >> need to be audited, and those that were deemed insecure would need to be
> >> either fixed or removed.
> > 
> > Even worse, the bad HVMCTLs would be retrospectively turned into
> > security-bugs-in-old-hypervisors.  I don't think this is tenable.
> 
> How would a bug in the respective current hvmop then not be a
> security issue as well?

AFAICT there are two kinds of possible bug:

1. An HVMCTL (or hvmop) that can have an adverse effect on the whole
system.  Such bugs would already be security bugs, covered by our
security support policy.  Such a bug would be a security bug whether
the operation is an hvmop or an HVMCTL or a DMOP.

2. An HVMCTL (or hvmop) that can have an adverse effect on the calling
domain.  Such bugs are not currently security bugs.  But the qemu
depriv project requires them to be removed.  Such a bug is a
security bug if it is a DMOP[1] but not otherwise.

By "is a security bug if it is a DMOP" I mean "is a security bug if
exposed via a hypercall which promises the security property necessary
for DMOP, even if that hypercall is called something else".

Strictly speaking, in order to move existing hypercalls into a new
DMOP hypercall, no further auditing is required for bugs of class 1
(other than that the privilege check is correct, ie that the new DMOP
is indeed currently available to stub device models).

But in order to move hypercalls into a new DMOP hypercall, they need
to be checked for bugs of class 2.

If we move some hypercalls which have potential bugs of class 2 from
hvmop to HVMCTL, and then later think we want to create something like
DMOP, we have a number of choices:

We can invent a new DMOP hypercall and start moving everything from
HVMCTL to DMOP as we audit it.

We could audit every HVMCTL, recording the audit status in-tree
somewhere, moving hypercalls with class 2 problems out of HVMCTL to a
new hypercall, and eventually, after we have audited everything, we
could change HVMCTL's hypercall number (and maybe rename it DMOP).

What we cannot do is audit every HVMCTL, fix the class 2 problems, and
then declare HVMCTL to have the relevant security property, and
implement corresponding code in dom0's privcmd drivers which relies on
the security property.  This is because the dom0 privcmd driver
doesn't know whether the HVMCTLs it is allowing not-fully-trusted
userspace to make are actually trustworthy (with the specific
hypervisor version in question.)

Ian.


* Re: Device model operation hypercall (DMOP, re qemu depriv)
  2016-08-04 11:21                     ` Ian Jackson
@ 2016-08-04 13:24                       ` Jan Beulich
  2016-08-05 16:28                         ` Ian Jackson
  0 siblings, 1 reply; 74+ messages in thread
From: Jan Beulich @ 2016-08-04 13:24 UTC (permalink / raw)
  To: Ian Jackson
  Cc: Stefano Stabellini, Wei Liu, Andrew Cooper, Tim Deegan,
	George Dunlap, David Vrabel, Anthony Perard, xen-devel, dgdegra

>>> On 04.08.16 at 13:21, <ian.jackson@eu.citrix.com> wrote:
> What we cannot do is audit every HVMCTL, fix the class 2 problems, and
> then declare HVMCTL to have the relevant security property, and
> implement corresponding code in dom0's privcmd drivers which relies on
> the security property.  This is because the dom0 privcmd driver
> doesn't know whether the HVMCTLs it is allowing not-fully-trusted
> userspace to make are actually trustworthy (with the specific
> hypervisor version in question.)

I continue to not really understand this argumentation: Dom0's
privcmd driver doesn't really matter here. If there's a bug in
something qemu uses, this is a problem no matter whether that
operation gets called through the to-be-added privcmd logic, or
straight from a stubdom qemu. Both are less than fully privileged.
What do I continue to be missing?

Jan


* Re: Device model operation hypercall (DMOP, re qemu depriv)
  2016-08-04 13:24                       ` Jan Beulich
@ 2016-08-05 16:28                         ` Ian Jackson
  2016-08-08 11:18                           ` Jan Beulich
  0 siblings, 1 reply; 74+ messages in thread
From: Ian Jackson @ 2016-08-05 16:28 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Stefano Stabellini, Wei Liu, Andrew Cooper, Tim Deegan,
	George Dunlap, David Vrabel, Anthony Perard, xen-devel, dgdegra

Jan Beulich writes ("Re: Device model operation hypercall (DMOP, re qemu depriv)"):
> On 04.08.16 at 13:21, <ian.jackson@eu.citrix.com> wrote:
> > What we cannot do is audit every HVMCTL, fix the class 2 problems, and
> > then declare HVMCTL to have the relevant security property, and
> > implement corresponding code in dom0's privcmd drivers which relies on
> > the security property.  This is because the dom0 privcmd driver
> > doesn't know whether the HVMCTLs it is allowing not-fully-trusted
> > userspace to make are actually trustworthy (with the specific
> > hypervisor version in question.)
> 
> I continue to not really understand this argumentation: Dom0's
> privcmd driver doesn't really matter here. If there's a bug in
> something qemu uses, this is a problem no matter whether that
> operation gets called through the to-be-added privcmd logic, or
> straight from a stubdom qemu. Both are less than fully privileged.
> What do I continue to be missing?

Let me try again.  Earlier I wrote:

  AFAICT there are two kinds of possible bug:

  1. An HVMCTL (or hvmop) that can have an adverse effect on the whole
  system.  Such bugs would already be security bugs, covered by our
  security support policy.  Such a bug would be a security bug whether
  the operation is an hvmop or an HVMCTL or a DMOP.

  2. An HVMCTL (or hvmop) that can have an adverse effect on the calling
  domain.  Such bugs are not currently security bugs.  But the qemu
  depriv project requires them to be removed.  Such a bug is a
  security bug if it is a DMOP[1] but not otherwise.

Bugs of class 1 are already security bugs.  They can already be
exploited by stub device models.

Bugs of class 2 are only security bugs if we allow unprivileged
callers within a privileged domain to use the corresponding hypercall.

That is, a bug of class 2 would allow the unprivileged qemu process in
dom0 to cause damage to other parts of dom0.

Bugs of class 2 are of no interest to an attacker who has control of a
stub device model, because all it allows them to do is damage the
device model domain (which they already control).

Ian.


* Re: Device model operation hypercall (DMOP, re qemu depriv)
  2016-08-05 16:28                         ` Ian Jackson
@ 2016-08-08 11:18                           ` Jan Beulich
  2016-08-08 13:46                             ` Ian Jackson
  0 siblings, 1 reply; 74+ messages in thread
From: Jan Beulich @ 2016-08-08 11:18 UTC (permalink / raw)
  To: Ian Jackson
  Cc: Stefano Stabellini, Wei Liu, Andrew Cooper, Tim Deegan,
	George Dunlap, David Vrabel, Anthony Perard, xen-devel, dgdegra

>>> On 05.08.16 at 18:28, <ian.jackson@eu.citrix.com> wrote:
> Jan Beulich writes ("Re: Device model operation hypercall (DMOP, re qemu 
> depriv)"):
>> On 04.08.16 at 13:21, <ian.jackson@eu.citrix.com> wrote:
>> > What we cannot do is audit every HVMCTL, fix the class 2 problems, and
>> > then declare HVMCTL to have the relevant security property, and
>> > implement corresponding code in dom0's privcmd drivers which relies on
>> > the security property.  This is because the dom0 privcmd driver
>> > doesn't know whether the HVMCTLs it is allowing not-fully-trusted
>> > userspace to make are actually trustworthy (with the specific
>> > hypervisor version in question.)
>> 
>> I continue to not really understand this argumentation: Dom0's
>> privcmd driver doesn't really matter here. If there's a bug in
>> something qemu uses, this is a problem no matter whether that
>> operation gets called through the to-be-added privcmd logic, or
>> straight from a stubdom qemu. Both are less than fully privileged.
>> What do I continue to be missing?
> 
> Let me try again.  Earlier I wrote:
> 
>   AFAICT there are two kinds of possible bug:
> 
>   1. An HVMCTL (or hvmop) that can have an adverse effect on the whole
>   system.  Such bugs would already be security bugs, covered by our
>   security support policy.  Such a bug would be a security bug whether
>   the operation is an hvmop or an HVMCTL or a DMOP.
> 
>   2. An HVMCTL (or hvmop) that can have an adverse effect on the calling
>   domain.  Such bugs are not currently security bugs.  But the qemu
>   depriv project requires them to be removed.  Such a bug is a
>   security bug if it is a DMOP[1] but not otherwise.
> 
> Bugs of class 1 are already security bugs.  They can already be
> exploited by stub device models.
> 
> Bugs of class 2 are only security bugs if we allow unprivileged
> callers within a privileged domain to use the corresponding hypercall.
> 
> That is, a bug of class 2 would allow the unprivileged qemu process in
> dom0 to cause damage to other parts of dom0.
> 
> Bugs of class 2 are of no interest to an attacker who has control of a
> stub device model, because all it allows them to do is damage the
> device model domain (which they already control).

Ah, okay, I think I finally understand. I'm sorry for being dense.

I'm having, however, a hard time imagining a class 2 bug for any
of the hvmop-s that are being converted by the hvmctl series:
These act on the target domain, so would not touch the calling
one's state other than for copying argument structures to/from
guest memory (and I don't view mixing up domain pointers as
a likely source of problems). Any other problem they might
reasonably have would then affect the system as a whole (class
1) or just the target domain (non-security bug).

Jan


* Re: Device model operation hypercall (DMOP, re qemu depriv)
  2016-08-08 11:18                           ` Jan Beulich
@ 2016-08-08 13:46                             ` Ian Jackson
  2016-08-08 14:07                               ` Jan Beulich
  2016-08-09 10:29                               ` Jan Beulich
  0 siblings, 2 replies; 74+ messages in thread
From: Ian Jackson @ 2016-08-08 13:46 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Stefano Stabellini, Wei Liu, Andrew Cooper, Tim Deegan,
	George Dunlap, David Vrabel, Anthony Perard, xen-devel, dgdegra

Jan Beulich writes ("Re: Device model operation hypercall (DMOP, re qemu depriv)"):
> On 05.08.16 at 18:28, <ian.jackson@eu.citrix.com> wrote:
> > That is, a bug of class 2 would allow the unprivileged qemu process in
> > dom0 to cause damage to other parts of dom0.
...
> Ah, okay, I think I finally understand. [...]
> 
> I'm having, however, a hard time imagining a class 2 bug for any
> of the hvmop-s that are being converted by the hvmctl series:
> These act on the target domain, so would not touch the calling
> one's state other than for copying argument structures to/from
> guest memory (and I don't view mixing up domain pointers as
> a likely source of problems).

Right.  AIUI all the hypercall arguments are passed using
calling-guest-virtual addresses, which the dom0 privcmd arrangements
are capable of ensuring are sane.

> Any other problem they might
> reasonably have would then affect the system as a whole (class
> 1) or just the target domain (non-security bug).

So would it therefore be OK to introduce the enhanced security promise
- the lack of `class 2' bugs - for HVMCTL from the beginning ?

This would involve a small amount of extra thought for each individual
hypercall, just to check that the assumptions we are relying on (as
you put them above) are not violated.

If so then good, but I think we still need to have a proper
conversation about how we are going to provide ABI stability to qemu
in practice.

Ian.


* Re: Device model operation hypercall (DMOP, re qemu depriv)
  2016-08-08 13:46                             ` Ian Jackson
@ 2016-08-08 14:07                               ` Jan Beulich
  2016-08-26 11:38                                 ` Ian Jackson
  2016-08-09 10:29                               ` Jan Beulich
  1 sibling, 1 reply; 74+ messages in thread
From: Jan Beulich @ 2016-08-08 14:07 UTC (permalink / raw)
  To: Ian Jackson
  Cc: Stefano Stabellini, Wei Liu, Andrew Cooper, Tim Deegan,
	George Dunlap, David Vrabel, Anthony Perard, xen-devel, dgdegra

>>> On 08.08.16 at 15:46, <ian.jackson@eu.citrix.com> wrote:
> Jan Beulich writes ("Re: Device model operation hypercall (DMOP, re qemu 
> depriv)"):
>> I'm having, however, a hard time imagining a class 2 bug for any
>> of the hvmop-s that are being converted by the hvmctl series:
>> These act on the target domain, so would not touch the calling
>> one's state other than for copying argument structures to/from
>> guest memory (and I don't view mixing up domain pointers as
>> a likely source of problems).
> 
> Right.  AIUI all the hypercall arguments are passed using
> calling-guest-virtual addresses, which the dom0 privcmd arrangements
> are capable of ensuring are sane.
> 
>> Any other problem they might
>> reasonably have would then affect the system as a whole (class
>> 1) or just the target domain (non-security bug).
> 
> So would it therefore be OK to introduce the enhanced security promise
> - the lack of `class 2' bugs - for HVMCTL from the beginning ?

I think so, since ...

> This would involve a small amount of extra thought for each individual
> hypercall, just to check that the assumptions we are relying on (as
> you put them above) are not violated.

... this looks to be a manageable amount of code auditing (albeit
I'd like to see whether someone else can perhaps come up with
some potential, more realistic kind of bug that could fall into class
2 before volunteering to make an attempt at doing such an audit).

> If so then good, but I think we still need to have a proper
> conversation about how we are going to provide ABI stability to qemu
> in practice.

Yes, the ABI stability aspect still needs to be agreed upon.
Since relaxation is easier than tightening, I'm going to keep the
patches as they are in this regard, i.e. assuming an unstable
hypercall ABI.

Jan


* Re: Device model operation hypercall (DMOP, re qemu depriv)
  2016-08-08 13:46                             ` Ian Jackson
  2016-08-08 14:07                               ` Jan Beulich
@ 2016-08-09 10:29                               ` Jan Beulich
  2016-08-09 10:48                                 ` Ian Jackson
  2016-08-15 14:50                                 ` David Vrabel
  1 sibling, 2 replies; 74+ messages in thread
From: Jan Beulich @ 2016-08-09 10:29 UTC (permalink / raw)
  To: Ian Jackson
  Cc: Stefano Stabellini, Wei Liu, Andrew Cooper, Tim Deegan,
	George Dunlap, David Vrabel, Anthony Perard, xen-devel, dgdegra

>>> On 08.08.16 at 15:46, <ian.jackson@eu.citrix.com> wrote:
> Jan Beulich writes ("Re: Device model operation hypercall (DMOP, re qemu 
> depriv)"):
>> On 05.08.16 at 18:28, <ian.jackson@eu.citrix.com> wrote:
>> > That is, a bug of class 2 would allow the unprivileged qemu process in
>> > dom0 to cause damage to other parts of dom0.
> ...
>> Ah, okay, I think I finally understand. [...]
>> 
>> I'm having, however, a hard time imagining a class 2 bug for any
>> of the hvmop-s that are being converted by the hvmctl series:
>> These act on the target domain, so would not touch the calling
>> one's state other than for copying argument structures to/from
>> guest memory (and I don't view mixing up domain pointers as
>> a likely source of problems).
> 
> Right.  AIUI all the hypercall arguments are passed using
> calling-guest-virtual addresses, which the dom0 privcmd arrangements
> are capable of ensuring are sane.

Actually, having thought about this some more, and taking this
together with the expectations to the privcmd driver previously
outlined, I think this part is problematic: If all the driver is to know
is the position (within the interface structure) of the target domain
ID, then any guest handles embedded in the interface structure
(XEN_HVMCTL_track_dirty_vram only for now) couldn't get
validated, and hence user mode code would have a way to access
or modify kernel memory.

Jan
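The problem Jan describes — privcmd knowing only the offset of the target domid inside an otherwise opaque sub-op structure — can be sketched as follows. The structure layout and helper are invented for illustration, loosely modelled on the track_dirty_vram op:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sub-op layout: privcmd is told where the domid lives,
 * but knows nothing else about the structure's contents. */
struct dirty_vram_op {
    uint16_t domid;        /* privcmd can find and check this ... */
    uint64_t first_pfn;
    uint64_t nr;
    uint64_t dirty_bitmap; /* ... but this guest handle passes through
                            * unvalidated: it could point anywhere,
                            * including at kernel memory. */
};

/* All an offset-only privcmd driver can do with an opaque op:
 * read the domid at the advertised offset. */
uint16_t privcmd_extract_domid(const void *op, size_t domid_offset)
{
    return *(const uint16_t *)((const char *)op + domid_offset);
}
```

The domid is recoverable, but the embedded `dirty_bitmap` handle is indistinguishable from any other integer field, which is exactly why unprivileged user code could smuggle a kernel address through it.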


* Re: Device model operation hypercall (DMOP, re qemu depriv)
  2016-08-09 10:29                               ` Jan Beulich
@ 2016-08-09 10:48                                 ` Ian Jackson
  2016-08-09 11:30                                   ` Jan Beulich
  2016-08-15 14:50                                 ` David Vrabel
  1 sibling, 1 reply; 74+ messages in thread
From: Ian Jackson @ 2016-08-09 10:48 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Stefano Stabellini, Wei Liu, Andrew Cooper, Tim Deegan,
	George Dunlap, David Vrabel, Anthony Perard, xen-devel, dgdegra

Jan Beulich writes ("Re: Device model operation hypercall (DMOP, re qemu depriv)"):
> Actually, having thought about this some more, and taking this
> together with the expectations to the privcmd driver previously
> outlined, I think this part is problematic: If all the driver is to know
> is the position (within the interface structure) of the target domain
> ID, then any guest handles embedded in the interface structure
> (XEN_HVMCTL_track_dirty_vram only for now) couldn't get
> validated, and hence user mode code would have a way to access
> or modify kernel memory.

Could the hypervisor know the difference between user and kernel
memory, in principle ?

Alternatively, would it be possible for the ABI to specify somehow
what parameters are guest handles, so that the privcmd driver could
check them ?  (Would it be sufficient to check the starts, or would
the ends need to be checked too?)

Ian.


* Re: Device model operation hypercall (DMOP, re qemu depriv)
  2016-08-09 10:48                                 ` Ian Jackson
@ 2016-08-09 11:30                                   ` Jan Beulich
  2016-08-12  9:44                                     ` George Dunlap
  0 siblings, 1 reply; 74+ messages in thread
From: Jan Beulich @ 2016-08-09 11:30 UTC (permalink / raw)
  To: Ian Jackson
  Cc: Stefano Stabellini, Wei Liu, Andrew Cooper, Tim Deegan,
	George Dunlap, David Vrabel, Anthony Perard, xen-devel, dgdegra

>>> On 09.08.16 at 12:48, <ian.jackson@eu.citrix.com> wrote:
> Jan Beulich writes ("Re: Device model operation hypercall (DMOP, re qemu 
> depriv)"):
>> Actually, having thought about this some more, and taking this
>> together with the expectations of the privcmd driver previously
>> outlined, I think this part is problematic: If all the driver is to know
>> is the position (within the interface structure) of the target domain
>> ID, then any guest handles embedded in the interface structure
>> (XEN_HVMCTL_track_dirty_vram only for now) couldn't get
>> validated, and hence user mode code would have a way to access
>> or modify kernel memory.
> 
> Could the hypervisor know the difference between user and kernel
> memory, in principle ?

Not without further new hypercalls, as the guest kernel would need
to tell Xen what address ranges are kernel vs user (and that implies
that any OS wishing to be able to act as Dom0 has a uniform
separation of address spaces).

> Alternatively, would it be possible for the ABI to specify somehow
> what parameters are guest handles, so that the privcmd driver could
> check them ?

We could presumably invent something, but I'm afraid it would end
up quite ugly.

>  (Would it be sufficient to check the starts, or would
> the ends need to be checked too?)

Both would need to be checked, so the size field(s) would need to
be locatable too.

Jan



* Re: Device model operation hypercall (DMOP, re qemu depriv)
  2016-08-09 11:30                                   ` Jan Beulich
@ 2016-08-12  9:44                                     ` George Dunlap
  2016-08-12 11:50                                       ` Jan Beulich
  0 siblings, 1 reply; 74+ messages in thread
From: George Dunlap @ 2016-08-12  9:44 UTC (permalink / raw)
  To: Jan Beulich, Ian Jackson
  Cc: Stefano Stabellini, Wei Liu, Andrew Cooper, Tim Deegan,
	David Vrabel, Anthony Perard, xen-devel, dgdegra

On 09/08/16 12:30, Jan Beulich wrote:
>>>> On 09.08.16 at 12:48, <ian.jackson@eu.citrix.com> wrote:
>> Jan Beulich writes ("Re: Device model operation hypercall (DMOP, re qemu 
>> depriv)"):
>>> Actually, having thought about this some more, and taking this
>>> together with the expectations to the privcmd driver previously
>>> outlined, I think this part is problematic: If all the driver is to know
>>> is the position (within the interface structure) of the target domain
>>> ID, then any guest handles embedded in the interface structure
>>> (XEN_HVMCTL_track_dirty_vram only for now) couldn't get
>>> validated, and hence user mode code would have a way to access
>>> or modify kernel memory.
>>
>> Could the hypervisor know the difference between user and kernel
>> memory, in principle ?
> 
> Not without further new hypercalls, as the guest kernel would need
> to tell Xen what address ranges are kernel vs user (and that implies
> that any OS wishing to be able to act as Dom0 has a uniform
> separation of address spaces).

Couldn't Xen tell from the guest pagetables whether the memory being
accessed was user-mode or kernel mode?

>> Alternatively, would it be possible for the ABI to specify somehow
>> what parameters are guest handles, so that the privcmd driver could
>> check them ?
> 
> We could presumably invent something, but I'm afraid it would end
> up quite ugly.
> 
>>  (Would it be sufficient to check the starts, or would
>> the ends need to be checked too?)
> 
> Both would need to be checked, so the size field(s) would need to
> be locatable too.

We could have the "fixed" part of the structure contain domid and an
array of <ptr, len> which the privcmd driver could check.  I don't think
that would be terrible.

Alternately, the "fixed" part of the hypercall could contain a
<start,end> range, which if non-zero, Xen should use to check any
pointer contained in the struct -- that would be more flexible probably.

Or we could do as Jan hints at above -- have some way to have dom0
communicate the kernel address range to Xen (either via hypercall, or
maybe via the shared info page) so that Xen just knows the address
layout for any individual domain.

And unless we're positive the guest kernel will never need these
hypercalls, we probably need a flag that allows kernel-mode pointers.

It's worth pointing out that the problem Xen has in distinguishing
user/kernel mode pointers is the same even if we use the alternate
suggestion of per-process XSM labels.

 -George
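
[Editorial sketch, for concreteness.] The <ptr, len> proposal above can
be illustrated with a minimal C sketch.  All names here (struct
dmop_header, DMOP_MAX_BUFS, the field layout) are hypothetical, not the
eventual Xen ABI; the point is only that a driver can check the start
and end of every embedded buffer without understanding the op itself:

```c
#include <assert.h>
#include <stdint.h>

#define DMOP_MAX_BUFS 4   /* hypothetical limit for the sketch */

struct dmop_buf {
    uint64_t ptr;   /* guest-virtual address of an embedded buffer */
    uint64_t len;   /* its size in bytes */
};

struct dmop_header {
    uint16_t domid;                      /* target domain */
    uint16_t nr_bufs;                    /* valid entries in bufs[] */
    struct dmop_buf bufs[DMOP_MAX_BUFS];
};

/* The check a privcmd-style driver could apply without knowing the op:
 * both the start and the end of every descriptor must lie within the
 * caller's user range, and the end computation must not wrap. */
static int dmop_bufs_ok(const struct dmop_header *h,
                        uint64_t user_lo, uint64_t user_hi)
{
    for (unsigned int i = 0; i < h->nr_bufs; i++) {
        uint64_t start = h->bufs[i].ptr;
        uint64_t end = start + h->bufs[i].len;

        if (end < start || start < user_lo || end > user_hi)
            return 0;
    }
    return 1;
}
```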


* Re: Device model operation hypercall (DMOP, re qemu depriv)
  2016-08-12  9:44                                     ` George Dunlap
@ 2016-08-12 11:50                                       ` Jan Beulich
  2016-08-15  9:39                                         ` George Dunlap
  0 siblings, 1 reply; 74+ messages in thread
From: Jan Beulich @ 2016-08-12 11:50 UTC (permalink / raw)
  To: George Dunlap
  Cc: Stefano Stabellini, Wei Liu, Andrew Cooper, Ian Jackson,
	Tim Deegan, David Vrabel, Anthony Perard, xen-devel, dgdegra

>>> On 12.08.16 at 11:44, <george.dunlap@citrix.com> wrote:
> On 09/08/16 12:30, Jan Beulich wrote:
>>>>> On 09.08.16 at 12:48, <ian.jackson@eu.citrix.com> wrote:
>>> Jan Beulich writes ("Re: Device model operation hypercall (DMOP, re qemu 
>>> depriv)"):
>>>> Actually, having thought about this some more, and taking this
>>>> together with the expectations to the privcmd driver previously
>>>> outlined, I think this part is problematic: If all the driver is to know
>>>> is the position (within the interface structure) of the target domain
>>>> ID, then any guest handles embedded in the interface structure
>>>> (XEN_HVMCTL_track_dirty_vram only for now) couldn't get
>>>> validated, and hence user mode code would have a way to access
>>>> or modify kernel memory.
>>>
>>> Could the hypervisor know the difference between user and kernel
>>> memory, in principle ?
>> 
>> Not without further new hypercalls, as the guest kernel would need
>> to tell Xen what address ranges are kernel vs user (and that implies
>> that any OS wishing to be able to act as Dom0 has a uniform
>> separation of address spaces).
> 
> Couldn't Xen tell from the guest pagetables whether the memory being
> accessed was user-mode or kernel mode?

That would be possible, but would feel like adding heuristics instead
of a proper distinction. Clearly we'd already be in some trouble if
there were cases where some structure doesn't get written to (and
hence could live in user-r/o mapped space), but others would need
to be verified to be user-r/w mapped. A lot of special casing, that is,
and hence a lot of things to be got wrong.

And then there is the problem of calling code being in rings 1 or 2:
Page tables don't guard ring 0 against such, and we don't even have
the notion of selectors (and hence address ranges) bounding
accessible regions. We can't even say we assume all of them to be
%ds-relative, as it would certainly be legitimate for such a structure
to e.g. live on the stack. Of course an option would be to require
the kernel driver to not allow requests from other than ring 3.

Plus finally - how would we tell interface structures coming from a
kernel invoked hypercall from those originating from user mode?
There would need to be at least some kind of flag then, which the
privcmd driver set, but normal hypercalls originating in the kernel
don't. Or would you envision to allow this DMOP hypercall to only
be made by user mode tools? If so, does stubdom run its qemu in
ring 3 or rather in ring 0?

[breaking the order of quoting]
> And unless we're positive the guest kernel will never need these
> hypercalls, we probably need a flag that allows kernel-mode pointers.

Ah, you actually mention that already.

>>>  (Would it be sufficient to check the starts, or would
>>> the ends need to be checked too?)
>> 
>> Both would need to be checked, so the size field(s) would need to
>> be locatable too.
>
> We could have the "fixed" part of the structure contain domid and an
> array of <ptr, len> which the privcmd driver could check.  I don't think
> that would be terrible.

Doable, yes, but not really nice, especially for the party invoking
the hypercall as well as the backing implementation in Xen (as
opposed to the privcmd driver, for which such a model would likely
work quite well), as they then can't use the normal, simple reading
of structure fields, but instead would need to populate array
elements in the right order.

Jan



* Re: Device model operation hypercall (DMOP, re qemu depriv)
  2016-08-12 11:50                                       ` Jan Beulich
@ 2016-08-15  9:39                                         ` George Dunlap
  2016-08-15 10:19                                           ` Jan Beulich
  0 siblings, 1 reply; 74+ messages in thread
From: George Dunlap @ 2016-08-15  9:39 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Stefano Stabellini, Wei Liu, Andrew Cooper, Ian Jackson,
	Tim Deegan, David Vrabel, Anthony Perard, xen-devel, dgdegra

On 12/08/16 12:50, Jan Beulich wrote:
>>>> On 12.08.16 at 11:44, <george.dunlap@citrix.com> wrote:
>> On 09/08/16 12:30, Jan Beulich wrote:
>>>>>> On 09.08.16 at 12:48, <ian.jackson@eu.citrix.com> wrote:
>>>> Jan Beulich writes ("Re: Device model operation hypercall (DMOP, re qemu 
>>>> depriv)"):
>>>>> Actually, having thought about this some more, and taking this
>>>>> together with the expectations to the privcmd driver previously
>>>>> outlined, I think this part is problematic: If all the driver is to know
>>>>> is the position (within the interface structure) of the target domain
>>>>> ID, then any guest handles embedded in the interface structure
>>>>> (XEN_HVMCTL_track_dirty_vram only for now) couldn't get
>>>>> validated, and hence user mode code would have a way to access
>>>>> or modify kernel memory.
>>>>
>>>> Could the hypervisor know the difference between user and kernel
>>>> memory, in principle ?
>>>
>>> Not without further new hypercalls, as the guest kernel would need
>>> to tell Xen what address ranges are kernel vs user (and that implies
>>> that any OS wishing to be able to act as Dom0 has a uniform
>>> separation of address spaces).
>>
>> Couldn't Xen tell from the guest pagetables whether the memory being
>> accessed was user-mode or kernel mode?
> 
> That would be possible, but would feel like adding heuristics instead
> of a proper distinction. Clearly we'd already be in some trouble if
> there were cases where some structure doesn't get written to (and
> hence could live in user-r/o mapped space), but others would need
> to be verified to be user-r/w mapped. A lot of special casing, that is,
> and hence a lot of things to be got wrong.
> 
> And then there is the problem of calling code being in rings 1 or 2:
> Page tables don't guard ring 0 against such, and we don't even have
> the notion of selectors (and hence address ranges) bounding
> accessible regions. We can't even say we assume all of them to be
> %ds-relative, as it would certainly be legitimate for such a structure
> to e.g. live on the stack. Of course an option would be to require
> the kernel driver to not allow requests from other than ring 3.
> 
> Plus finally - how would we tell interface structures coming from a
> kernel invoked hypercall from those originating from user mode?
> There would need to be at least some kind of flag then, which the
> privcmd driver set, but normal hypercalls originating in the kernel
> don't. Or would you envision to allow this DMOP hypercall to only
> be made by user mode tools? If so, does stubdom run its qemu in
> ring 3 or rather in ring 0?
> 
> [breaking the order of quoting]
>> And unless we're positive the guest kernel will never need these
>> hypercalls, we probably need a flag that allows kernel-mode pointers.
> 
> Ah, you actually mention that already.
> 
>>>>  (Would it be sufficient to check the starts, or would
>>>> the ends need to be checked too?)
>>>
>>> Both would need to be checked, so the size field(s) would need to
>>> be locatable too.
>>
>> We could have the "fixed" part of the structure contain domid and an
>> array of <ptr, len> which the privcmd driver could check.  I don't think
>> that would be terrible.
> 
> Doable, yes, but not really nice, especially for the party invoking
> the hypercall as well as the backing implementation in Xen (as
> opposed to the privcmd driver, for which such a model would likely
> work quite well), as they  then can't use the normal, simple reading
> of structure fields, but instead would need to populate array
> elements in the right order.

So on the whole, what would be your suggestion for how to solve the
userspace-pointer problem?

 -George



* Re: Device model operation hypercall (DMOP, re qemu depriv)
  2016-08-15  9:39                                         ` George Dunlap
@ 2016-08-15 10:19                                           ` Jan Beulich
  2016-08-15 10:47                                             ` George Dunlap
  0 siblings, 1 reply; 74+ messages in thread
From: Jan Beulich @ 2016-08-15 10:19 UTC (permalink / raw)
  To: George Dunlap
  Cc: Stefano Stabellini, Wei Liu, Andrew Cooper, Ian Jackson,
	Tim Deegan, David Vrabel, Anthony Perard, xen-devel, dgdegra

>>> On 15.08.16 at 11:39, <george.dunlap@citrix.com> wrote:
> On 12/08/16 12:50, Jan Beulich wrote:
>>>>> On 12.08.16 at 11:44, <george.dunlap@citrix.com> wrote:
>>> On 09/08/16 12:30, Jan Beulich wrote:
>>>>>>> On 09.08.16 at 12:48, <ian.jackson@eu.citrix.com> wrote:
>>>>> Jan Beulich writes ("Re: Device model operation hypercall (DMOP, re qemu 
>>>>> depriv)"):
>>>>>> Actually, having thought about this some more, and taking this
>>>>>> together with the expectations to the privcmd driver previously
>>>>>> outlined, I think this part is problematic: If all the driver is to know
>>>>>> is the position (within the interface structure) of the target domain
>>>>>> ID, then any guest handles embedded in the interface structure
>>>>>> (XEN_HVMCTL_track_dirty_vram only for now) couldn't get
>>>>>> validated, and hence user mode code would have a way to access
>>>>>> or modify kernel memory.
>>>>>
>>>>> Could the hypervisor know the difference between user and kernel
>>>>> memory, in principle ?
>>>>
>>>> Not without further new hypercalls, as the guest kernel would need
>>>> to tell Xen what address ranges are kernel vs user (and that implies
>>>> that any OS wishing to be able to act as Dom0 has a uniform
>>>> separation of address spaces).
>>>
>>> Couldn't Xen tell from the guest pagetables whether the memory being
>>> accessed was user-mode or kernel mode?
>> 
>> That would be possible, but would feel like adding heuristics instead
>> of a proper distinction. Clearly we'd already be in some trouble if
>> there were cases where some structure doesn't get written to (and
>> hence could live in user-r/o mapped space), but others would need
>> to be verified to be user-r/w mapped. A lot of special casing, that is,
>> and hence a lot of things to be got wrong.
>> 
>> And then there is the problem of calling code being in rings 1 or 2:
>> Page tables don't guard ring 0 against such, and we don't even have
>> the notion of selectors (and hence address ranges) bounding
>> accessible regions. We can't even say we assume all of them to be
>> %ds-relative, as it would certainly be legitimate for such a structure
>> to e.g. live on the stack. Of course an option would be to require
>> the kernel driver to not allow requests from other than ring 3.
>> 
>> Plus finally - how would we tell interface structures coming from a
>> kernel invoked hypercall from those originating from user mode?
>> There would need to be at least some kind of flag then, which the
>> privcmd driver set, but normal hypercalls originating in the kernel
>> don't. Or would you envision to allow this DMOP hypercall to only
>> be made by user mode tools? If so, does stubdom run its qemu in
>> ring 3 or rather in ring 0?
>> 
>> [breaking the order of quoting]
>>> And unless we're positive the guest kernel will never need these
>>> hypercalls, we probably need a flag that allows kernel-mode pointers.
>> 
>> Ah, you actually mention that already.
>> 
>>>>>  (Would it be sufficient to check the starts, or would
>>>>> the ends need to be checked too?)
>>>>
>>>> Both would need to be checked, so the size field(s) would need to
>>>> be locatable too.
>>>
>>> We could have the "fixed" part of the structure contain domid and an
>>> array of <ptr, len> which the privcmd driver could check.  I don't think
>>> that would be terrible.
>> 
>> Doable, yes, but not really nice, especially for the party invoking
>> the hypercall as well as the backing implementation in Xen (as
>> opposed to the privcmd driver, for which such a model would likely
>> work quite well), as they  then can't use the normal, simple reading
>> of structure fields, but instead would need to populate array
>> elements in the right order.
> 
> So on the whole, what would be your suggestion for how to solve the
> userspace-pointer problem?

Well, none of the options considered so far are really nice or
readily available. I think the easiest to use for both the caller and
the implementation of the hypercall would be the auxiliary
hypercall for a kernel to indicate user/kernel boundaries plus a
flag on the DMOP one for the kernel mode driver to indicate its
user mode origin. The main (purely theoretical afaict) downside
of this is the difficulty of using it in OSes with variable user/kernel
boundaries.

Jan
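
[Editorial sketch, for concreteness.] The shape favoured above — an
auxiliary hypercall for the kernel to register its user/kernel split,
plus a flag the privcmd driver sets on relayed DMOPs — might look like
the following.  All names (hyp_set_user_boundary, DMOP_FLAG_FROM_USER)
are invented for illustration:

```c
#include <assert.h>
#include <stdint.h>

#define DMOP_FLAG_FROM_USER 0x1  /* set by privcmd, never by the kernel */

static uint64_t g_user_top;      /* recorded by the auxiliary hypercall */

/* Auxiliary hypercall: the Dom0 kernel declares its user/kernel
 * boundary once (user space is [0, user_top) in this sketch). */
static void hyp_set_user_boundary(uint64_t user_top)
{
    g_user_top = user_top;
}

/* On a DMOP carrying FROM_USER, Xen refuses any embedded handle that
 * reaches into kernel space; kernel-originated DMOPs skip the check. */
static int dmop_handle_ok(uint64_t ptr, uint64_t len, unsigned int flags)
{
    if (!(flags & DMOP_FLAG_FROM_USER))
        return 1;                       /* kernel pointers allowed */
    return ptr < g_user_top && len <= g_user_top - ptr;
}
```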


* Re: Device model operation hypercall (DMOP, re qemu depriv)
  2016-08-15 10:19                                           ` Jan Beulich
@ 2016-08-15 10:47                                             ` George Dunlap
  2016-08-15 11:20                                               ` Jan Beulich
  0 siblings, 1 reply; 74+ messages in thread
From: George Dunlap @ 2016-08-15 10:47 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Stefano Stabellini, Wei Liu, Andrew Cooper, Ian Jackson,
	Tim Deegan, David Vrabel, Anthony Perard, xen-devel, dgdegra

On 15/08/16 11:19, Jan Beulich wrote:
>>>> On 15.08.16 at 11:39, <george.dunlap@citrix.com> wrote:
>> On 12/08/16 12:50, Jan Beulich wrote:
>>>>>> On 12.08.16 at 11:44, <george.dunlap@citrix.com> wrote:
>>>> On 09/08/16 12:30, Jan Beulich wrote:
>>>>>>>> On 09.08.16 at 12:48, <ian.jackson@eu.citrix.com> wrote:
>>>>>> Jan Beulich writes ("Re: Device model operation hypercall (DMOP, re qemu 
>>>>>> depriv)"):
>>>>>>> Actually, having thought about this some more, and taking this
>>>>>>> together with the expectations to the privcmd driver previously
>>>>>>> outlined, I think this part is problematic: If all the driver is to know
>>>>>>> is the position (within the interface structure) of the target domain
>>>>>>> ID, then any guest handles embedded in the interface structure
>>>>>>> (XEN_HVMCTL_track_dirty_vram only for now) couldn't get
>>>>>>> validated, and hence user mode code would have a way to access
>>>>>>> or modify kernel memory.
>>>>>>
>>>>>> Could the hypervisor know the difference between user and kernel
>>>>>> memory, in principle ?
>>>>>
>>>>> Not without further new hypercalls, as the guest kernel would need
>>>>> to tell Xen what address ranges are kernel vs user (and that implies
>>>>> that any OS wishing to be able to act as Dom0 has a uniform
>>>>> separation of address spaces).
>>>>
>>>> Couldn't Xen tell from the guest pagetables whether the memory being
>>>> accessed was user-mode or kernel mode?
>>>
>>> That would be possible, but would feel like adding heuristics instead
>>> of a proper distinction. Clearly we'd already be in some trouble if
>>> there were cases where some structure doesn't get written to (and
>>> hence could live in user-r/o mapped space), but others would need
>>> to be verified to be user-r/w mapped. A lot of special casing, that is,
>>> and hence a lot of things to be got wrong.
>>>
>>> And then there is the problem of calling code being in rings 1 or 2:
>>> Page tables don't guard ring 0 against such, and we don't even have
>>> the notion of selectors (and hence address ranges) bounding
>>> accessible regions. We can't even say we assume all of them to be
>>> %ds-relative, as it would certainly be legitimate for such a structure
>>> to e.g. live on the stack. Of course an option would be to require
>>> the kernel driver to not allow requests from other than ring 3.
>>>
>>> Plus finally - how would we tell interface structures coming from a
>>> kernel invoked hypercall from those originating from user mode?
>>> There would need to be at least some kind of flag then, which the
>>> privcmd driver set, but normal hypercalls originating in the kernel
>>> don't. Or would you envision to allow this DMOP hypercall to only
>>> be made by user mode tools? If so, does stubdom run its qemu in
>>> ring 3 or rather in ring 0?
>>>
>>> [breaking the order of quoting]
>>>> And unless we're positive the guest kernel will never need these
>>>> hypercalls, we probably need a flag that allows kernel-mode pointers.
>>>
>>> Ah, you actually mention that already.
>>>
>>>>>>  (Would it be sufficient to check the starts, or would
>>>>>> the ends need to be checked too?)
>>>>>
>>>>> Both would need to be checked, so the size field(s) would need to
>>>>> be locatable too.
>>>>
>>>> We could have the "fixed" part of the structure contain domid and an
>>>> array of <ptr, len> which the privcmd driver could check.  I don't think
>>>> that would be terrible.
>>>
>>> Doable, yes, but not really nice, especially for the party invoking
>>> the hypercall as well as the backing implementation in Xen (as
>>> opposed to the privcmd driver, for which such a model would likely
>>> work quite well), as they  then can't use the normal, simple reading
>>> of structure fields, but instead would need to populate array
>>> elements in the right order.
>>
>> So on the whole, what would be your suggestion for how to solve the
>> userspace-pointer problem?
> 
> Well, none of the options considered so far are really nice or
> readily available. I think the easiest to use for both the caller and
> the implementation of the hypercall would be the auxiliary
> hypercall for a kernel to indicate user/kernel boundaries plus a
> flag on the DMOP one for the kernel mode driver to indicate its
> user mode origin. The main (purely theoretical afaict) downside
> of this is the difficulty of using it in OSes with variable user/kernel
> boundaries.

What about including in the "fixed" part of the hypercall a virtual
address range that all pointers must be in?  That wouldn't even require
a user/kernel flag actually; and could conceivably be used by the caller
(either userspace or kernel space) to thwart certain kinds of potential
attacks.

It would take changing the copy_guest() macros to include a potential
range argument, but that shouldn't be too intrusive on the whole, I
wouldn't think.

 -George
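
[Editorial sketch, for concreteness.] The range-carrying copy variant
proposed above could take the following shape.  Here "guest memory" is
just a host buffer indexed by va so the check can be exercised without a
hypervisor; the function names are illustrative, and the real variant
would wrap Xen's copy_from_guest machinery:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* The caller-declared allowable range is half-open: [lo, hi). */
static int range_ok(uint64_t va, uint64_t len, uint64_t lo, uint64_t hi)
{
    return va >= lo && va < hi && len <= hi - va;
}

/* copy_from_guest-style helper that additionally verifies the source
 * pointer against the range from the fixed part of the hypercall. */
static int copy_from_guest_checked(void *dst, const uint8_t *mem,
                                   uint64_t va, uint64_t len,
                                   uint64_t lo, uint64_t hi)
{
    if (!range_ok(va, len, lo, hi))
        return -1;   /* pointer escapes the declared range: refuse */
    memcpy(dst, mem + va, len);
    return 0;
}
```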



* Re: Device model operation hypercall (DMOP, re qemu depriv)
  2016-08-15 10:47                                             ` George Dunlap
@ 2016-08-15 11:20                                               ` Jan Beulich
  2016-08-15 12:07                                                 ` Ian Jackson
  2016-08-15 14:57                                                 ` George Dunlap
  0 siblings, 2 replies; 74+ messages in thread
From: Jan Beulich @ 2016-08-15 11:20 UTC (permalink / raw)
  To: George Dunlap
  Cc: Stefano Stabellini, Wei Liu, Andrew Cooper, Ian Jackson,
	Tim Deegan, David Vrabel, Anthony Perard, xen-devel, dgdegra

>>> On 15.08.16 at 12:47, <george.dunlap@citrix.com> wrote:
> On 15/08/16 11:19, Jan Beulich wrote:
>> Well, none of the options considered so far are really nice or
>> readily available. I think the easiest to use for both the caller and
>> the implementation of the hypercall would be the auxiliary
>> hypercall for a kernel to indicate user/kernel boundaries plus a
>> flag on the DMOP one for the kernel mode driver to indicate its
>> user mode origin. The main (purely theoretical afaict) downside
>> of this is the difficulty of using it in OSes with variable user/kernel
>> boundaries.
> 
> What about including in the "fixed" part of the hypercall a virtual
> address range that all pointers must be in?  That wouldn't even require
> a user/kernel flag actually; and could conceivably be used by the caller
> (either userspace or kernel space) to thwart certain kinds of potential
> attacks.

That's definitely an option, if we're sufficiently certain that no OSes
will ever require two or more ranges.

Jan



* Re: Device model operation hypercall (DMOP, re qemu depriv)
  2016-08-15 11:20                                               ` Jan Beulich
@ 2016-08-15 12:07                                                 ` Ian Jackson
  2016-08-15 14:20                                                   ` Jan Beulich
  2016-08-15 14:57                                                 ` George Dunlap
  1 sibling, 1 reply; 74+ messages in thread
From: Ian Jackson @ 2016-08-15 12:07 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Stefano Stabellini, Wei Liu, Andrew Cooper, Tim Deegan,
	George Dunlap, David Vrabel, Anthony Perard, xen-devel, dgdegra

Jan Beulich writes ("Re: Device model operation hypercall (DMOP, re qemu depriv)"):
> On 15.08.16 at 12:47, <george.dunlap@citrix.com> wrote:
> > What about including in the "fixed" part of the hypercall a virtual
> > address range that all pointers must be in?  That wouldn't even require
> > a user/kernel flag actually; and could conceivably be used by the caller
> > (either userspace or kernel space) to thwart certain kinds of potential
> > attacks.
> 
> That's definitely an option, if we're sufficiently certain that no OSes
> will ever require two or more ranges.

How hard would it be to allow the caller to specify several allowable
ranges ?

Note that the hypercall argument construction code in libxc already
has to handle all hypercall argument memory specially, so libxc could
automatically build a list of the arguments' memory addresses.

What would be needed is some kind of restriction on (or variant of)
copy_* which double-checked against the list provided in the
non-op-specific part of the hypercall.

Ian.
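
[Editorial sketch, for concreteness.] The several-ranges variant asked
about above is a small extension of the single-range check: a pointer is
acceptable if it fits wholly inside any one of the caller-supplied
ranges.  Names here are invented for illustration:

```c
#include <assert.h>
#include <stdint.h>

struct va_range {
    uint64_t lo, hi;   /* half-open [lo, hi) */
};

/* The check a copy_* variant would perform against the list of ranges
 * carried in the non-op-specific part of the hypercall. */
static int in_allowed_ranges(uint64_t va, uint64_t len,
                             const struct va_range *r, unsigned int nr)
{
    for (unsigned int i = 0; i < nr; i++)
        if (va >= r[i].lo && va < r[i].hi && len <= r[i].hi - va)
            return 1;
    return 0;
}
```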


* Re: Device model operation hypercall (DMOP, re qemu depriv)
  2016-08-15 12:07                                                 ` Ian Jackson
@ 2016-08-15 14:20                                                   ` Jan Beulich
  0 siblings, 0 replies; 74+ messages in thread
From: Jan Beulich @ 2016-08-15 14:20 UTC (permalink / raw)
  To: Ian Jackson
  Cc: Stefano Stabellini, Wei Liu, Andrew Cooper, Tim Deegan,
	George Dunlap, David Vrabel, Anthony Perard, xen-devel, dgdegra

>>> On 15.08.16 at 14:07, <ian.jackson@eu.citrix.com> wrote:
> Jan Beulich writes ("Re: Device model operation hypercall (DMOP, re qemu 
> depriv)"):
>> On 15.08.16 at 12:47, <george.dunlap@citrix.com> wrote:
>> > What about including in the "fixed" part of the hypercall a virtual
>> > address range that all pointers must be in?  That wouldn't even require
>> > a user/kernel flag actually; and could conceivably be used by the caller
>> > (either userspace or kernel space) to thwart certain kinds of potential
>> > attacks.
>> 
>> That's definitely an option, if we're sufficiently certain that no OSes
>> will ever require two or more ranges.
> 
> How hard would it be to allow the caller to specify several allowable
> ranges ?

Not very hard - just like with so many things an additional level
of indirection would do.

> Note that the hypercall argument construction code in libxc already
> has to handle all hypercall argument memory specially, so libxc could
> automatically build a list of the arguments' memory addresses.
> 
> What would be needed is some kind of restriction on (or variant of)
> copy_* which double-checked against the list provided in the
> non-op-specific part of the hypercall.

Yeah, as George already mentioned. I'd favor "variant of", to avoid
penalizing all other callers.

Jan



* Re: Device model operation hypercall (DMOP, re qemu depriv)
  2016-08-09 10:29                               ` Jan Beulich
  2016-08-09 10:48                                 ` Ian Jackson
@ 2016-08-15 14:50                                 ` David Vrabel
  2016-08-15 15:24                                   ` Jan Beulich
  1 sibling, 1 reply; 74+ messages in thread
From: David Vrabel @ 2016-08-15 14:50 UTC (permalink / raw)
  To: Jan Beulich, Ian Jackson
  Cc: Stefano Stabellini, Wei Liu, Andrew Cooper, Tim Deegan,
	George Dunlap, Anthony Perard, xen-devel, dgdegra

On 09/08/16 11:29, Jan Beulich wrote:
>>>> On 08.08.16 at 15:46, <ian.jackson@eu.citrix.com> wrote:
>> Jan Beulich writes ("Re: Device model operation hypercall (DMOP, re qemu 
>> depriv)"):
>>> On 05.08.16 at 18:28, <ian.jackson@eu.citrix.com> wrote:
>>>> That is, a bug of class 2 would allow the unprivileged qemu process in
>>>> dom0 to cause damage to other parts of dom0.
>> ...
>>> Ah, okay, I think I finally understand. [...]
>>>
>>> I'm having, however, a hard time imagining a class 2 bug for any
>>> of the hvmop-s that are being converted by the hvmctl series:
>>> These act on the target domain, so would not touch the calling
>>> ones state other than for copying argument structures to/from
>>> guest memory (and I don't view mixing up domain pointers as
>>> a likely source of problems).
>>
>> Right.  AIUI all the hypercall arguments are passed using
>> calling-guest-virtual addresses, which the dom0 privcmd arrangements
>> are capable of ensuring are sane.
> 
> Actually, having thought about this some more, and taking this
> together with the expectations to the privcmd driver previously
> outlined, I think this part is problematic: If all the driver is to know
> is the position (within the interface structure) of the target domain
> ID, then any guest handles embedded in the interface structure
> (XEN_HVMCTL_track_dirty_vram only for now) couldn't get
> validated, and hence user mode code would have a way to access
> or modify kernel memory.

It seems simpler to me to have in the privcmd driver:

    if (op == HVMCTL_track_dirty_vram)
        ret = access_ok(...);

It looks simpler to me to fix the ABI and add the checking in the
privcmd driver than to add one of the proposed mechanisms to allow the
hypervisor to do the checking.

To avoid the issues with having to update the kernel in lock-step with
the hypervisor (if new checks are needed in privcmd), we can (in the
common case) version the checking in the driver.

i.e., if the privcmd drivers knows about hypervisor ABI V but qemu needs
V+1 then it can choose to disable the depriv and thus continue to work
(with reduced defense in depth).

David
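
[Editorial sketch, for concreteness.] The suggestion above — privcmd
knows, per op it understands, where embedded handles sit, and falls back
to running privileged for ops newer than its known ABI — might look like
this.  Op numbers, struct layout and KNOWN_ABI_VERSION are hypothetical,
and access_ok() is modelled here as a plain range check:

```c
#include <assert.h>
#include <stdint.h>

#define HVMCTL_track_dirty_vram 1
#define KNOWN_ABI_VERSION       2   /* highest ABI this driver can check */

struct track_dirty_vram {
    uint64_t first_pfn, nr;
    uint64_t dirty_bitmap;   /* guest handle embedded in the op struct */
};

/* Stand-in for the kernel's access_ok(): user space is [0, user_top). */
static int access_ok(uint64_t ptr, uint64_t len, uint64_t user_top)
{
    return ptr < user_top && len <= user_top - ptr;
}

/* Returns 1 if the op may proceed deprivileged, 0 if it must be
 * rejected, -1 if unknown (caller may disable depriv and continue). */
static int privcmd_check_op(unsigned int op, const void *arg,
                            unsigned int abi_version, uint64_t user_top)
{
    if (abi_version > KNOWN_ABI_VERSION)
        return -1;   /* newer ABI: cannot validate this op */

    switch (op) {
    case HVMCTL_track_dirty_vram: {
        const struct track_dirty_vram *t = arg;
        /* one dirty bit per pfn, rounded up to whole bytes */
        return access_ok(t->dirty_bitmap, (t->nr + 7) / 8, user_top);
    }
    default:
        return 1;    /* no embedded handles to check */
    }
}
```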


* Re: Device model operation hypercall (DMOP, re qemu depriv)
  2016-08-15 11:20                                               ` Jan Beulich
  2016-08-15 12:07                                                 ` Ian Jackson
@ 2016-08-15 14:57                                                 ` George Dunlap
  2016-08-15 15:22                                                   ` Jan Beulich
  1 sibling, 1 reply; 74+ messages in thread
From: George Dunlap @ 2016-08-15 14:57 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Stefano Stabellini, Wei Liu, Andrew Cooper, Ian Jackson,
	Tim Deegan, David Vrabel, Anthony Perard, xen-devel, dgdegra

On 15/08/16 12:20, Jan Beulich wrote:
>>>> On 15.08.16 at 12:47, <george.dunlap@citrix.com> wrote:
>> On 15/08/16 11:19, Jan Beulich wrote:
>>> Well, none of the options considered so far are really nice or
>>> readily available. I think the easiest to use for both the caller and
>>> the implementation of the hypercall would be the auxiliary
>>> hypercall for a kernel to indicate user/kernel boundaries plus a
>>> flag on the DMOP one for the kernel mode driver to indicate its
>>> user mode origin. The main (purely theoretical afaict) downside
>>> of this is the difficulty to use it in OSes with variable user/kernel
>>> boundaries.
>>
>> What about including in the "fixed" part of the hypercall a virtual
>> address range that all pointers must be in?  That wouldn't even require
>> a user/kernel flag actually; and could conceivably be used by the caller
>> (either userspace or kernel space) to thwart certain kinds of potential
>> attacks.
> 
> That's definitely an option, if we're sufficiently certain that no OSes
> will ever require two or more ranges.

I'm sorry, I think this is getting a bit silly.  There are currently no
known operating systems which have discontinuous user-space virtual
address ranges.  Even if there were, the hypercall structure I'm
proposing would still function; the only restriction would be that any
single hypercall would have to have all arguments within one of those
ranges.

If we find such a monster in the future, we can try to figure out how to
accommodate it at that point.

Another idea I had was to do a multi-call-style hypercall "wrapper"
hypercall, similar to David's XSM idea, which, instead of running with a
specific XSM label, would restrict the target domain and VA range of all
included hypercalls.  Then the individual hypercall structure wouldn't
need to be known at all; and if it ever became important to have (say)
two VA ranges, we could simply add a new "wrapper" hypercall.
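The VA-range idea discussed above might look roughly like this. A minimal sketch only: the struct and field names are hypothetical, and a plain integer range stands in for the real guest-handle machinery.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical fixed DMOP header: the target domain plus the single
 * virtual-address range every embedded pointer must fall inside. */
struct dmop_header {
    uint16_t  target_domid;
    uintptr_t va_lo;  /* inclusive lower bound */
    uintptr_t va_hi;  /* exclusive upper bound */
};

/* Hypervisor-side validation of one embedded guest pointer,
 * written to avoid overflow in ptr + len. */
static int dmop_ptr_ok(const struct dmop_header *hdr,
                       uintptr_t ptr, size_t len)
{
    return ptr >= hdr->va_lo &&
           len <= (size_t)(hdr->va_hi - hdr->va_lo) &&
           ptr <= hdr->va_hi - len;
}
```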

 -George


* Re: Device model operation hypercall (DMOP, re qemu depriv)
  2016-08-15 14:57                                                 ` George Dunlap
@ 2016-08-15 15:22                                                   ` Jan Beulich
  0 siblings, 0 replies; 74+ messages in thread
From: Jan Beulich @ 2016-08-15 15:22 UTC (permalink / raw)
  To: George Dunlap
  Cc: StefanoStabellini, Wei Liu, Andrew Cooper, Ian Jackson,
	Tim Deegan, David Vrabel, Anthony Perard, xen-devel, dgdegra

>>> On 15.08.16 at 16:57, <george.dunlap@citrix.com> wrote:
> On 15/08/16 12:20, Jan Beulich wrote:
>>>>> On 15.08.16 at 12:47, <george.dunlap@citrix.com> wrote:
>>> On 15/08/16 11:19, Jan Beulich wrote:
>>>> Well, none of the options considered so far are really nice or
>>>> readily available. I think the easiest to use for both the caller and
>>>> the implementation of the hypercall would be the auxiliary
>>>> hypercall for a kernel to indicate user/kernel boundaries plus a
>>>> flag on the DMOP one for the kernel mode driver to indicate its
>>>> user mode origin. The main (purely theoretical afaict) downside
>>>> of this is the difficulty to use it in OSes with variable user/kernel
>>>> boundaries.
>>>
>>> What about including in the "fixed" part of the hypercall a virtual
>>> address range that all pointers must be in?  That wouldn't even require
>>> a user/kernel flag actually; and could conceivably be used by the caller
>>> (either userspace or kernel space) to thwart certain kinds of potential
>>> attacks.
>> 
>> That's definitely an option, if we're sufficiently certain that no OSes
>> will ever require two or more ranges.
> 
> I'm sorry, I think this is getting a bit silly.  There are currently no
> known operating systems which have discontinuous user-space virtual
> address ranges.  Even if there were, the hypercall structure I'm
> proposing would still function; the only restriction would be that any
> single hypercall would have to have all arguments within one of those
> ranges.

Well, you then discount the original vDSO range 64-bit Linux had
high up in the fixmap area. Not that this area would be usable for
anything bad here, but it shows that such an idea isn't pure theory.

> Another idea I had was to do a multi-call-style hypercall "wrapper"
> hypercall, similar to David's XSM idea, but instead of running with a
> specific XSM label, would restrict the target domain and VA range of all
> included hypercalls.  Then the individual hypercall structure wouldn't
> need to be known at all; and if it ever became important to have (say)
> two VA ranges, we could simply add a new "wrapper" hypercall.

That should work too, yes.

Jan



* Re: Device model operation hypercall (DMOP, re qemu depriv)
  2016-08-15 14:50                                 ` David Vrabel
@ 2016-08-15 15:24                                   ` Jan Beulich
  2016-08-26 11:29                                     ` Ian Jackson
  0 siblings, 1 reply; 74+ messages in thread
From: Jan Beulich @ 2016-08-15 15:24 UTC (permalink / raw)
  To: David Vrabel
  Cc: StefanoStabellini, Wei Liu, Andrew Cooper, Ian Jackson,
	GeorgeDunlap, Tim Deegan, Anthony Perard, xen-devel, dgdegra

>>> On 15.08.16 at 16:50, <david.vrabel@citrix.com> wrote:
> On 09/08/16 11:29, Jan Beulich wrote:
>>>>> On 08.08.16 at 15:46, <ian.jackson@eu.citrix.com> wrote:
>>> Jan Beulich writes ("Re: Device model operation hypercall (DMOP, re qemu 
>>> depriv)"):
>>>> On 05.08.16 at 18:28, <ian.jackson@eu.citrix.com> wrote:
>>>>> That is, a bug of class 2 would allow the unprivileged qemu process in
>>>>> dom0 to cause damage to other parts of dom0.
>>> ...
>>>> Ah, okay, I think I finally understand. [...]
>>>>
>>>> I'm having, however, a hard time imagining a class 2 bug for any
>>>> of the hvmop-s that are being converted by the hvmctl series:
>>>> These act on the target domain, so would not touch the calling
>>>> ones state other than for copying argument structures to/from
>>>> guest memory (and I don't view mixing up domain pointers as
>>>> a likely source of problems).
>>>
>>> Right.  AIUI all the hypercall arguments are passed using
>>> calling-guest-virtual addresses, which the dom0 privcmd arrangements
>>> are capable of ensuring are sane.
>> 
>> Actually, having thought about this some more, and taking this
>> together with the expectations to the privcmd driver previously
>> outlined, I think this part is problematic: If all the driver is to know
>> is the position (within the interface structure) of the target domain
>> ID, then any guest handles embedded in the interface structure
>> (XEN_HVMCTL_track_dirty_vram only for now) couldn't get
>> validated, and hence user mode code would have a way to access
>> or modify kernel memory.
> 
> It seems simpler to me to have in the privcmd driver:
> 
>     if (op == HVMCTL_track_dirty_vram)
>         ret = access_ok(...);
> 
> It looks simpler to me to fix the ABI and add the checking in the
> privcmd driver than add one of the proposed mechanisms to allow the
> hypervisor to do the checking.

Simpler, yes, but ...

> To avoid the issues with having to update the kernel in lock-step with
> the hypervisor (if new checks are needed in privcmd), we can (in the
> common case) do version the checking in the driver.
> 
> i.e., if the privcmd drivers knows about hypervisor ABI V but qemu needs
> V+1 then it can choose to disable the depriv and thus continue to work
> (with reduced defense in depth).

... less flexible, and prone to end up in a mess once we have more
than a handful of versions for the driver to deal with.

Jan



* Re: Device model operation hypercall (DMOP, re qemu depriv)
  2016-08-15 15:24                                   ` Jan Beulich
@ 2016-08-26 11:29                                     ` Ian Jackson
  2016-08-26 12:58                                       ` Jan Beulich
  0 siblings, 1 reply; 74+ messages in thread
From: Ian Jackson @ 2016-08-26 11:29 UTC (permalink / raw)
  To: Jan Beulich
  Cc: StefanoStabellini, Wei Liu, Andrew Cooper, Tim Deegan,
	GeorgeDunlap, David Vrabel, Anthony Perard, xen-devel, dgdegra

Jan Beulich writes ("Re: Device model operation hypercall (DMOP, re qemu depriv)"):
> On 15.08.16 at 16:50, <david.vrabel@citrix.com> wrote:
> > It seems simpler to me to have in the privcmd driver:
> > 
> >     if (op == HVMCTL_track_dirty_vram)
> >         ret = access_ok(...);
> > 
> > It looks simpler to me to fix the ABI and add the checking in the
> > privcmd driver than add one of the proposed mechanisms to allow the
> > hypervisor to do the checking.
> 
> Simpler, yes, but ...
> 
> > To avoid the issues with having to update the kernel in lock-step with
> > the hypervisor (if new checks are needed in privcmd), we can (in the
> > common case) do version the checking in the driver.
> > 
> > i.e., if the privcmd drivers knows about hypervisor ABI V but qemu needs
> > V+1 then it can choose to disable the depriv and thus continue to work
> > (with reduced defense in depth).
> 
> ... less flexible, and prone to end up in a mess once we have more
> than a handful of versions for the driver to deal with.

I agree.

To summarise the current proposal:

DMOP is roughly the same as Jan's HVMCTL.  It needs two
enhancements compared to HVMCTL:

 * Each hypercall transferred to DMOP[1] must promise not to do bad
   things to the calling domain (or to the whole system).  There will
   need to be some minimal audit (or consideration of the hypercall's
   functionality) for this purpose.  Jan thinks this is not too much
   work but wants an example or two of a plausible vulnerability that
   needs to be excluded.

 * We will need to enable the privcmd driver to defend dom0's memory.

   This will be done by: the privcmd driver specifying to the
   hypervisor (either inside each DMOP hypercall struct, or in a
   previous hypercall) a single address range which is permitted for
   this purpose.  (This would be the vaddr range which is used for
   userspace.)

   If the range is specified separately, the DMOP hypercall needs a
   flag to say whether this specific DMOP call ought to be subject to
   the restriction.


We also discussed compatibility.  We need to present a stable ABI and
API to qemu.  We considered making a limited stability promise for the
DMOP hypercalls, but this would involve the hypervisor supporting
multiple old versions of DMOPs which is not ideal from a security
point of view.  Instead, it is proposed that this will be dealt with
in libxc.

In more detail, I think the plan then has to look something like this:

Suppose a DMOP is introduced in Xen version K, and then modified in a
non-ABI-compatible way in Xen version K+1.

Xen K+1 implements only the new ABI, and Xen K provides only the old.
libxc in K provides only the single entrypoint for the old DMOP.

libxc in K+1 provides a new entrypoint for the new DMOP (with enhanced
parameters, or whatever).  This new entrypoint needs no compatibility
handling and works only on Xen >= K+1 (and otherwise returns
ENOSYS).

However, libxc in K+1 also provides the old entrypoint.  This old
entrypoint checks the hypervisor DMOP ABI version, and makes either
the old or new hypercall.

For this to work, the ABI definition for the retired DMOP must remain
in xen.git somewhere, so that the code in libxc can call it.

Also, there should be a single DMOP ABI version which can be retrieved
from the hypervisor and cached by libxc.  This could be the hypervisor
ABI version if we don't mind breakage during development.
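As a rough illustration of the compatibility scheme above: all function names and the ABI-version plumbing here are hypothetical stand-ins, and the actual hypercalls are stubbed out.

```c
#include <errno.h>

/* Hypothetical ABI version at which the DMOP was reworked. */
#define DMOP_ABI_NEW 2

static int cached_abi; /* stands in for a one-off libxc query of Xen */

/* Stubs for the actual old- and new-encoding hypercalls. */
static int do_dmop_old_abi(int arg)            { return arg; }
static int do_dmop_new_abi(int arg, int extra) { return arg + extra; }

/* Old libxc entrypoint as shipped in release K+1: keeps working on
 * both ABIs by translating to the new hypercall when needed. */
static int xc_dmop_frobnicate(int arg)
{
    if (cached_abi >= DMOP_ABI_NEW)
        return do_dmop_new_abi(arg, 0); /* default for the new field */
    return do_dmop_old_abi(arg);
}

/* New entrypoint in K+1: no compat handling, fails on older Xen. */
static int xc_dmop_frobnicate2(int arg, int extra)
{
    if (cached_abi < DMOP_ABI_NEW)
        return -ENOSYS;
    return do_dmop_new_abi(arg, extra);
}
```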


Is this plan a good one for everyone ?

Thanks,
Ian.


[1] By this I don't mean to take a point of view about the name.


* Re: Device model operation hypercall (DMOP, re qemu depriv)
  2016-08-08 14:07                               ` Jan Beulich
@ 2016-08-26 11:38                                 ` Ian Jackson
  2016-08-26 12:58                                   ` Jan Beulich
  0 siblings, 1 reply; 74+ messages in thread
From: Ian Jackson @ 2016-08-26 11:38 UTC (permalink / raw)
  To: Jan Beulich
  Cc: StefanoStabellini, Wei Liu, Andrew Cooper, Tim Deegan,
	GeorgeDunlap, David Vrabel, Anthony Perard, xen-devel, dgdegra

Jan Beulich writes ("Re: Device model operation hypercall (DMOP, re qemu depriv)"):
> On 08.08.16 at 15:46, <ian.jackson@eu.citrix.com> wrote:
> > So would it therefore be OK to introduce the enhanced security promise
> > - the lack of `class 2' bugs - for HVMCTL from the beginning ?
> 
> I think so, since ...
> 
> > This would involve a small amount of extra thought for each invididual
> > hypercall, just to check that the assumptions we are relying on (as
> > you put them above) are not violated.
> 
> ... this looks to be a manageable amount of code auditing (albeit
> I'd like to see whether someone else can perhaps come up with
> some potential, more realistic kind of bug that could fall into class
> 2 before volunteering to make an attempt at doing such an audit).

Right.

Let me try to think of some examples.  Thinking `aloud':

The real problem comes if a DMOP talks about the calling domain's
resources or namespaces, implicitly or explicitly.

An example of an explicit reference to the calling domain's resources
is a reference to memory space in the calling domain (vaddrs).  We
have already had an extensive discussion of that...

Another example would be a DMOP that takes (or returns) an event
channel number in the calling domain.  This would be a problem because
there would be nothing to stop qemu from messing about with evtchns
which dom0 is using for other purposes (or conversely, there would be
no way for the dom0 evtchn driver to know about the returned evtchn
number and allow qemu to receive it).

Another might be a DMOP that implicitly grants the target domain some
of the calling domain's scheduling priority.  (I realise this is quite
implausible from a scheduling API POV, but it gives an idea.)

Another example is that of course VCPU pool management and VCPU-PCPU
pinning must not be available via DMOP.

(I write `qemu' here for brevity and clarity, but really I mean any
DMOP caller which is supposed to be privileged for the target domain
but not generally privileged.)

Does that help at all ?

Thanks,
Ian.


* Re: Device model operation hypercall (DMOP, re qemu depriv)
  2016-08-26 11:38                                 ` Ian Jackson
@ 2016-08-26 12:58                                   ` Jan Beulich
  2016-08-26 14:35                                     ` Ian Jackson
  0 siblings, 1 reply; 74+ messages in thread
From: Jan Beulich @ 2016-08-26 12:58 UTC (permalink / raw)
  To: Ian Jackson
  Cc: StefanoStabellini, Wei Liu, Andrew Cooper, Tim Deegan,
	GeorgeDunlap, David Vrabel, Anthony Perard, xen-devel, dgdegra

>>> On 26.08.16 at 13:38, <ian.jackson@eu.citrix.com> wrote:
> Jan Beulich writes ("Re: Device model operation hypercall (DMOP, re qemu 
> depriv)"):
>> On 08.08.16 at 15:46, <ian.jackson@eu.citrix.com> wrote:
>> > So would it therefore be OK to introduce the enhanced security promise
>> > - the lack of `class 2' bugs - for HVMCTL from the beginning ?
>> 
>> I think so, since ...
>> 
>> > This would involve a small amount of extra thought for each invididual
>> > hypercall, just to check that the assumptions we are relying on (as
>> > you put them above) are not violated.
>> 
>> ... this looks to be a manageable amount of code auditing (albeit
>> I'd like to see whether someone else can perhaps come up with
>> some potential, more realistic kind of bug that could fall into class
>> 2 before volunteering to make an attempt at doing such an audit).
> 
> Right.
> 
> Let me try to think of some examples.  Thinking `aloud':
> 
> The real problem comes if a DMOP talks about the calling domain's
> resources or namespaces, implicitly or explicitly.
> 
> An example of an explicit reference to the calling domain's resources
> is the references to memory space in the calling domain (vaddrs).  We
> have already had an extensive discussion of that...
> 
> Another example would be a DMOP that takes (or returns) an event
> channel number in the calling domain.  This would be a problem because
> there would be nothing to stop qemu from messing about with evtchns
> which dom0 is using for other purposes (or conversely, there would be
> no way for the dom0 evtchn driver to know about the returned evtchn
> number and allow qemu to receive it).

Doesn't that follow the more general "mixing up own and target
domains" pattern, which is relatively easy to audit for?

> Another might be a DMOP that implicitly grants the target domain some
> of the calling domain's scheduling priority.  (I realise this is quite
> implausible from a scheduling API POV, but it gives an idea.)
> 
> Another example is that of course VCPU pool management and VCPU-PCPU
> pinning must not be available via DMOP.
> 
> (I write `qemu' here for brevity and clarity, but really I mean any
> DMOP caller which is supposed to be privileged for the target domain
> but not generally privileged.)

These all look rather contrived, especially keeping in mind that
what we mean to exclude right now are accidental violations of
the intended isolation. I.e. I think for all of those one would need
to go to some lengths to actually achieve the "goal", but they are
rather unlikely to be the result of a bug.

Jan



* Re: Device model operation hypercall (DMOP, re qemu depriv)
  2016-08-26 11:29                                     ` Ian Jackson
@ 2016-08-26 12:58                                       ` Jan Beulich
  0 siblings, 0 replies; 74+ messages in thread
From: Jan Beulich @ 2016-08-26 12:58 UTC (permalink / raw)
  To: Ian Jackson
  Cc: StefanoStabellini, Wei Liu, Andrew Cooper, Tim Deegan,
	GeorgeDunlap, David Vrabel, Anthony Perard, xen-devel, dgdegra

>>> On 26.08.16 at 13:29, <ian.jackson@eu.citrix.com> wrote:
> Is this plan a good one for everyone ?

Sounds reasonable to me; just needs settling on a few of the actual
details.

Jan



* Re: Device model operation hypercall (DMOP, re qemu depriv)
  2016-08-26 12:58                                   ` Jan Beulich
@ 2016-08-26 14:35                                     ` Ian Jackson
  2016-08-26 15:13                                       ` Jan Beulich
  0 siblings, 1 reply; 74+ messages in thread
From: Ian Jackson @ 2016-08-26 14:35 UTC (permalink / raw)
  To: Jan Beulich
  Cc: StefanoStabellini, Wei Liu, Andrew Cooper, Tim Deegan,
	GeorgeDunlap, David Vrabel, Anthony Perard, xen-devel, dgdegra

Jan Beulich writes ("Re: Device model operation hypercall (DMOP, re qemu depriv)"):
> On 26.08.16 at 13:38, <ian.jackson@eu.citrix.com> wrote:
> > Another example would be a DMOP that takes (or returns) an event
> > channel number in the calling domain.  This would be a problem because
> > there would be nothing to stop qemu from messing about with evtchns
> > which dom0 is using for other purposes (or conversely, there would be
> > no way for the dom0 evtchn driver to know about the returned evtchn
> > number and allow qemu to receive it).
> 
> Doesn't that follow the more general "mixing up own and target
> domains" pattern, which is relatively easy to audit for?

Yes, as I understand what you mean by that pattern, indeed.

> > Another might be a DMOP that implicitly grants the target domain some
> > of the calling domain's scheduling priority.  (I realise this is quite
> > implausible from a scheduling API POV, but it gives an idea.)
> > 
> > Another example is that of course VCPU pool management and VCPU-PCPU
> > pinning must not be available via DMOP.
> > 
> > (I write `qemu' here for brevity and clarity, but really I mean any
> > DMOP caller which is supposed to be privileged for the target domain
> > but not generally privileged.)
> 
> These all look rather contrived, especially keeping in mind that
> what we mean to exclude right now are accidental violations of
> the intended isolation. I.e. I think for all of those one would need
> to go to some lengths to actually achieve the "goal", but they are
> rather unlikely to be the result of a bug.

Right.

So I think this confirms your conclusion that this "audit" (ie,
checking that there are no problems in these kinds of categories) won't be
very difficult ?

Ian.


* Re: Device model operation hypercall (DMOP, re qemu depriv)
  2016-08-26 14:35                                     ` Ian Jackson
@ 2016-08-26 15:13                                       ` Jan Beulich
  2016-08-30 11:02                                         ` Ian Jackson
  0 siblings, 1 reply; 74+ messages in thread
From: Jan Beulich @ 2016-08-26 15:13 UTC (permalink / raw)
  To: Ian Jackson
  Cc: StefanoStabellini, Wei Liu, Andrew Cooper, Tim Deegan,
	GeorgeDunlap, David Vrabel, Anthony Perard, xen-devel, dgdegra

>>> On 26.08.16 at 16:35, <ian.jackson@eu.citrix.com> wrote:
> Jan Beulich writes ("Re: Device model operation hypercall (DMOP, re qemu 
> depriv)"):
>> On 26.08.16 at 13:38, <ian.jackson@eu.citrix.com> wrote:
>> > Another example would be a DMOP that takes (or returns) an event
>> > channel number in the calling domain.  This would be a problem because
>> > there would be nothing to stop qemu from messing about with evtchns
>> > which dom0 is using for other purposes (or conversely, there would be
>> > no way for the dom0 evtchn driver to know about the returned evtchn
>> > number and allow qemu to receive it).
>> 
>> Doesn't that follow the more general "mixing up own and target
>> domains" pattern, which is relatively easy to audit for?
> 
> Yes, as I understand what you mean by that pattern, indeed.
> 
>> > Another might be a DMOP that implicitly grants the target domain some
>> > of the calling domain's scheduling priority.  (I realise this is quite
>> > implausible from a scheduling API POV, but it gives an idea.)
>> > 
>> > Another example is that of course VCPU pool management and VCPU-PCPU
>> > pinning must not be available via DMOP.
>> > 
>> > (I write `qemu' here for brevity and clarity, but really I mean any
>> > DMOP caller which is supposed to be privileged for the target domain
>> > but not generally privileged.)
>> 
>> These all look rather contrived, especially keeping in mind that
>> what we mean to exclude right now are accidental violations of
>> the intended isolation. I.e. I think for all of those one would need
>> to go to some lengths to actually achieve the "goal", but they are
>> rather unlikely to be the result of a bug.
> 
> Right.
> 
> So I think this confirms your conclusion that this "audit" (ie,
> checking that there are problems in these kind of categories) won't be
> very difficult ?

Well, in a way. And then not: Initially I had thought no issue would
arise, until I came to realize the kernel memory corruption potential.
Question is whether now we're overlooking some other not
immediately obvious issue. The problem with auditing is that
generally you can only look for things you're aware of (or magically
become aware of while looking at the code). But I guess we should
just go ahead with the patterns we know of.

Jan



* Re: Device model operation hypercall (DMOP, re qemu depriv)
  2016-08-26 15:13                                       ` Jan Beulich
@ 2016-08-30 11:02                                         ` Ian Jackson
  2016-08-30 21:47                                           ` Stefano Stabellini
  2016-09-02 14:08                                           ` Wei Liu
  0 siblings, 2 replies; 74+ messages in thread
From: Ian Jackson @ 2016-08-30 11:02 UTC (permalink / raw)
  To: Jan Beulich
  Cc: StefanoStabellini, Wei Liu, Andrew Cooper, Tim Deegan,
	GeorgeDunlap, David Vrabel, Anthony Perard, xen-devel, dgdegra

Jan Beulich writes ("Re: Device model operation hypercall (DMOP, re qemu depriv)"):
> Well, in a way. And then not: Initially I had thought no issue would
> arise, until I came to realize the kernel memory corruption potential.
> Question is whether now we're overlooking some other not
> immediately obvious issue. The problem with auditing is that
> generally you can only look for things you're aware of (or magically
> become aware of while looking at the code). But I guess we should
> just go ahead with the patterns we know of.

I think so, yes.  I will take a look at the interfaces, at least, to
see if I can spot anything missing.  This will probably generate some
more stupid questions...

So, then, is everyone now happy with the overall approach ?  That is,
as I wrote up in:
  Message-ID: <22464.10246.708893.563258@mariner.uk.xensource.com>
  Subject: Re: Device model operation hypercall (DMOP, re qemu depriv)
  Date: Fri, 26 Aug 2016 12:29:10 +0100

Thanks,
Ian.


* Re: Device model operation hypercall (DMOP, re qemu depriv)
  2016-08-30 11:02                                         ` Ian Jackson
@ 2016-08-30 21:47                                           ` Stefano Stabellini
  2016-09-02 14:08                                           ` Wei Liu
  1 sibling, 0 replies; 74+ messages in thread
From: Stefano Stabellini @ 2016-08-30 21:47 UTC (permalink / raw)
  To: Ian Jackson
  Cc: StefanoStabellini, Wei Liu, Andrew Cooper, Tim Deegan,
	GeorgeDunlap, David Vrabel, Jan Beulich, Anthony Perard,
	xen-devel, dgdegra

On Tue, 30 Aug 2016, Ian Jackson wrote:
> Jan Beulich writes ("Re: Device model operation hypercall (DMOP, re qemu depriv)"):
> > Well, in a way. And then not: Initially I had thought no issue would
> > arise, until I came to realize the kernel memory corruption potential.
> > Question is whether now we're overlooking some other not
> > immediately obvious issue. The problem with auditing is that
> > generally you can only look for things you're aware of (or magically
> > become aware of while looking at the code). But I guess we should
> > just go ahead with the patterns we know of.
> 
> I think so, yes.  I will take a look at the interfaces, at least, to
> see if I can spot anything missing.  This will probably generate some
> more stupid questions...
> 
> So, then, is everyone now happy with the overall approach ?  That is,
> as I wrote up in:
>   Message-ID: <22464.10246.708893.563258@mariner.uk.xensource.com>
>   Subject: Re: Device model operation hypercall (DMOP, re qemu depriv)
>   Date: Fri, 26 Aug 2016 12:29:10 +0100

I think it looks good.


* Re: Device model operation hypercall (DMOP, re qemu depriv)
  2016-08-30 11:02                                         ` Ian Jackson
  2016-08-30 21:47                                           ` Stefano Stabellini
@ 2016-09-02 14:08                                           ` Wei Liu
  1 sibling, 0 replies; 74+ messages in thread
From: Wei Liu @ 2016-09-02 14:08 UTC (permalink / raw)
  To: Ian Jackson
  Cc: StefanoStabellini, Wei Liu, Andrew Cooper, Tim Deegan,
	GeorgeDunlap, David Vrabel, Jan Beulich, Anthony Perard,
	xen-devel, dgdegra

On Tue, Aug 30, 2016 at 12:02:26PM +0100, Ian Jackson wrote:
> Jan Beulich writes ("Re: Device model operation hypercall (DMOP, re qemu depriv)"):
> > Well, in a way. And then not: Initially I had thought no issue would
> > arise, until I came to realize the kernel memory corruption potential.
> > Question is whether now we're overlooking some other not
> > immediately obvious issue. The problem with auditing is that
> > generally you can only look for things you're aware of (or magically
> > become aware of while looking at the code). But I guess we should
> > just go ahead with the patterns we know of.
> 
> I think so, yes.  I will take a look at the interfaces, at least, to
> see if I can spot anything missing.  This will probably generate some
> more stupid questions...
> 
> So, then, is everyone now happy with the overall approach ?  That is,
> as I wrote up in:
>   Message-ID: <22464.10246.708893.563258@mariner.uk.xensource.com>
>   Subject: Re: Device model operation hypercall (DMOP, re qemu depriv)
>   Date: Fri, 26 Aug 2016 12:29:10 +0100
> 

It looks like a reasonable plan to me.

Wei.


* Re: Device model operation hypercall (DMOP, re qemu depriv)
  2016-08-01 11:32 ` Device model operation hypercall (DMOP, re qemu depriv) Ian Jackson
  2016-08-01 12:41   ` Jan Beulich
  2016-08-02 11:37   ` Wei Liu
@ 2016-09-09 15:16   ` Jennifer Herbert
  2016-09-09 15:34     ` David Vrabel
                       ` (2 more replies)
  2 siblings, 3 replies; 74+ messages in thread
From: Jennifer Herbert @ 2016-09-09 15:16 UTC (permalink / raw)
  To: Ian Jackson, Jan Beulich, Andrew Cooper, dgdegra, Wei Liu,
	George Dunlap, Stefano Stabellini, Tim Deegan, David Vrabel,
	Anthony Perard, Konrad Rzeszutek Wilk
  Cc: xen-devel

On 01/08/16 12:32, Ian Jackson wrote:
> I think we need to introduce a new hypercall (which I will call DMOP
> for now) which may augment or replace some of HVMCTL.  Let me explain:
>

I believe the new 'DMOP' hypercall is a good idea, but following on
from discussions, I propose a revised design, which I present below.
Please let me know what you think.

Thanks,
Jenny.


DMOP (multi-buffer variant)
===========================


Introduction
------------

A previous proposal for a 'DMOP' was put forward by Ian Jackson on the 1st
of August. That proposal seemed very promising; however, a problem was
identified with it, which this proposal addresses.

The aim of DMOP, as before, is to prevent a compromised device model from
compromising domains other than the one it is associated with (and which
is therefore likely already compromised).

The previous proposal adds a DMOP hypercall, for use by device models,
which places the domain ID in a fixed place within the calling arguments,
such that the privcmd driver can always find it, and need not know any
further details of the particular DMOP in order to validate it against
the previously set (via ioctl) domain.

The problem occurs when a DMOP contains references to other user memory
objects, such as with track-dirty-VRAM (as used with the VGA buffer).
In this case, the address of this other user object needs to be vetted,
to ensure it is not within a restricted address range, such as kernel
memory.  The real problem comes down to how you would vet this address:
the ideal place to do so is within the privcmd driver, since it has
knowledge of the address space involved.  However, since a principal goal
of DMOP is to keep privcmd free from any knowledge of DMOP's sub-ops,
it would have no way to identify which user buffer addresses need
checking.  The alternative of having the hypervisor vet the addresses
is also problematic, since it has no knowledge of the guest memory layout.


The Design
----------

As with the previous design, we provide a new restriction ioctl, which
takes a domid parameter.  After that restriction ioctl is called, the
privcmd driver will permit only DMOP hypercalls, and only with the
specified target domid.

In the previous design, a DMOP consisted of one buffer, containing all of
the DMOP parameters, which may include further explicit references to
more buffers.  In this design, an array of buffers (with lengths) is
presented, with the first one containing the DMOP parameters, which may
implicitly reference further buffers within the array.  Here, the only
user buffers passed are those found within the array, and so all can be
audited by privcmd.  Since the length of each buffer is passed alongside
it, privcmd does not need to know which DMOP it is in order to audit them.

If the hypervisor ends up with the wrong number of buffers, it can reject
the DMOP at that point.

The following code illustrates this idea:

typedef struct dm_op_buffer {
     XEN_GUEST_HANDLE(void) h;
     size_t len;
} dm_op_buffer_t;

int
HYPERVISOR_device_model_op(
     domid_t domid,
     unsigned int nr_buffers,
     XEN_GUEST_HANDLE_PARAM(dm_op_buffer_t) buffers)

@domid: the domain the hypercall operates on.
@nr_buffers: the number of buffers in the @buffers array.

@buffers: an array of buffers.  @buffers[0] contains device_model_op - the
structure describing the sub-op and its parameters. @buffers[1], @buffers[2]
etc. may be used by a sub-op for passing additional buffers.

struct device_model_op {
     uint32_t op;
     union {
          struct op_1 op1;
          struct op_2 op2;
          /* etc... */
     } u;
};

It is forbidden for the above struct (device_model_op) to contain any
guest handles - if they are needed, they should instead be in
HYPERVISOR_device_model_op->buffers.

Validation by privcmd driver
----------------------------

If the privcmd driver has been restricted to a specific domain (using the
new ioctl), then when it receives an op, it will:

1. Check the hypercall is a DMOP.

2. Check domid == restricted domid.

3. For each of the @nr_buffers in @buffers: check @h and @len give a buffer
    wholly in the user-space part of the virtual address space (e.g.,
    on Linux, using access_ok()).
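As a rough, self-contained sketch of the audit in step 3 (this is not the
actual Linux privcmd code: access_ok() is stubbed as a simple limit check,
and the 3GiB user split is an assumption):

```c
#include <assert.h>
#include <stddef.h>

/* Assumed user/kernel split; a real kernel gets this from the arch. */
#define USER_LIMIT 0xC0000000UL

/* Hypothetical stand-in for the kernel's access_ok(): checks that
 * [addr, addr + len) lies wholly below the user address limit. */
static int access_ok_stub(unsigned long addr, size_t len)
{
    if ( addr >= USER_LIMIT )
        return 0;
    return len <= USER_LIMIT - addr;
}

/* Models the privcmd-side view of a buffer descriptor: a user
 * virtual address plus length, mirroring struct dm_op_buffer. */
struct dm_op_buffer {
    unsigned long uaddr;
    size_t len;
};

/* Step 3 of the validation: audit every buffer before forwarding
 * the hypercall.  Returns 0 if all pass, -1 on the first failure. */
static int audit_dm_buffers(const struct dm_op_buffer *bufs,
                            unsigned int nr_buffers)
{
    unsigned int i;

    for ( i = 0; i < nr_buffers; i++ )
        if ( !access_ok_stub(bufs[i].uaddr, bufs[i].len) )
            return -1;

    return 0;
}
```

Note that the audit never inspects which DMOP sub-op is being made; the
buffer array and its length are all privcmd needs.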


Xen Implementation
------------------

Since a DMOP sub-op may need to copy a buffer from, or return one to, the
guest, as well as the DMOP itself needing to get the initial parameter
buffer, functions for doing this would be written as below.  Note that
care is taken to prevent damage from buffer under- or over-run situations.
If the DMOP is called with insufficient buffers, zeros will be read, while
extra buffers are ignored.


int copy_dm_buffer_from_guest(
     void *dst,                        /* Kernel destination buffer      */
     size_t max_len,                   /* Size of destination buffer     */
     XEN_GUEST_HANDLE_PARAM(dm_op_buffer_t) buffers,
                                       /* dm_op_buffers passed into DMOP */
     unsigned int nr_buffers,          /* total number of dm_op_buffers  */
     unsigned int idx)                 /* Index of buffer we require     */
{
         struct dm_op_buffer buffer;
         size_t len;

         /* Zero-fill so that a short guest buffer reads as zeros. */
         memset(dst, 0, max_len);

         if ( idx >= nr_buffers )
             return -EFAULT;

         /* copy_from_guest_offset() returns the number of elements
            which could not be copied: non-zero means failure. */
         if ( copy_from_guest_offset(&buffer, buffers, idx, 1) )
             return -EFAULT;

         len = min(max_len, buffer.len);

         if ( raw_copy_from_guest(dst, buffer.h, len) )
             return -EFAULT;

         return 0;
}

int copy_dm_buffer_to_guest(...)
{
     /* Similar to the above, except copying in the other
        direction. */
}
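The elided mirror function can be sketched outside Xen with the guest-copy
primitives modelled by plain pointers and memcpy(); everything below is an
illustrative stand-in, not the eventual Xen implementation:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Stand-in for the real dm_op_buffer: the guest handle is modelled
 * as a plain pointer so the sketch runs standalone. */
struct dm_op_buffer {
    void *h;
    size_t len;
};

static size_t min_size(size_t a, size_t b) { return a < b ? a : b; }

/* Copy src into the guest buffer at index idx.  The copy is
 * truncated to the guest buffer's length, so a short guest buffer
 * cannot be overrun; extra guest space is left untouched. */
static int copy_dm_buffer_to_guest(
    const struct dm_op_buffer *buffers,   /* models the handle array */
    unsigned int nr_buffers,
    unsigned int idx,
    const void *src, size_t src_len)
{
    size_t len;

    if ( idx >= nr_buffers )
        return -EFAULT;

    len = min_size(src_len, buffers[idx].len);
    memcpy(buffers[idx].h, src, len);     /* models raw_copy_to_guest() */

    return 0;
}
```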

This leaves do_device_model_op easy to implement as below:


int do_device_model_op(domid_t domid,
    unsigned int nr_buffers,
    XEN_GUEST_HANDLE_PARAM(dm_op_buffer_t) buffers)
{
     struct device_model_op op;
     int ret;

     ret = copy_dm_buffer_from_guest(&op, sizeof(op), buffers,
                                     nr_buffers, 0);
     if ( ret < 0 )
         return ret;

     switch ( op.op )
     {
         case DMOP_sub_op1:
             /* ... */
         break;
         /* etc. */
     }

     return 0;
}


Summary
--------

Advantages of this system over previous DMOP proposals:


*  The validation of address ranges is easily done by the privcmd driver,
    using standard kernel mechanisms.  There is no need for Xen to reason
    about guest memory layout, which it should be independent of, and
    which would potentially add confusion.

*  An arbitrary number of additional address ranges is validated with the
    same mechanism as the initial parameter block.

*  No need for any new copy_from_guest() variants in the hypervisor,
    which, among other things, prevents code using the wrong one in error,
    potentially bypassing security.


And as with the original DMOP proposal:

*  The privcmd driver, and any other kernel parts, will not need to be
    updated when new DMOPs are added or changed.


Disadvantages of this system:

* Minor stylistic issue relating to buffers being implicitly referred to.





* Re: Device model operation hypercall (DMOP, re qemu depriv)
  2016-09-09 15:16   ` Jennifer Herbert
@ 2016-09-09 15:34     ` David Vrabel
  2016-09-12 13:47     ` George Dunlap
  2016-09-12 14:32     ` Jan Beulich
  2 siblings, 0 replies; 74+ messages in thread
From: David Vrabel @ 2016-09-09 15:34 UTC (permalink / raw)
  To: Jennifer Herbert, Ian Jackson, Jan Beulich, Andrew Cooper,
	dgdegra, Wei Liu, George Dunlap, Stefano Stabellini, Tim Deegan,
	David Vrabel, Anthony Perard, Konrad Rzeszutek Wilk
  Cc: xen-devel

On 09/09/16 16:16, Jennifer Herbert wrote:
> On 01/08/16 12:32, Ian Jackson wrote:
>> I think we need to introduce a new hypercall (which I will call DMOP
>> for now) which may augment or replace some of HVMCTL.  Let me explain:
>>
> 
> I believe the new 'DMOP' hypercall is a good idea,  but following on
> from discussions, I propose a revised design, which I present below.
> Please let me know what you think.
> 
> Thanks,
> Jenny.
> 
> 
> DMOP  (multi-buffer variant)
> ============================
[...]
> Advantages of this system, over previouse DMOP proposals:
> 
> 
> *  The validation of address ranges is easily done by the privcmd driver,
>    using standard kernel mechanisms.  No need to get Xen thinking about
>    guest memory layout, which it should be independent of, and potentially
>    adding confusion.

+1.

The user address limit in Linux is a per-thread property, so trying to
pass this information to the hypervisor would require passing this
information in every dm_op, or switching the information on every task
switch.  The user address limit is also architecture specific, and would
require every arch to expose this via some new API.

David


* XenProject/XenServer QEMU working group minutes, 30th August 2016
  2016-07-28 17:01 XenProject/XenServer QEMU working group, Friday 8th July, 2016, 15:00 Jennifer Herbert
  2016-08-01 11:32 ` Device model operation hypercall (DMOP, re qemu depriv) Ian Jackson
@ 2016-09-09 16:18 ` Jennifer Herbert
  2016-09-12  7:16   ` Juergen Gross
  2016-10-14 18:01   ` QEMU XenServer/XenProject Working group meeting 29th September 2016 Jennifer Herbert
  1 sibling, 2 replies; 74+ messages in thread
From: Jennifer Herbert @ 2016-09-09 16:18 UTC (permalink / raw)
  To: xen-devel

QEMU XenServer/XenProject Working group meeting 30th August 2016
================================================================

Attendance
----------

Andrew Cooper
Ian Jackson
Paul Durrant
David Vrabel
Jennifer Herbert

Introduction
------------

Introduced Paul Durrant to the working group.

Started by recapping our purpose: a way to make it possible for QEMU
to make hypercalls without too much privilege, in a way which is
upstreamable.  A deprivileged QEMU in dom0 must not be able to abuse
the interface to compromise the dom0 kernel.

QEMU Hypercalls – DM op
-----------------------

There has been much discussion on xen-devel.  A problem identified is
when you have operations with references to other user memory objects,
such as with Track Dirty VRAM (as used with the VGA buffer).  At the
moment, apparently there is only that one, but others may emerge.

The most obvious solution would involve the guest kernel validating the
virtual address passed; however, that would rely on the guest kernel
knowing where the objects were.  This is to be avoided.

Ian recounted how there were various proposals on xen-devel involving,
essentially, informing the hypervisor in some way about which virtual
addresses were being referred to by the hypercall.  Many of these
involved this information being transmitted via a different channel.

Ian suggested providing a way for the kernel to tell the hypervisor
which user virtual ranges are dm-op-allowed memory.  There would then
be a flag in the dm op, in a fixed location, telling the hypervisor
that the op only refers to this pre-approved memory.


A scheme of pre-nominating an area in QEMU, maybe using hypercall
buffers, was briefly discussed, as well as a few other ideas, but it was
concluded that this doesn't really address the problem of future DM ops,
of which there could easily be many.  Even if we can avoid the problem
with special cases for our current set-up, we still need a story for how
to add future interfaces with handles, without the need to change the
kernel interface.  Once we come up with a story, we wouldn't necessarily
have to implement it.


The concept of using physically addressed hypercall buffers was
discussed.  Privcmd could allocate you a place, and mmap it into user
memory, and this would be the only memory used with the hypercalls.  A
hypercall would tell you the buffer range.  Each QEMU would need to be
associated with the correct set of physical buffers.

A recent AMD proposal was discussed, which would use only physical
addresses, no virtual addresses.  The upshot being that we should come
up with a solution that is not incompatible with this.

Ideas further discussed: user code could just put stuff in mmapped
memory, and only refer to offsets within that buffer.  The privcmd
driver would fill in the physical details.  All dm ops would have 3
arguments: the dm op, a pointer to a struct, and an optional pointer to
a restriction array – the last of which is filled in by the privcmd
driver.  It was discussed how the privcmd driver must not look at the
dm op number, in particular, to know how to validate addresses, as it
must be independent of the API.

A scheme where QEMU calls an ioctl before it drops privileges, to set
up restrictions ahead of time, was discussed.  One scheme might work by
setting up a range for a given domain or VCPU.

The assumption is that all device models running in the same domain
have the same virtual address layout.  There would then be a flag, in
the stable part of the API, saying whether to apply that restriction –
any in-kernel dm op would not apply it.

The idea can be extended – to have more than one address range, or to
have the range explicitly provided in the hypercall.  This latter
suggestion is preferred; however, each platform would have different
valid address ranges, and privcmd is platform independent.  It was
discussed how a function could be created to return valid ranges for a
given platform, but this is not considered an elegant solution.  The
third parameter of the dm op could be an array of ranges, where the
common case for virtual addresses may be 0-3GB, but for physical
addresses it might be quite fragmented.


A further idea was proposed: extend the dm op to have a fixed part,
with an array of guest handles that the kernel can audit.  The
arguments would be:

Arg1: Domain ID.
Arg2: Guest handle array of tuples (address, size).
Arg3: Number of guest handles.

The first element of the array could be the DM op structure itself,
containing the DM op code, and the other arguments to the particular op.
The privcmd driver would only pass through what is provided by the
user.  Any extra elements would be ignored by the hypercall, and if
there were insufficient, the hypercall code would see a NULL, and be
able to fail gracefully.

The initial block (of dm arguments) passed in the array would be copied
into pre-zeroed memory of the maximum op size, having checked that the
size is not greater than this.  There is no need to check a minimum:
the buffer is initialised to zero, so a zero length would result in op
0 being called.  Functions/macros could be created to make retrieving
such a block easier.

Any further blocks needed would be implicitly referred to, as a given
dm op knows it will put buffer foo in array position bar.  It would
then use the provided functions/macros to retrieve it.

This last idea was compared with the proposal previously posted to
xen-devel by Ian.  This scheme is slightly messier in the dm op code,
having to refer to numbers instead of fields; however, the pros are
that:

* It's more extensible.  It doesn't involve providing a new, unusual
   copy-to-user-memory macro that could be misused, with security
   implications.
* The restriction is bound to the specific call, and can vary.
* Privcmd is slightly simpler; it can just call access_ok().
* It is compatible with a physical-access scheme.

David agreed to write this idea up in a design document.  He will not
need to discuss any individual dm ops, but should describe the pros
and cons compared with the other ideas on the table.


XenStore
--------

The xs-restrict mechanism was summarised, along with its limitation – it
does not work through the kernel XenStore driver, which is needed to
talk to a XenStore domain.  A way to fix this would be to create a
wrapper.

Another approach is to try and remove XenStore from all non-privileged
parts of QEMU – as it is thought there isn't that much use remaining.
Protocols such as QMP would be used instead.  PV drivers such as QDISK
could be run in a separate QEMU process – for which a patch exists.
There were concerns this would take a lot of time to achieve.

Although time ran out, it was vaguely concluded that multiple
approaches could be run in parallel, where initially xs-restrict is
used as is, and then the XenStore wrapper could be developed alongside
efforts to reduce XenStore use in QEMU.  Even with the XenStore
wrapper, QEMU may benefit from reducing the number of communication
protocols in use – i.e. removing XenStore use.



Action items
------------

David: Write up latest DM op proposal.
Jenny: Write up and arrange next meetup.




* Re: XenProject/XenServer QEMU working group minutes, 30th August 2016
  2016-09-09 16:18 ` XenProject/XenServer QEMU working group minutes, 30th August 2016 Jennifer Herbert
@ 2016-09-12  7:16   ` Juergen Gross
  2016-10-14 18:01   ` QEMU XenServer/XenProject Working group meeting 29th September 2016 Jennifer Herbert
  1 sibling, 0 replies; 74+ messages in thread
From: Juergen Gross @ 2016-09-12  7:16 UTC (permalink / raw)
  To: Jennifer Herbert, xen-devel

On 09/09/16 18:18, Jennifer Herbert wrote:
> QEMU XenServer/XenProject Working group meeting 30th August 2016
> ================================================================
...
> XenStore
> --------
> 
> The xs-restrict mechanism was summarised, and its limitation – it does
> not work though the kernel XenStore driver, which is needed to talk to
> a XenStore domain.  A way to fix this would be to create a wrapper.
> 
> Another approach is to try and remove XenStore from all
> non-priverlaged parts of QEMU – as it is thought there isn't that much
> use remaining.  Protocols such as QMP would be used instead.  PV
> drivers such as QDISK could be run in a separate qemu process – for
> which a patch exists. There where concerns this would like a lot of
> time to achieve.
> 
> Although time ran out, it was vaguely concluded that multiple
> approaches could be run in parallel, where initially xs-restrict is
> used as is, and then a the xenstore wrapper could be developed
> alongside efforts to reduce XenStore use in QEMU.  Even with the
> XenStore wrapper, QEMU may benefit from reducing the number of
> communication protocols in use – ie removing XenStore use.

What about the following:

Split the transaction ID of Xenstore messages (32 bits) into two 16-bit
parts.  The high part is a restriction ID (0 == unrestricted).
As the transaction ID is local to a connection, 16 bits should always
be enough for transaction IDs: we would still have more transaction
IDs available than domain IDs.

xs-restrict returns a restriction ID to be used for the connection
from now on. This can be easily added by a kernel wrapper without
having to modify the Xenstore protocol: there are just some bits
with a special meaning now which have always been under control of the
Xenstore daemon/domain.

A socket connection with Xenstore daemon using xs-restrict will force
the complete connection to be restricted (as today in oxenstored), while
a connection from another domain (Xenstore domain case) will rely on the
kernel wrapper of the connecting domain to do the restriction ID
handling correctly.

Adding support for that feature in Xenstore daemon/domain is very easy
as the needed modifications should be rather local.
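The proposed split can be expressed as a pair of pack/unpack helpers
(the names here are illustrative, not from any Xenstore code):

```c
#include <assert.h>
#include <stdint.h>

/* The 32-bit wire transaction ID is split into a 16-bit restriction
 * ID in the high half (0 == unrestricted) and a 16-bit per-connection
 * transaction ID in the low half. */
static uint32_t pack_txid(uint16_t restriction_id, uint16_t tx_id)
{
    return ((uint32_t)restriction_id << 16) | tx_id;
}

static uint16_t txid_restriction(uint32_t wire_txid)
{
    return wire_txid >> 16;
}

static uint16_t txid_transaction(uint32_t wire_txid)
{
    return wire_txid & 0xffff;
}
```

A kernel wrapper would rewrite the transaction-ID field this way on the
restricted connection's behalf, leaving the rest of the Xenstore
protocol untouched.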


Juergen


* Re: Device model operation hypercall (DMOP, re qemu depriv)
  2016-09-09 15:16   ` Jennifer Herbert
  2016-09-09 15:34     ` David Vrabel
@ 2016-09-12 13:47     ` George Dunlap
  2016-09-12 14:32     ` Jan Beulich
  2 siblings, 0 replies; 74+ messages in thread
From: George Dunlap @ 2016-09-12 13:47 UTC (permalink / raw)
  To: Jennifer Herbert, Ian Jackson, Jan Beulich, Andrew Cooper,
	dgdegra, Wei Liu, George Dunlap, Stefano Stabellini, Tim Deegan,
	David Vrabel, Anthony Perard, Konrad Rzeszutek Wilk
  Cc: xen-devel

On 09/09/16 16:16, Jennifer Herbert wrote:
> The following code illustrates this idea:
> 
> typedef struct dm_op_buffer {
>     XEN_GUEST_HANDLE(void) h;
>     size_t len;
> } dm_op_buffer_t;
> 
> int
> HYPERVISOR_device_model_op(
>     domid_t domid,
>     unsigned int nr_buffers,
>     XEN_GUEST_HANDLE_PARAM(dm_op_buffer_t) buffers)
> 
> @domid: the domain the hypercall operates on.
> @nr_buffers; the number of buffers in the @buffers array.
> 
> @buffers: an array of buffers.  @buffers[0] contains device_model_op - the
> structure describing the sub-op and its parameters. @buffers[1],
> @buffers[2]
> etc. may be used by a sub-op for passing additional buffers.
> 
> struct device_model_op {
>     uint32_t op;
>     union {
>          struct op_1 op1;
>          struct op_2 op2;
>          /* etc... */
>     } u;
> };
> 
> It is forbidden for the above struct (device_model_op) to contain any
> guest handles - if they are needed, they should instead be in
> HYPERVISOR_device_model_op->buffers.

Sounds plausible to me.

 -George



* Re: Device model operation hypercall (DMOP, re qemu depriv)
  2016-09-09 15:16   ` Jennifer Herbert
  2016-09-09 15:34     ` David Vrabel
  2016-09-12 13:47     ` George Dunlap
@ 2016-09-12 14:32     ` Jan Beulich
  2016-09-13 10:37       ` George Dunlap
  2016-09-13 16:07       ` David Vrabel
  2 siblings, 2 replies; 74+ messages in thread
From: Jan Beulich @ 2016-09-12 14:32 UTC (permalink / raw)
  To: Jennifer Herbert
  Cc: StefanoStabellini, Wei Liu, George Dunlap, Andrew Cooper,
	Ian Jackson, Tim Deegan, David Vrabel, Anthony Perard, xen-devel,
	dgdegra

>>> On 09.09.16 at 17:16, <Jennifer.Herbert@citrix.com> wrote:
> The following code illustrates this idea:
> 
> typedef struct dm_op_buffer {
>      XEN_GUEST_HANDLE(void) h;
>      size_t len;
> } dm_op_buffer_t;

This implies that we'll lose all type safety on the handles passed, as
is also emphasized by the use of raw_copy_from_guest() in the code
outline further down.

> int
> HYPERVISOR_device_model_op(
>      domid_t domid,
>      unsigned int nr_buffers,
>      XEN_GUEST_HANDLE_PARAM(dm_op_buffer_t) buffers)

Along the same lines this will add implicit agreements (presumably
solely written out as comments, or maybe via manifest constants)
about which element of the buffer array has which meaning, quite
contrary to the otherwise enforced agreement (through use of
structure fields).

> int copy_dm_buffer_from_guest(
>      void *dst,                        /* Kernel destination buffer      */
>      size_t max_len,                   /* Size of destination buffer     */
>      XEN_GUEST_HANDLE_PARAM(dm_op_buffer_t) buffers,
>                                        /* dm_op_buffers passed into DMOP */
>      unsigned int nr_buffers,          /* total number of dm_op_buffers  */
>      unsigned int idx)                 /* Index of buffer we require     */

Looking at other hypercalls, it is not uncommon that arrays get
read element by element. While of course this function can be
extended suitably (perhaps with additional macro wrappers to
deal with the base types of such arrays), it will then become more
cumbersome to use, extending the "Minor stylistic issue" mentioned
in the disadvantages section further down.

Jan



* Re: Device model operation hypercall (DMOP, re qemu depriv)
  2016-09-12 14:32     ` Jan Beulich
@ 2016-09-13 10:37       ` George Dunlap
  2016-09-13 11:53         ` Jan Beulich
  2016-09-13 16:07       ` David Vrabel
  1 sibling, 1 reply; 74+ messages in thread
From: George Dunlap @ 2016-09-13 10:37 UTC (permalink / raw)
  To: Jan Beulich, Jennifer Herbert
  Cc: StefanoStabellini, Wei Liu, George Dunlap, Andrew Cooper,
	Ian Jackson, Tim Deegan, David Vrabel, Anthony Perard, xen-devel,
	dgdegra

On 12/09/16 15:32, Jan Beulich wrote:
>>>> On 09.09.16 at 17:16, <Jennifer.Herbert@citrix.com> wrote:
>> The following code illustrates this idea:
>>
>> typedef struct dm_op_buffer {
>>      XEN_GUEST_HANDLE(void) h;
>>      size_t len;
>> } dm_op_buffer_t;
> 
> This implies that we'll lose all type safety on the handles passed, as
> is also emphasized by the use of raw_copy_from_guest() in the code
> outline further down.

If most of the time the hypercalls are made by calling libxc functions,
and the libxc functions have types as arguments, then the end caller has
the same type safety.  We'll have to be careful inside the individual
libxc functions, but that should be fairly straightforward to do.  So
the places where we need to take extra care should be very localized.

 -George



* Re: Device model operation hypercall (DMOP, re qemu depriv)
  2016-09-13 10:37       ` George Dunlap
@ 2016-09-13 11:53         ` Jan Beulich
  0 siblings, 0 replies; 74+ messages in thread
From: Jan Beulich @ 2016-09-13 11:53 UTC (permalink / raw)
  To: George Dunlap, Jennifer Herbert
  Cc: StefanoStabellini, Wei Liu, George Dunlap, Andrew Cooper,
	IanJackson, Tim Deegan, David Vrabel, Anthony Perard, xen-devel,
	dgdegra

>>> On 13.09.16 at 12:37, <george.dunlap@citrix.com> wrote:
> On 12/09/16 15:32, Jan Beulich wrote:
>>>>> On 09.09.16 at 17:16, <Jennifer.Herbert@citrix.com> wrote:
>>> The following code illustrates this idea:
>>>
>>> typedef struct dm_op_buffer {
>>>      XEN_GUEST_HANDLE(void) h;
>>>      size_t len;
>>> } dm_op_buffer_t;
>> 
>> This implies that we'll lose all type safety on the handles passed, as
>> is also emphasized by the use of raw_copy_from_guest() in the code
>> outline further down.
> 
> If most of the time the hypercalls are made by calling libxc functions,
> and the libxc functions have types as arguments, then the end caller has
> the same type safety.  We'll have to be careful inside the individual
> libxc functions, but that should be fairly straightforward to do.  So
> the places where we need to take extra care should be very localized.

My main fear isn't so much to get it wrong the first time round, but
to break things down the road by not adjusting types in all needed
places at once (because the compiler doesn't tell the programmer).

Jan



* Re: Device model operation hypercall (DMOP, re qemu depriv)
  2016-09-12 14:32     ` Jan Beulich
  2016-09-13 10:37       ` George Dunlap
@ 2016-09-13 16:07       ` David Vrabel
  2016-09-14  9:51         ` Jan Beulich
  1 sibling, 1 reply; 74+ messages in thread
From: David Vrabel @ 2016-09-13 16:07 UTC (permalink / raw)
  To: Jan Beulich, Jennifer Herbert
  Cc: StefanoStabellini, Wei Liu, George Dunlap, Andrew Cooper,
	Ian Jackson, Tim Deegan, David Vrabel, Anthony Perard, xen-devel,
	dgdegra

On 12/09/16 15:32, Jan Beulich wrote:
>>>> On 09.09.16 at 17:16, <Jennifer.Herbert@citrix.com> wrote:
>> The following code illustrates this idea:
>>
>> typedef struct dm_op_buffer {
>>      XEN_GUEST_HANDLE(void) h;
>>      size_t len;
>> } dm_op_buffer_t;
> 
> This implies that we'll lose all type safety on the handles passed, as
> is also emphasized by the use of raw_copy_from_guest() in the code
> outline further down.

This is a direct result of the requirement that the privcmd driver does
not know the types of the sub-ops themselves.  We can't have this
requirement and type safety.  Which do we want?

I would point out that Linux copy_from_user() and copy_to_user()
functions are not type safe and I'm not aware that this causes many
problems.

David


* Re: Device model operation hypercall (DMOP, re qemu depriv)
  2016-09-13 16:07       ` David Vrabel
@ 2016-09-14  9:51         ` Jan Beulich
  2016-09-21 11:21           ` Ian Jackson
  0 siblings, 1 reply; 74+ messages in thread
From: Jan Beulich @ 2016-09-14  9:51 UTC (permalink / raw)
  To: David Vrabel, Jennifer Herbert
  Cc: StefanoStabellini, Wei Liu, George Dunlap, Andrew Cooper,
	Ian Jackson, TimDeegan, Anthony Perard, xen-devel, dgdegra

>>> On 13.09.16 at 18:07, <david.vrabel@citrix.com> wrote:
> On 12/09/16 15:32, Jan Beulich wrote:
>>>>> On 09.09.16 at 17:16, <Jennifer.Herbert@citrix.com> wrote:
>>> The following code illustrates this idea:
>>>
>>> typedef struct dm_op_buffer {
>>>      XEN_GUEST_HANDLE(void) h;
>>>      size_t len;
>>> } dm_op_buffer_t;
>> 
>> This implies that we'll lose all type safety on the handles passed, as
>> is also emphasized by the use of raw_copy_from_guest() in the code
>> outline further down.
> 
> This is an direct result of the requirement that the privcmd driver does
> not know the types of the sub-ops themselves.  We can't have this
> requirement and type safety.  Which do we want?

Both, which the proposal utilizing side band information on VA
ranges allows for. (And in any event this to me clearly is an
aspect that would need to be mentioned in the disadvantages
section of the document.)

Jan



* Re: Device model operation hypercall (DMOP, re qemu depriv)
  2016-09-14  9:51         ` Jan Beulich
@ 2016-09-21 11:21           ` Ian Jackson
  2016-09-21 11:28             ` George Dunlap
  2016-09-21 11:55             ` Jan Beulich
  0 siblings, 2 replies; 74+ messages in thread
From: Ian Jackson @ 2016-09-21 11:21 UTC (permalink / raw)
  To: Jan Beulich
  Cc: StefanoStabellini, Wei Liu, George Dunlap, Andrew Cooper,
	Jennifer Herbert, TimDeegan, David Vrabel, Anthony Perard,
	xen-devel, dgdegra

Jan Beulich writes ("Re: [Xen-devel] Device model operation hypercall (DMOP, re qemu depriv)"):
> On 13.09.16 at 18:07, <david.vrabel@citrix.com> wrote:
> > This is an direct result of the requirement that the privcmd driver does
> > not know the types of the sub-ops themselves.  We can't have this
> > requirement and type safety.  Which do we want?
> 
> Both, which the proposal utilizing side band information on VA
> ranges allows for. (And in any event this to me clearly is an
> aspect that would need to be mentioned in the disadvantages
> section of the document.)

The side band information approach has the problem that it is easy to
accidentally write hypervisor code which misses the
additionally-required range check.

It might be possible to provide both strict type safety of the kind
you're concerned about here, _and_ protection against missing access
checks, but it's not very straightforward.

One approach to do that would be to augment the guest handle array
proposal with a typesafe macro system for accessing the guest handle
slots.

But I think George has a good argument as to why this is not needed:

  If most of the time the hypercalls are made by calling libxc functions,
  and the libxc functions have types as arguments, then the end caller has
  the same type safety.  We'll have to be careful inside the individual
  libxc functions, but that should be fairly straightforward to do.  So
  the places where we need to take extra care should be very localized.

We do not expect DMOP callers to make the hypercalls directly.  (They
can't sensibly do so because the DMOP ABI is not stable - they need
the assistance of the stability layer which we intend to provide in
libxc.)

So the actual hypercall call sites are all in-tree, in libxc.  If the
format of the struct used for one of these guest handle slots changes,
the same struct definition from the Xen headers is used both in the
hypervisor and in libxc, and any incompatibility will be detected.

It's true that changes to the semantics of these slots (for example, a
change of a slot from "array of struct X" to "one struct Y") will not
be detected by the compiler.

But *all* ABI changes to the DMOP interface need to be accompanied by
changes to libxc to add compatibility code.  I think it will be fairly
straightforward to check, during review, that each DMOP change comes
with a corresponding libxc change.


But if you do not agree, then how about hiding the slots with a macro
setup ?  Something like:

    struct device_model_op {
         uint32_t op;
         union {
              struct op_foo foo;
              struct op_bar bar;
              /* etc... */
         } u;
    };

    struct op_foo {
       some stuff;
    #define DMOP_GUESTHANDLES_foo(O,ONE,MANY) \
       ONE(O,  struct op_foo_big_block,           big_block) \
       MANY(O, struct op_foo_data_array_element,  data_array)
    };
    DMOP_DEFINE(foo)


Supported (preceded by) something like this (many toothpicks elided
for clarity and ease of editing):

    /* dmop typesafe slot machinery: */

    #define DMOP_DEFINE(opname) \
        enum {
            DMOP_GUESTHANDLES_##opname(DMOP__SLOT_INDEX,DMOP__SLOT_INDEX)
        };
        DMOP_GUESTHANDLES_##opname(DMOP__ACCESSORS_ONE,DMOP__ACCESSORS_MANY)

    #define DMOP__SLOT_INDEX(O,T,N) DMOP__SLOT_INDEX__##O##_##N,

    #define DMOP__ACCESSORS_ONE(O,T,N) \
        static inline int copy_dm_buffer_from_guest_##O##_##N(
            T *dst,
            const struct device_model_op *dmop /* accesses [nr_]buffers */)
        {
            return copy_dm_buffer_from_guest__raw
                ((void*)dst, sizeof(*dst), dmop,
                 DMOP__SLOT_INDEX__##O##_##N);
        }
        likewise copy_...to

    #define DMOP__ACCESSORS_MANY(O,T,N) \
        static inline size_t get_dm_buffer_array_size_##O##_##N ... {
            calculation divides buffer size by type size;
        }
        static inline long copy_dm_buffer_from_guest_##O##_##N(
            T *dst, size_t nr_dst,
            const struct device_model_op *dmop /* accesses [nr_]buffers */)
        {


Personally I don't think this is worth the effort.  But if you do I
would be happy to turn this into something which actually compiles
:-).

Thanks for your attention.

Ian.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: Device model operation hypercall (DMOP, re qemu depriv)
  2016-09-21 11:21           ` Ian Jackson
@ 2016-09-21 11:28             ` George Dunlap
  2016-09-21 11:58               ` Jan Beulich
  2016-09-21 11:55             ` Jan Beulich
  1 sibling, 1 reply; 74+ messages in thread
From: George Dunlap @ 2016-09-21 11:28 UTC (permalink / raw)
  To: Ian Jackson, Jan Beulich
  Cc: Stefano Stabellini, Wei Liu, George Dunlap, Andrew Cooper,
	Jennifer Herbert, Tim Deegan, David Vrabel, Anthony Perard,
	xen-devel, dgdegra

On 21/09/16 12:21, Ian Jackson wrote:
> Jan Beulich writes ("Re: [Xen-devel] Device model operation hypercall (DMOP, re qemu depriv)"):
>> On 13.09.16 at 18:07, <david.vrabel@citrix.com> wrote:
>>> This is a direct result of the requirement that the privcmd driver does
>>> not know the types of the sub-ops themselves.  We can't have this
>>> requirement and type safety.  Which do we want?
>>
>> Both, which the proposal utilizing side band information on VA
>> ranges allows for. (And in any event this to me clearly is an
>> aspect that would need to be mentioned in the disadvantages
>> section of the document.)
> 
> The side band information approach has the problem that it is easy to
> accidentally write hypervisor code which misses the
> additionally-required range check.
> 
> It might be possible to provide both strict type safety of the kind
> you're concerned about here, _and_ protection against missing access
> checks, but it's not very straightforward.
> 
> One approach to do that would be to augment the guest handle array
> proposal with a typesafe macro system for accessing the guest handle
> slots.
> 
> But I think George has a good argument as to why this is not needed:
> 
>   If most of the time the hypercalls are made by calling libxc functions,
>   and the libxc functions have types as arguments, then the end caller has
>   the same type safety.  We'll have to be careful inside the individual
>   libxc functions, but that should be fairly straightforward to do.  So
>   the places where we need to take extra care should be very localized.
> 
> We do not expect DMOP callers to make the hypercalls directly.  (They
> can't sensibly do so because the DMOP ABI is not stable - they need
> the assistance of the stability layer which we intend to provide in
> libxc.)
> 
> So the actual hypercall call sites are all in-tree, in libxc.  If the
> format of the struct used for one of these guest handle slots changes,
> the same struct definition from the Xen headers is used both in the
> hypervisor and in libxc, and any incompatibility will be detected.
> 
> It's true that changes to the semantics of these slots (for example, a
> change of a slot from "array of struct X" to "one struct Y") will not
> be detected by the compiler.
> 
> But *all* ABI changes to the DMOP interface need to be accompanied by
> changes to libxc to add compatibility code.  I think it will be fairly
> straightforward to check, during review, that each DMOP change comes
> with a corresponding libxc change.
> 
> 
> But if you do not agree, then how about hiding the slots with a macro
> setup ?  Something like:
> 
>     struct device_model_op {
>          uint32_t op;
>          union {
>               struct op_foo foo;
>               struct op_bar bar;
>               /* etc... */
>          } u;
>     };
> 
>     struct op_foo {
>        some stuff;
>     #define DMOP_GUESTHANDLES_foo(O,ONE,MANY) \
>        ONE(O,  struct op_foo_big_block,           big_block) \
>        MANY(O, struct op_foo_data_array_element,  data_array)
>     };
>     DMOP_DEFINE(foo)
> 
> 
> Supported (preceded by) something like this (many toothpicks elided
> for clarity and ease of editing):
> 
>     /* dmop typesafe slot machinery: */
> 
>     #define DMOP_DEFINE(opname) \
>         enum {
>             DMOP_GUESTHANDLES_##opname(DMOP__SLOT_INDEX,DMOP__SLOT_INDEX)
>         };
>         DMOP_GUESTHANDLES_##opname(DMOP__ACCESSORS_ONE,DMOP__ACCESSORS_MANY)
> 
>     #define DMOP__SLOT_INDEX(O,T,N) DMOP__SLOT_INDEX__##O##_##N,
> 
>     #define DMOP__ACCESSORS_ONE(O,T,N) \
>         static inline int copy_dm_buffer_from_guest_##O##_##N(
>             T *dst,
>             const struct device_model_op *dmop /* accesses [nr_]buffers */)
>         {
>             return copy_dm_buffer_from_guest__raw
>                 ((void*)dst, sizeof(*dst), dmop,
>                  DMOP__SLOT_INDEX__##O##_##N);
>         }
>         likewise copy_...to
> 
>     #define DMOP__ACCESSORS_MANY(O,T,N) \
>         static inline size_t get_dm_buffer_array_size_##O##_##N ... {
>             calculation divides buffer size by type size;
>         }
>         static inline long copy_dm_buffer_from_guest_##O##_##N(
>             T *dst, size_t nr_dst,
>             const struct device_model_op *dmop /* accesses [nr_]buffers */)
>         {
> 
> 
> Personally I don't think this is worth the effort.  But if you do I
> would be happy to turn this into something which actually compiles
> :-).
> 
> Thanks for your attention.

I think the other slight bit of information missing from Jan at least
(and perhaps also David) is the difference between their preference /
recommendation and their requirement.

That is, if David and Jan each strongly recommend their own preferred
method, but are willing to (perhaps grudgingly) give Acks to an
implementation of the other's method, then it's really down to the one
doing the implementation to read the advice and make their own decision
as best they can.

If someone feels strongly enough to Nack the other version, please say
so now; otherwise, I think Ian (since it seems like he'll be the one
implementing it) should make his best judgement based on the arguments
available and begin implementation.

 -George


* Re: Device model operation hypercall (DMOP, re qemu depriv)
  2016-09-21 11:21           ` Ian Jackson
  2016-09-21 11:28             ` George Dunlap
@ 2016-09-21 11:55             ` Jan Beulich
  2016-09-21 12:23               ` Device model operation hypercall (DMOP, re qemu depriv) [and 1 more messages] Ian Jackson
  1 sibling, 1 reply; 74+ messages in thread
From: Jan Beulich @ 2016-09-21 11:55 UTC (permalink / raw)
  To: Ian Jackson
  Cc: Stefano Stabellini, Wei Liu, George Dunlap, Andrew Cooper,
	Jennifer Herbert, Tim Deegan, David Vrabel, Anthony Perard,
	xen-devel, dgdegra

>>> On 21.09.16 at 13:21, <ian.jackson@eu.citrix.com> wrote:
> Jan Beulich writes ("Re: [Xen-devel] Device model operation hypercall (DMOP, 
> re qemu depriv)"):
>> On 13.09.16 at 18:07, <david.vrabel@citrix.com> wrote:
>> > This is a direct result of the requirement that the privcmd driver does
>> > not know the types of the sub-ops themselves.  We can't have this
>> > requirement and type safety.  Which do we want?
>> 
>> Both, which the proposal utilizing side band information on VA
>> ranges allows for. (And in any event this to me clearly is an
>> aspect that would need to be mentioned in the disadvantages
>> section of the document.)
> 
> The side band information approach has the problem that it is easy to
> accidentally write hypervisor code which misses the
> additionally-required range check.
> 
> It might be possible to provide both strict type safety of the kind
> you're concerned about here, _and_ protection against missing access
> checks, but it's not very straightforward.
> 
> One approach to do that would be to augment the guest handle array
> proposal with a typesafe macro system for accessing the guest handle
> slots.
> 
> But I think George has a good argument as to why this is not needed:
> 
>   If most of the time the hypercalls are made by calling libxc functions,
>   and the libxc functions have types as arguments, then the end caller has
>   the same type safety.  We'll have to be careful inside the individual
>   libxc functions, but that should be fairly straightforward to do.  So
>   the places where we need to take extra care should be very localized.
> 
> We do not expect DMOP callers to make the hypercalls directly.  (They
> can't sensibly do so because the DMOP ABI is not stable - they need
> the assistance of the stability layer which we intend to provide in
> libxc.)
> 
> So the actual hypercall call sites are all in-tree, in libxc.  If the
> format of the struct used for one of these guest handle slots changes,
> the same struct definition from the Xen headers is used both in the
> hypervisor and in libxc, and any incompatibility will be detected.

Wait, no. The guest handle slots in Jenny's proposal are basically
typeless:

typedef struct dm_op_buffer {
     XEN_GUEST_HANDLE(void) h;
     size_t len;
} dm_op_buffer_t;

Each sub-op implies a certain type for each handle slot it actually
uses.

And then I disagree with the general view taken here: The issue
I see is not with caller of the libxc interfaces (it was always clear
that it would be possible to have type safety at that layer), but
between libxc as the consumer and Xen as the producer of the
interface.

> It's true that changes to the semantics of these slots (for example, a
> change of a slot from "array of struct X" to "one struct Y") will not
> be detected by the compiler.

Well, the "one" vs "array of" part can't normally be taken care of
by compiler type checking anyway, except that for the "one" case
we wouldn't normally use a handle in the first place.

> But *all* ABI changes to the DMOP interface need to be accompanied by
> changes to libxc to add compatibility code.  I think it will be fairly
> straightforward to check, during review, that each DMOP change comes
> with a corresponding libxc change.

Well, yes, we can only hope for that. The main risk is that someone
doesn't update the other side at all, when a change to one side gets
done which compiles fine at both ends, and perhaps doesn't even
alter the public header.

> But if you do not agree, then how about hiding the slots with a macro
> setup ?  Something like:
>[...]

Well, that's exactly the kind of workaround I'd like to avoid.

Jan



* Re: Device model operation hypercall (DMOP, re qemu depriv)
  2016-09-21 11:28             ` George Dunlap
@ 2016-09-21 11:58               ` Jan Beulich
  0 siblings, 0 replies; 74+ messages in thread
From: Jan Beulich @ 2016-09-21 11:58 UTC (permalink / raw)
  To: George Dunlap
  Cc: Stefano Stabellini, Wei Liu, George Dunlap, Andrew Cooper,
	Jennifer Herbert, Tim Deegan, David Vrabel, Anthony Perard,
	xen-devel, dgdegra, Ian Jackson

>>> On 21.09.16 at 13:28, <george.dunlap@citrix.com> wrote:
> I think the other slight bit of information missing from Jan at least
> (and perhaps also David) is the difference between their preference /
> recommendation and their requirement.
> 
> That is, if David and Jan each strongly recommend their own preferred
> method, but are willing to (perhaps grudgingly) give Acks to an
> implementation of the other's method, then it's really down to the one
> doing the implementation to read the advice and make their own decision
> as best they can.
> 
> If someone feels strongly enough to Nack the other version, please say
> so now; otherwise, I think Ian (since it seems like he'll be the one
> implementing it) should make his best judgement based on the arguments
> available and begin implementation.

No, I don't think I'd nack David's / Jenny's version (after all the type
issue can be taken care of by sufficient discipline of both submitters
and reviewers), but I do think that it failed to point out this downside,
which imo moves the balance between the two proposals.

Jan



* Re: Device model operation hypercall (DMOP, re qemu depriv) [and 1 more messages]
  2016-09-21 11:55             ` Jan Beulich
@ 2016-09-21 12:23               ` Ian Jackson
  2016-09-21 12:48                 ` Jan Beulich
  0 siblings, 1 reply; 74+ messages in thread
From: Ian Jackson @ 2016-09-21 12:23 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Stefano Stabellini, Wei Liu, George Dunlap, Andrew Cooper,
	Jennifer Herbert, George Dunlap, Tim Deegan, David Vrabel,
	Anthony Perard, xen-devel, dgdegra

Jan Beulich writes ("Re: [Xen-devel] Device model operation hypercall (DMOP, re qemu depriv)"):
> Jan Beulich writes ("Re: [Xen-devel] Device model operation hypercall (DMOP, 
> > So the actual hypercall call sites are all in-tree, in libxc.  If the
> > format of the struct used for one of these guest handle slots changes,
> > the same struct definition from the Xen headers is used both in the
> > hypervisor and in libxc, and any incompatibility will be detected.
> 
> Wait, no. The guest handle slots in Jenny's proposal are basically
> typeless:
...
> Each sub-op implies a certain type for each handle slot it actually
> uses.

Yes.  But in practice each such slot will be accessed by
using copy_dm_buffer_from_guest (or whatever) to copy it to a struct.
It is intended that that same struct type will be used by libxc to
populate the buffer.

Now, it is true that the compiler cannot check that the _same type_ is
used in both places.  So, as I say:

   It's true that changes to the semantics of these slots (for example, a
   change of a slot from "array of struct X" to "one struct Y") will not
   be detected by the compiler.

But changes in the contents of the specific struct /will/ be spotted.

For example, consider the struct device_model_op, which in Jennifer's
document is in slot 0.  If someone edits struct device_model_op to
change its contents, both libxc and Xen see the same modified struct.

The problem can only occur if someone accesses a slot with entirely
the wrong struct type, or by the wrong slot number.  In such a
situation at least in new code, the thing probably wouldn't work at
all.

And even if we had some kind of typesafe system, the typesafety still
wouldn't spot uncontrolled ABI changes which would silently break old
callers.  So the need for careful review of DMOP changes is still
present.

So overall I think this typesafety problem is fairly limited.

> And then I disagree with the general view taken here: The issue
> I see is not with caller of the libxc interfaces (it was always clear
> that it would be possible to have type safety at that layer), but
> between libxc as the consumer and Xen as the producer of the
> interface.

I think you have misunderstood me.  I was discussing precisely
typesafety of that interface.


> > But *all* ABI changes to the DMOP interface need to be accompanied by
> > changes to libxc to add compatibility code.  I think it will be fairly
> > straightforward to check, during review, that each DMOP change comes
> > with a corresponding libxc change.
> 
> Well, yes, we can only hope for that. The main risk is that someone
> doesn't update the other side at all, when a change to one side gets
> done which compiles fine at both ends, and perhaps doesn't even
> alter the public header.

As I say I think if people are making uncontrolled changes to this
area, we have ABI stability problems not covered by the guest handle
typesafety system.  If we wanted to sort that out, we would probably
need to make a proper DSL for the ABI.


> > But if you do not agree, then how about hiding the slots with a macro
> > setup ?  Something like:
> >[...]
> 
> Well, that's exactly the kind of workaround I'd like to avoid.

Well, I think it's a lot of effort for not much gain.  But it's
certainly doable and if you think it worthwhile I'm prepared to try
it.


Jan Beulich writes ("Re: [Xen-devel] Device model operation hypercall (DMOP, re qemu depriv)"):
> On 21.09.16 at 13:28, <george.dunlap@citrix.com> wrote:
> > If someone feels strongly enough to Nack the other version, please say
> > so now; otherwise, I think Ian (since it seems like he'll be the one
> > implementing it) should make his best judgement based on the arguments
> > available and begin implementation.
> 
> No, I don't think I'd nack David's / Jenny's version (after all the type
> issue can be taken care of by sufficient discipline of both submitters
> and reviewers),

Thanks.

Personally I think David/Jennifer's version is better than the
previous proposal based on implicit or separate address limits.


>  but I do think that it failed to point out this downside,
> which imo moves the balance between the two proposals.

I'm afraid that I want to complain about this part of your approach to
the thread, which I think is unduly harsh.

It is of course fair to point out a potential downside of the proposal,
which wasn't clearly discussed in the discussion document.  And it is
sensible for us all to consider that potential downside along with the
others - as indeed I do above.

But I don't think it is really fair to criticise *the discussion
document* (which is what Jennifer's email was) on the grounds that it
omitted to discuss a potential downside which its authors felt was
minor[1].  The purpose of a discussion document is to put forward a
proposal and/or summarise previous discussion.  It is not required to
be either neutral or, indeed, comprehensive - and even so, I found
this one quite comprehensive.

Personally I think Jennifer's email was an excellent contribution [2]
and I would be inclined to use it as an example if anyone were to ask
in future what a constructive design discussion document should look
like.

If there were a plan to *commit* the discussion document to xen.git or
formally publish it somewhere, then you might say "please also discuss
this question".  Then the discouragement implied by criticism would be
worth it, to improve the document.

But since the pros/cons section of this document has now served its
purpose there is only harm in criticising it in this way.  (And as I
say I think the criticism is unfounded.)

In summary: writing up a good summary of the discussion is a very
virtuous thing to do, which we in the Xen community should do more of.

The goal of encouraging people to do that is undermined by complaining
about the contents of such a writeup, rather than by constructively
addressing the actual substantive issues at stake.

Thanks,
Ian.


[1] Full disclosure: Jennifer sent me a preview of the email
beforehand, and I didn't think the typesafety issue was significant
enough to suggest she should mention it.

[2] As indeed I said to Jennifer in private email before she posted
it.


* Re: Device model operation hypercall (DMOP, re qemu depriv) [and 1 more messages]
  2016-09-21 12:23               ` Device model operation hypercall (DMOP, re qemu depriv) [and 1 more messages] Ian Jackson
@ 2016-09-21 12:48                 ` Jan Beulich
  2016-09-21 13:24                   ` Ian Jackson
  0 siblings, 1 reply; 74+ messages in thread
From: Jan Beulich @ 2016-09-21 12:48 UTC (permalink / raw)
  To: Ian Jackson
  Cc: Stefano Stabellini, Wei Liu, George Dunlap, Andrew Cooper,
	Jennifer Herbert, George Dunlap, Tim Deegan, David Vrabel,
	Anthony Perard, xen-devel, dgdegra

>>> On 21.09.16 at 14:23, <ian.jackson@eu.citrix.com> wrote:
> Jan Beulich writes ("Re: [Xen-devel] Device model operation hypercall (DMOP, 
> re qemu depriv)"):
>> Jan Beulich writes ("Re: [Xen-devel] Device model operation hypercall (DMOP, 
>> > So the actual hypercall call sites are all in-tree, in libxc.  If the
>> > format of the struct used for one of these guest handle slots changes,
>> > the same struct definition from the Xen headers is used both in the
>> > hypervisor and in libxc, and any incompatibility will be detected.
>> 
>> Wait, no. The guest handle slots in Jenny's proposal are basically
>> typeless:
> ...
>> Each sub-op implies a certain type for each handle slot it actually
>> uses.
> 
> Yes.  But in practice each such slot will be accessed by
> using copy_dm_buffer_from_guest (or whatever) to copy it to a struct.
> It is intended that that same struct type will be used by libxc to
> populate the buffer.
> 
> Now, it is true that the compiler cannot check that the _same type_ is
> used in both places.  So, as I say:
> 
>    It's true that changes to the semantics of these slots (for example, a
>    change of a slot from "array of struct X" to "one struct Y") will not
>    be detected by the compiler.
> 
> But changes in the contents of the specific struct /will/ be spotted.

As long as it is a structure, yes. What about someone changing
uint64_t to xen_pfn_t?

>>  but I do think that it failed to point out this downside,
>> which imo moves the balance between the two proposals.
> 
> I'm afraid that I want to complain about this part of your approach to
> the thread, which I think is unduly harsh.
> 
> It is of course fair to point out a potential downside of the proposal,
> which wasn't clearly discussed in the discussion document.  And it is
> sensible for us all to consider that potential downside along with the
> others - as indeed I do above.
> 
> But I don't think it is really fair to criticise *the discussion
> document* (which is what Jennifer's email was) on the grounds that it
> omitted to discuss a potential downside which its authors felt was
> minor[1].

Hmm, then I'm sorry if this came over the wrong way. I didn't
mean to criticize the document as a whole, or how it was written.

>  The purpose of a discussion document is to put forward a
> proposal and/or summarise previous discussion.  It is not required to
> be either neutral or, indeed, comprehensive - and even so, I found
> this one quite comprehensive.

Nevertheless I think such a document should be as honest as
possible about the downsides of the authors' preferred approach,
when there are several to choose from. If some aspect is deemed minor,
I don't think it should be omitted (as then readers won't know
whether the aspect was considered at all), but mentioned and
stated to be believed to be minor.

Jan



* Re: Device model operation hypercall (DMOP, re qemu depriv) [and 1 more messages]
  2016-09-21 12:48                 ` Jan Beulich
@ 2016-09-21 13:24                   ` Ian Jackson
  2016-09-21 13:56                     ` Jan Beulich
  0 siblings, 1 reply; 74+ messages in thread
From: Ian Jackson @ 2016-09-21 13:24 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Stefano Stabellini, Wei Liu, George Dunlap, Andrew Cooper,
	Jennifer Herbert, George Dunlap, Tim Deegan, David Vrabel,
	Anthony Perard, xen-devel, dgdegra

Jan Beulich writes ("Re: [Xen-devel] Device model operation hypercall (DMOP, re qemu depriv) [and 1 more messages]"):
> On 21.09.16 at 14:23, <ian.jackson@eu.citrix.com> wrote:
> > But changes in the contents of the specific struct /will/ be spotted.
> 
> As long as it is a structure, yes. What about someone changing
> uint64_t to xen_pfn_t?

You mean if someone changes one of the buffers from "array of
uint64_t" to "array of xen_pfn_t", so that the ABI changes on some but
not all architectures ?

We could declare a rule that a dmop buffer may contain only a struct
or type name specific to that dmop (or an array of such), and must
always be accessed through a copy to or from such a type.

The rule would mean that if we had a dmop which wanted to call, using
one of the buffers, a function like this:

     int some_function(uint64_t *pfns, size_t n_pfns);

You'd have to write the dmop definition like this:

     typedef uint64_t xen_dmop_foo_entry;

     xen_dmop_foo_entry *pfns;
     size_t n_pfns;

     ...allocate table somehow...
     ret = copy_dm_buffer_from_guest(pfns,
                                     sizeof(pfns[0])*n_pfns,
                                     buffers, nr_buffers, 1);
     ...
     ret = some_function(pfns, n_pfns);

And similar code, again using xen_dmop_foo_entry, on the calling side.
Then if the buffer entry type is changed to xen_pfn_t you change it
to:

     typedef xen_pfn_t xen_dmop_foo_entry;

Now, unless the rest of the code is changed too, the compiler will
warn that a xen_pfn_t* cannot be converted to a uint64_t* in the call
to some_function.  (And likewise at the hypercall site in libxc.)

Would this be enough of an improvement, for you, to be worth the
additional type name clutter etc. ?


> > But I don't think it is really fair to criticise *the discussion
> > document* (which is what Jennifer's email was) on the grounds that it
> > omitted to discuss a potential downside which its authors felt was
> > minor[1].
> 
> Hmm, then I'm sorry if this came over the wrong way. I didn't
> mean to criticize the document as a whole, or how it was written.

Thanks.

> >  The purpose of a discussion document is to put forward a
> > proposal and/or summarise previous discussion.  It is not required to
> > be either neutral or, indeed, comprehensive - and even so, I found
> > this one quite comprehensive.
> 
> Nevertheless I think such a document should be as honest as
> possible about the downsides of the authors' preferred approach,
> when there are several to choose from.

I agree with that, although the word "honest" is rather tendentious.

>  If some aspect is deemed minor,
> I don't think it should be omitted (as then readers won't know
> whether the aspect was considered at all), but mentioned and
> stated to be believed to be minor.

I think there is a balance to be struck between mentioning every
possible consideration, however minor and remote, and making the
document concise and readable.

It will inevitably sometimes occur (as it did here) that some issue
that one person thought not worth mentioning, seems quite important to
another person.

Ian.


* Re: Device model operation hypercall (DMOP, re qemu depriv) [and 1 more messages]
  2016-09-21 13:24                   ` Ian Jackson
@ 2016-09-21 13:56                     ` Jan Beulich
  2016-09-21 15:06                       ` Ian Jackson
  2016-09-21 17:09                       ` George Dunlap
  0 siblings, 2 replies; 74+ messages in thread
From: Jan Beulich @ 2016-09-21 13:56 UTC (permalink / raw)
  To: Ian Jackson
  Cc: Stefano Stabellini, Wei Liu, George Dunlap, Andrew Cooper,
	Jennifer Herbert, George Dunlap, Tim Deegan, David Vrabel,
	Anthony Perard, xen-devel, dgdegra

>>> On 21.09.16 at 15:24, <ian.jackson@eu.citrix.com> wrote:
> Jan Beulich writes ("Re: [Xen-devel] Device model operation hypercall (DMOP, 
> re qemu depriv) [and 1 more messages]"):
>> On 21.09.16 at 14:23, <ian.jackson@eu.citrix.com> wrote:
>> > But changes in the contents of the specific struct /will/ be spotted.
>> 
>> As long as it is a structure, yes. What about someone changing
>> uint64_t to xen_pfn_t?
> 
> You mean if someone changes one of the buffers from "array of
> uint64_t" to "array of xen_pfn_t", so that the ABI changes on some but
> not all architectures ?
> 
> We could declare a rule that a dmop buffer may contain only a struct
> or type name specific to that dmop (or an array of such), and must
> always be accessed through a copy to or from such a type.
> 
> The rule would mean that if we had a dmop which wanted to call, using
> one of the buffers, a function like this:
> 
>      int some_function(uint64_t *pfns, size_t n_pfns);
> 
> You'd have to write the dmop definition like this:
> 
>      typedef uint64_t xen_dmop_foo_entry;
> 
>      xen_dmop_foo_entry *pfns;
>      size_t n_pfns;
> 
>      ...allocate table somehow...
>      ret = copy_dm_buffer_from_guest(pfns,
>                                      sizeof(pfns[0])*n_pfns,
>                                      buffers, nr_buffers, 1);
>      ...
>      ret = some_function(pfns, n_pfns);
> 
> And similar code, again using xen_dmop_foo_entry, on the calling side.
> Then if the buffer entry type is changed to xen_pfn_t you change it
> to:
> 
>      typedef xen_pfn_t xen_dmop_foo_entry;
> 
> Now, unless the rest of the code is changed too, the compiler will
> warn that a xen_pfn_t* cannot be converted to a uint64_t* in the call
> to some_function.  (And likewise at the hypercall site in libxc.)
> 
> Would this be enough of an improvement, for you, to be worth the
> additional type name clutter etc. ?

Again this looks like much clutter with little benefit to me, i.e. I'd
then rather go with the unmodified original proposal. That's largely
because nothing really forces anyone to use the (disconnected)
xen_dmop_foo_entry type. I.e. it continues to be a matter of the
programmer's and the reviewers' attention/discipline.

Jan



* Re: Device model operation hypercall (DMOP, re qemu depriv) [and 1 more messages]
  2016-09-21 13:56                     ` Jan Beulich
@ 2016-09-21 15:06                       ` Ian Jackson
  2016-09-21 17:09                       ` George Dunlap
  1 sibling, 0 replies; 74+ messages in thread
From: Ian Jackson @ 2016-09-21 15:06 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Stefano Stabellini, Wei Liu, George Dunlap, Andrew Cooper,
	Jennifer Herbert, George Dunlap, Tim Deegan, David Vrabel,
	Anthony Perard, xen-devel, dgdegra

Jan Beulich writes ("Re: [Xen-devel] Device model operation hypercall (DMOP, re qemu depriv) [and 1 more messages]"):
> On 21.09.16 at 15:24, <ian.jackson@eu.citrix.com> wrote:
> > Would this be enough of an improvement, for you, to be worth the
> > additional type name clutter etc. ?
> 
> Again this looks like much clutter with little benefit to me, i.e. I'd
> then rather go with the unmodified original proposal. That's largely
> because nothing really enforces anyone to use the (disconnected)
> xen_dmop_foo_entry type. I.e. it continues to be a matter of the
> programmer's and the reviewers' attention/discipline.

Fair enough.

Thanks for your opinions.

Regards,
Ian.


* Re: Device model operation hypercall (DMOP, re qemu depriv) [and 1 more messages]
  2016-09-21 13:56                     ` Jan Beulich
  2016-09-21 15:06                       ` Ian Jackson
@ 2016-09-21 17:09                       ` George Dunlap
  2016-09-22  8:47                         ` Jan Beulich
  1 sibling, 1 reply; 74+ messages in thread
From: George Dunlap @ 2016-09-21 17:09 UTC (permalink / raw)
  To: Jan Beulich
  Cc: StefanoStabellini, Wei Liu, Ian Jackson, TimDeegan,
	Andrew Cooper, David Vrabel, Jennifer Herbert, Anthony Perard,
	xen-devel, Daniel De Graaf

On Wed, Sep 21, 2016 at 2:56 PM, Jan Beulich <JBeulich@suse.com> wrote:
> Again this looks like much clutter with little benefit to me, i.e. I'd
> then rather go with the unmodified original proposal. That's largely
> because nothing really enforces anyone to use the (disconnected)
> xen_dmop_foo_entry type. I.e. it continues to be a matter of the
> programmer's and the reviewers' attention/discipline.

But is "Must use hypercall dmop, subop foo with struct dmop_foo" any
different than "Must use hypercall bar with struct bar"?

In theory there could be a mismatch between the struct libxc expected
to use for a whole hypercall with the struct Xen expected to find for
an entire hypercall. But in practice that never happens because we set
up the call with the appropriate struct once and then never change it.
(That is, we may change the struct elements, but not the name.)  This
seems to me like a fairly similar case.

Note that I'm not arguing for the extra "clutter" per se, I just think
that it will be pretty effective in mitigating the type safety issue.

 -George


* Re: Device model operation hypercall (DMOP, re qemu depriv) [and 1 more messages]
  2016-09-21 17:09                       ` George Dunlap
@ 2016-09-22  8:47                         ` Jan Beulich
  0 siblings, 0 replies; 74+ messages in thread
From: Jan Beulich @ 2016-09-22  8:47 UTC (permalink / raw)
  To: George Dunlap
  Cc: StefanoStabellini, Wei Liu, Andrew Cooper, Ian Jackson,
	TimDeegan, David Vrabel, Jennifer Herbert, Anthony Perard,
	xen-devel, Daniel De Graaf

>>> On 21.09.16 at 19:09, <George.Dunlap@eu.citrix.com> wrote:
> On Wed, Sep 21, 2016 at 2:56 PM, Jan Beulich <JBeulich@suse.com> wrote:
>> Again this looks like much clutter with little benefit to me, i.e. I'd
>> then rather go with the unmodified original proposal. That's largely
>> because nothing really enforces anyone to use the (disconnected)
>> xen_dmop_foo_entry type. I.e. it continues to be a matter of the
>> programmer's and the reviewers' attention/discipline.
> 
> But is "Must use hypercall dmop, subop foo with struct dmop_foo" any
> different than "Must use hypercall bar with struct bar"?
> 
> In theory there could be a mismatch between the struct libxc expected
> to use for a whole hypercall with the struct Xen expected to find for
> an entire hypercall. But in practice that never happens because we set
> up the call with the appropriate struct once and then never change it.

Yes.

> (That is, we may change the struct elements, but not the name.)  This
> seems to me like a fairly similar case.

I don't think so - what I'm talking about here is the equivalent of
"we may change the struct elements". Such changes would go
unnoticed by the compiler if the respective pseudo-struct-element
is a handle.

But anyway - I think we've beaten this horse to death, and I'm the
only one worried about type issues here. Therefore I think we
should just stop this discussion and go with the proposed model.

Jan



* QEMU XenServer/XenProject Working group meeting 29th September 2016
  2016-09-09 16:18 ` XenProject/XenServer QEMU working group minutes, 30th August 2016 Jennifer Herbert
  2016-09-12  7:16   ` Juergen Gross
@ 2016-10-14 18:01   ` Jennifer Herbert
  2016-10-18 19:54     ` Stefano Stabellini
  2017-02-28 18:18     ` QEMU XenServer/XenProject Working group meeting 10th February 2017 Jennifer Herbert
  1 sibling, 2 replies; 74+ messages in thread
From: Jennifer Herbert @ 2016-10-14 18:01 UTC (permalink / raw)
  To: xen-devel

QEMU XenServer/XenProject Working group meeting 29th September 2016
===================================================================

Attendees
---------

David Vrabel
Jennifer Herbert
Ian Jackson
Andrew Cooper
Paul Durrant
Lars Kurth

QEMU depriv
===========

DMOP
----

There has been agreement on-list on the DMOP proposal.  The
previously proposed HVMCTL patch series should need only mechanical
changes to serve as a basis for DMOP.

Privcmd
-------

The privcmd changes should be fairly trivial to implement.  Libxc
would need changing, but this code is also in the HVMCTL patch
series.  This means the only thing QEMU needs to do is call the
restrict ioctl to enable it.  If the restrict ioctl is missing, an
error would be returned.  QEMU would probably want an option to
indicate that de-privilege is required; given this option, QEMU
would raise an error if the restrict ioctl were not present.

In order to avoid accidents due to ABI instability, old DMOP numbers
would be retired when a DMOP is changed in an ABI-incompatible way -
there is no shortage of new DMOP numbers.

Eventchan
---------

Event channel restriction support is in 4.7, but the libxc parts
still need to be done.  This should not be much work.

XenStore
--------

For the non-pv part of QEMU, XenStore is only used in two places.
There is the DM state, and the physmap mechanism.  Although there is a
vague plan for replacing the physmap mechanism, it is some way off.

The DM state key is used for knowing when the QEMU process is
running, etcetera.  QMP would seem to be an option to replace it -
however there is no (nice) way to wait on a socket until it has been
opened.  One solution might be to use Xenstore to signal when the
QMP sockets are available, before QEMU drops privileges, and then
QMP could be used to know QEMU is in the running state.
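
[Editor's note: the ordering constraint in the proposal above can be sketched as follows.  Every function name here is a hypothetical stand-in, not a real QEMU or Xen API; the only substance is the ordering: the QMP socket is created and advertised via Xenstore while still privileged, privileges are dropped, and only then is QMP relied on for the "running" state.]

```c
#include <assert.h>
#include <string.h>

/* Record the order of startup steps so the constraint is explicit. */
static const char *steps[8];
static int n_steps;

static void record(const char *s) { steps[n_steps++] = s; }

/* Hypothetical stand-ins for the real startup actions. */
static void open_qmp_socket(void)       { record("open-qmp"); }
static void advertise_via_xenstore(void){ record("xenstore-advertise"); }
static void drop_privileges(void)       { record("drop-priv"); }
static void qmp_report_running(void)    { record("qmp-running"); }

static void qemu_startup(void)
{
    open_qmp_socket();          /* create the socket first... */
    advertise_via_xenstore();   /* ...tell the toolstack where it is */
    drop_privileges();          /* only now give up privileges */
    qmp_report_running();       /* toolstack learns "running" via QMP */
}
```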

To avoid the need to use xs-restrict, you would need to both replace
physmap and rework the QEMU startup procedure.  Using xs-restrict
would be more expedient, and does not look to need much work.

There was discussion over how secure it would be to allow a guest
access to these Xenstore keys - it was concluded that a guest could
mostly only mess itself up.  If a guest attempted to prevent itself
from being migrated, the toolstack would time it out, and could kill
it.

There followed a discussion on the Xenbus protocol and the additions
needed.  The aim is merely to restrict the permissions for a command
to those of the guest whose domID you provide.  It was proposed that
the existing 16-byte header be used as-is, with a new command
'one-time-restrict', and that the payload would have two additional
fields at the start.  These two fields would correspond to the domid
to restrict as, and the real command.  The transaction ID and tags
would be taken from the real header.
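
[Editor's note: the framing proposed above might look something like this sketch.  The `xsd_sockmsg` header matches the existing xenstore wire format; the command number, the `restrict_prefix` layout, and the field widths are assumptions made for illustration, not the actual protocol.]

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Existing 16-byte xenstore wire header, as in xen/io/xs_wire.h. */
struct xsd_sockmsg {
    uint32_t type;   /* would carry the new one-time-restrict command */
    uint32_t req_id;
    uint32_t tx_id;  /* transaction ID taken from the real header */
    uint32_t len;    /* length of the payload that follows */
};

/* Hypothetical payload prefix for 'one-time-restrict': two extra
 * fields, then the real command's own payload. */
struct restrict_prefix {
    uint16_t domid;     /* domid to restrict as (assumed width) */
    uint32_t real_type; /* the real command being wrapped */
};

/* Build a wrapped message: header, prefix, then the inner payload. */
static size_t wrap_restricted(uint8_t *out, uint16_t domid,
                              uint32_t real_type, uint32_t tx_id,
                              const void *payload, uint32_t payload_len)
{
    struct xsd_sockmsg hdr = {
        .type = 128,  /* hypothetical one-time-restrict number */
        .req_id = 0,
        .tx_id = tx_id,
        .len = (uint32_t)sizeof(struct restrict_prefix) + payload_len,
    };
    struct restrict_prefix pre = { .domid = domid, .real_type = real_type };

    memcpy(out, &hdr, sizeof hdr);
    memcpy(out + sizeof hdr, &pre, sizeof pre);
    memcpy(out + sizeof hdr + sizeof pre, payload, payload_len);
    return sizeof hdr + hdr.len;
}
```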

Although inter-domain xs-restrict is not specifically needed for this
project, it is thought it might be a blocking item for upstream
acceptance.  It is thought these changes would not require much work
to implement, and may be useful in other use cases.  Only a few
changes to QEMU would be needed, and libxl should be able to track
QEMU versions.  Ian Jackson volunteered to look at this, with David
helping with the kernel bits.  Ian won't have time to look at this
until after Xen 4.8 is released.

There was discussion about what may fail once privileges are taken
away, which would include CDs and PCI pass-through.  It is thought
the full list can only be known by trying.  Not everything needs to
work for acceptance upstream, such as PCI pass-through.  If such an
incompatible feature is needed, restrictions can be turned off.
These problems can be fixed in a later phase, with CDs likely being
at the top of the list.


Action items
------------

The hypervisor bits are really needed first, but can't be done until
4.8 has been released.

Ian is to look at the Xenstore items; David is to look at the kernel
items.  Paul is to audit the HVM ops, checking parameters etc.

It is too late to get this into 4.8, but it is desired to get it in
early in 4.9 so that there can be a period of stabilisation.  With
the release of 4.8 imminent, little work will happen until after
that.  However, Paul, David and Ian are asked to think about their
respective areas, and have a plan for when they can be done.  They
are welcome to start tackling them if they have time.



Disaggregation
==============

A disaggregation proposal which had previously been posted to a QEMU
forum was discussed.  It was not previously accepted by all.  The big
question was how to separate the device models from the machine, with
a particular point of contention being around PIIX and the idea of
starting a QEMU instance without one.  The general desire from us is
to have a specific device emulated and nothing else.  It is suggested
you would have a software interface between each device that looked
like a software version of PCI.  The PIIX device could be attached to
the CPU via this pseudo-PCI interface.  This would fit in well with
how the IOREQ server and IOMMU work.  Although this sounds like a
large architectural change, it is suggested that actually we're just
asking them to take a different stability and pluggability posture on
the interfaces they already have.

This architectural issue is the cause behind lots of little
annoyances, which have been going on for years.  Xen is having to
make up lots of strange stuff to keep QEMU happy, and there is
confusion over memory ownership.  Fixing the architecture should make
our lives much easier.  These architectural issues are also making
things difficult for Intel, who are trying to work around them with
Xen changes, which may just worsen the problem.  This means the issue
is effectively blocking them.

It is proposed that instead of having a QEMU binary, what is really
wanted is a QEMU library.  With a library you could easily take the
bits needed, create your own main loop, and link them to whatever
interface, IOREQ service or IPC mechanism is needed.  There would no
longer be a need for the IOREQ server to be in QEMU, which it is
thought should be an attractive idea for the QEMU maintainers.  It is
also thought that other projects, such as the Clear Containers
people, would benefit from such an architecture.  The idea of
splitting out the CPU code from the device code may even be
attractive to KVM.

The code in the Xen tools directory would be a small event loop
(probably using glib): a thing that reads ioreqs off a ring, and a
thing that speaks Xenstore.  There would be a bunch of initialisation
calls that call into libqemu and initialise the various devices, with
device structures for them indicating where they should be mapped and
so forth.  There would be no IDE code in our tree, and no IOREQ
server in the QEMU tree.
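
[Editor's note: the shape of that small event loop can be sketched in miniature.  None of these names are real Xen or libqemu APIs - `fake_ioreq` is only loosely modelled on Xen's ioreq structure, and the "device" is a single toy register - but it shows the division of labour described above: the loop drains requests from a ring and dispatches each one to a device model.]

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified ioreq: an address, a data value, and a direction. */
struct fake_ioreq {
    uint64_t addr;
    uint64_t data;
    int is_write;
};

/* A toy "device model": one emulated register. */
static uint64_t dev_reg;

static void device_handle(struct fake_ioreq *req)
{
    if (req->is_write)
        dev_reg = req->data;   /* guest write lands in the device */
    else
        req->data = dev_reg;   /* guest read returns device state */
}

/* The "small event loop": drain pending requests from a ring and
 * dispatch each to the device model, as libqemu-linked code might. */
static void event_loop(struct fake_ioreq *ring, size_t n)
{
    for (size_t i = 0; i < n; i++)
        device_handle(&ring[i]);
}
```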

The QEMU maintainers should be in favour of removing Xen-specific
code from QEMU, and it is also thought that you could demonstrate how
to use this to make disaggregated device models for KVM's case.  It
is further postulated that there may be many people out there with
dev boards, experiments with FPGAs and strange PCI stuff, who don't
want to wrestle with QEMU's PCI emulator.  With libqemu, it might
take just 50 lines of code for a random developer to plug some
hardware together and make a simulator.

There was discussion on whether a halfway solution might be easier.
However it was concluded that such a solution would likely only
benefit Xen as a quick fix, not as much as the full libqemu idea, and
so would not look that appealing from QEMU's perspective - while the
full libqemu idea would benefit many more people, allowing an
explosion of potential QEMU use cases and users.  More people using a
project should mean more contributors.

It was concluded that this was largely a political issue, and that we
need to find out what the objections really are.  If we were to
convince everyone of the benefit, then we'd probably need to step up
and do much of the work - however, this is still likely to be less
work than maintaining the current set-up.  There was further
discussion on who our allies might be, and the approach we should
take to persuade people.  It was stressed that we need to sell the
benefits of this system, i.e. "releasing its full potential".

The alternative to the politics may be simply to fork the project
again - the previous fork lasted a decade.  However it would be much
better to cooperate, and so we must try.

Action item
-----------

Ian to reach out to Peter Maydell to discuss the issue, and to
consider writing down a new proposal.
(Update: probably too early for a write-up.)




* Re: QEMU XenServer/XenProject Working group meeting 29th September 2016
  2016-10-14 18:01   ` QEMU XenServer/XenProject Working group meeting 29th September 2016 Jennifer Herbert
@ 2016-10-18 19:54     ` Stefano Stabellini
  2016-10-20 17:37       ` Lars Kurth
  2017-02-28 18:18     ` QEMU XenServer/XenProject Working group meeting 10th February 2017 Jennifer Herbert
  1 sibling, 1 reply; 74+ messages in thread
From: Stefano Stabellini @ 2016-10-18 19:54 UTC (permalink / raw)
  To: Jennifer Herbert; +Cc: xen-devel

I think this kind of calls should be announced on xen-devel before they
happen, to give a chance to other people to participate (I cannot
promise I would have participated but it is the principle that counts).

If I missed the announcement, I apologize.


On Fri, 14 Oct 2016, Jennifer Herbert wrote:
> XenStore
> --------
> 
> For the non-pv part of QEMU, XenStore is only used in two places.
> There is the DM state, and the physmap mechanism.  Although there is a
> vague plan for replacing the physmap mechanism, it is some way off.
> 
> The DM state key is used for knowing when the qemu process is running
> etcetera, QMP would seem to be an option to replace it - however there
> is no (nice) way to wait on a socket until it has been opened.  One
> solution might be to use Xenstore to let you know the QMP sockets
> where available, before QEMU drops privileges,  and then QMP could be
> used to know QEMU is in the running state.
> 
> To avoid the need to use xs-restrict, you would need to both replace
> physmap and rework qemu startup procedure. The use of xs-restrict would
> be more expedient, and does not look to need that much work.
> 
> Discussion was had over how secure it would be to allow a guest access
> to these Xenstore keys - it was concluded that a guest could mostly
> only mess itself up.  If I guest attempted to prevent itself from being
> migrated, the tool stack time it out, and could kill it.
> 
> There followed a discussion on the Xenbus protocol, and additions
> needed.  The aim is to merely restrict the permission for the command,
> to that of the guest who's domID you provide.  It was proposed that
> it uses the header as is, with its  16 bytes, with the command
> 'one-time-restrict' , and then the payload would have two additional
> field at the start.  These two field would correspond to the domid to
> restrict as, and the real command. Transaction ID and tags would be
> taken from the real header.
> 
> Although inter domain xs-restrict is not specifically needed for this
> project, it is thought it might be a blocking items for upstream
> acceptance.  It it thoughts these changes would not require that much
> work to implement, and may be useful in use use cases. Only a few
> changes to QEMU would be needed, and libxl should be able to track
> QEMU versions.  Ian Jackson volunteered to look at this, with David
> helping  with the kernel bits.  Ian won't have time to look at this
> until after Xen 4.8 is released.
> 
> There discussion about what may fail once privileges are taken away,
> which would include CDs and PCI pass though.  It is thought the full
> list can only be known by trying.  Not everything needs to work for
> acceptance upstream, such as PCI pass though.   If such an
> incompatible feature is needed, restrictions can be turned off.  These
> problems can be fixed in a later phase, with CDs likely being at teh
> top of the list.

One thing to note is that xs-restrict is unimplemented in cxenstored.

 
> disaggregation
> =============
> 
> A disaggregation proposal which had previously been posted to a QEMU
> forum was discussed.  It was not previously accepted by all. The big
> question was how to separate the device models from the machine, with
> a particular point of contention being around PIIX and the idea of
> starting a QEMU instance without one.

Right. In particular I tend to agree with the other QEMU maintainers
when they say: why ask for a PIIX3 compatible machine, when actually you
don't want to be PIIX3 compatible?


> The general desire from us is
> we want to have a specific device emulated and nothing else.

This is really not possible with QEMU, because QEMU is a machine
emulator, not a device emulator. BTW who wants this? I mean, why is this
part of the QEMU depriv discussion? It is not necessary. I think what we
want for QEMU depriv is to be able to build a QEMU PV machine with just
the PV backends in it, which is attainable with the current
architecture. I know there are use cases for having an emulator of just
one device, but I don't think they should be confused with the more
important underlying issue here, which is QEMU running with full
privileges.


> It is
> suggested you would have a software interface between each device that
> looked a software version of PCI.  The PIIX device could be attached to
> CPU this pseudo PCI interface.  This would fit in well with how IOREQ
> server and IOMMU works.  Although this sounds like a large
> architectural change is wanted, its suggested that actually its just
> that we're asking them to take a different stability and plug-ability
> posture on the interfaces they already have.
> 
> This architectural issue is the cause behind lots of little
> annoyances, which have been going on for years. Xen is having to make
> up lots of strange stuff to keep QEMU happy, and there is confusion
> over memory ownership.  Fixing the architecture  should make our lives
> much easier.  These architectural issues are also making things
> difficult for Intel, who are trying to work around the issue with Xen
> changes, which may just worsen the problem.  This means this is
> effectively blocking them.
> 
> It is proposed that instead of having a QEMU binary, what is really
> wanted is a QEMU library.  With a library you could easily take the
> bits needed, create your own main loop and link them to whatever
> interface, IOREQ services or IPC mechanism is needed. There would be
> no longer be a need for the IOREQ server to be in QEMU, which is
> thought should be an attractive idea for the QEMU maintainers.  It is
> also thought that other projects, such as the clear containers people
> would also benefit from such an architecture.  The idea of spiltting
> out the CPU code from the device code may even be attractive to KVM.

The idea of having a QEMU library has always been resisted upstream. It
takes the project in a very different direction. As QEMU maintainer I
don't know if such a thing would actually be good for the QEMU
community.

As I wrote, I would like to see a PV machine which can be built with
just the PV backends. This can be done without libqemu. We might also be
able to emulate a PIIX3 compatible machine without any CPUs, also
without libqemu.


> The code in the Xen tools directory, would be a small event loop,
> using glib probably, thing that reads ioreq off a ring, and a
> thing that speaks Xenstore. There would be a bunch of initialisation
> calls, that calls into libqemu and initialise the various devices,with
> device structures for them, indicating where they should be mapped and
> so forth.   There would be no IDE code in our tree, and no ioreq
> server in the QEMU tree.
>
> The QEMU maintainers should be in favour of removing Xen specific code
> from QEMU, and it is also thought that you could demonstrate how to
> use this to make disaggregated device models for KVM's case.  It is
> further postulated that there may be many people out there with dev
> boards and experiments with FPGAs and strange  PCI stuff, they don't
> want to wrestle with QEMUs PCI emulator.  With the libqemu, it may
> just take 50 lines of of code for a random developer to plug some
> hardware together and make a simulator.

The way QEMU community solved this problem is by using device tree to
assemble a machine at runtime with the things needed.  It works very
nicely. It can be done on x86 too as long as one doesn't try to pretend
to emulate an existing well defined machine such as PIIX3.


> There was discussion on if a halfway solution might be easier.
> However it was concluded that such a solution would likely only
> benefit Xen as a quick fix, and not as much as the full libqemu idea,
> and so not look that appealing from QEMUs perspective. While the
> the full libqemu idea would benefit many more people, allowing an
> explosion of QEMU potential use cases and users.  More people using a
> project should mean more contributors.
>
> It was concluded that this was largely a political issue, and that we
> need to find out what the objections really are.  If we where to
> convince everyone of the benefit, then we'd probably need to step up
> and to much of the work - however, this is still likely to be less
> work then maintaining the current set-up.  There was further
> discussion on who our allies might be, and the approach should take to
> persuade people.  It was stressed that we need to sell the benefits of
> this system i.e. "Releasing its full potential".
> 
> The alternative to the politics may be to simply fork the project
> again - the previous fork lasted a decade.  However it should be much
> better to cooperate, and so we much try.
> 
> Action item
> -----------
> 
> Ian to reach out to Peter Maydell and discuss the issue, and to
> consider writting down new proposal.
> (updated: Probebly to early for writeup.)


* Re: QEMU XenServer/XenProject Working group meeting 29th September 2016
  2016-10-18 19:54     ` Stefano Stabellini
@ 2016-10-20 17:37       ` Lars Kurth
  2016-10-20 18:53         ` Stefano Stabellini
  0 siblings, 1 reply; 74+ messages in thread
From: Lars Kurth @ 2016-10-20 17:37 UTC (permalink / raw)
  To: Stefano Stabellini; +Cc: Jennifer Herbert, xen-devel


> On 18 Oct 2016, at 20:54, Stefano Stabellini <sstabellini@kernel.org> wrote:
> 
> I think this kind of calls should be announced on xen-devel before they
> happen, to give a chance to other people to participate (I cannot
> promise I would have participated but it is the principle that counts).
> 
> If I missed the announcement, I apologize.

Stefano, the meeting started off as an internal meeting to brainstorm and share experiences and challenges we have with QEMU amongst different Citrix teams with a view to get a wider dialog started. Maybe we are at the stage where it makes sense to open it up. 

> On Fri, 14 Oct 2016, Jennifer Herbert wrote:
>> XenStore
>> --------
>> 
>> For the non-pv part of QEMU, XenStore is only used in two places.
>> There is the DM state, and the physmap mechanism.  Although there is a
>> vague plan for replacing the physmap mechanism, it is some way off.
>> 
>> The DM state key is used for knowing when the qemu process is running
>> etcetera, QMP would seem to be an option to replace it - however there
>> is no (nice) way to wait on a socket until it has been opened.  One
>> solution might be to use Xenstore to let you know the QMP sockets
>> where available, before QEMU drops privileges,  and then QMP could be
>> used to know QEMU is in the running state.
>> 
>> To avoid the need to use xs-restrict, you would need to both replace
>> physmap and rework qemu startup procedure. The use of xs-restrict would
>> be more expedient, and does not look to need that much work.
>> 
>> Discussion was had over how secure it would be to allow a guest access
>> to these Xenstore keys - it was concluded that a guest could mostly
>> only mess itself up.  If I guest attempted to prevent itself from being
>> migrated, the tool stack time it out, and could kill it.
>> 
>> There followed a discussion on the Xenbus protocol, and additions
>> needed.  The aim is to merely restrict the permission for the command,
>> to that of the guest who's domID you provide.  It was proposed that
>> it uses the header as is, with its  16 bytes, with the command
>> 'one-time-restrict' , and then the payload would have two additional
>> field at the start.  These two field would correspond to the domid to
>> restrict as, and the real command. Transaction ID and tags would be
>> taken from the real header.
>> 
>> Although inter domain xs-restrict is not specifically needed for this
>> project, it is thought it might be a blocking items for upstream
>> acceptance.  It it thoughts these changes would not require that much
>> work to implement, and may be useful in use use cases. Only a few
>> changes to QEMU would be needed, and libxl should be able to track
>> QEMU versions.  Ian Jackson volunteered to look at this, with David
>> helping  with the kernel bits.  Ian won't have time to look at this
>> until after Xen 4.8 is released.
>> 
>> There discussion about what may fail once privileges are taken away,
>> which would include CDs and PCI pass though.  It is thought the full
>> list can only be known by trying.  Not everything needs to work for
>> acceptance upstream, such as PCI pass though.   If such an
>> incompatible feature is needed, restrictions can be turned off.  These
>> problems can be fixed in a later phase, with CDs likely being at teh
>> top of the list.
> 
> One thing to note is that xs-restrict is unimplemented in cxenstored.
> 
> 
>> disaggregation
>> =============
>> 
>> A disaggregation proposal which had previously been posted to a QEMU
>> forum was discussed.  It was not previously accepted by all. The big
>> question was how to separate the device models from the machine, with
>> a particular point of contention being around PIIX and the idea of
>> starting a QEMU instance without one.
> 
> Right. In particular I tend to agree with the other QEMU maintainers
> when they say: why ask for a PIIX3 compatible machine, when actually you
> don't want to be PIIX3 compatible?
> 
> 
>> The general desire from us is
>> we want to have a specific device emulated and nothing else.
> 
> This is really not possible with QEMU, because QEMU is a machine
> emulator, not a device emulator. BTW who wants this? I mean, why is this
> part of the QEMU depriv discussion? It is not necessary. I think what we
> want for QEMU depriv is to be able to build a QEMU PV machine with just
> the PV backends in it, which is attainable with the current
> architecture. I know there are use cases for having an emulator of just
> one device, but I don't think they should be confused with the more
> important underlying issue here, which is QEMU running with full
> privileges.
> 
> 
>> It is
>> suggested you would have a software interface between each device that
>> looked a software version of PCI.  The PIIX device could be attached to
>> CPU this pseudo PCI interface.  This would fit in well with how IOREQ
>> server and IOMMU works.  Although this sounds like a large
>> architectural change is wanted, its suggested that actually its just
>> that we're asking them to take a different stability and plug-ability
>> posture on the interfaces they already have.
>> 
>> This architectural issue is the cause behind lots of little
>> annoyances, which have been going on for years. Xen is having to make
>> up lots of strange stuff to keep QEMU happy, and there is confusion
>> over memory ownership.  Fixing the architecture  should make our lives
>> much easier.  These architectural issues are also making things
>> difficult for Intel, who are trying to work around the issue with Xen
>> changes, which may just worsen the problem.  This means this is
>> effectively blocking them.
>> 
>> It is proposed that instead of having a QEMU binary, what is really
>> wanted is a QEMU library.  With a library you could easily take the
>> bits needed, create your own main loop and link them to whatever
>> interface, IOREQ services or IPC mechanism is needed. There would be
>> no longer be a need for the IOREQ server to be in QEMU, which is
>> thought should be an attractive idea for the QEMU maintainers.  It is
>> also thought that other projects, such as the clear containers people
>> would also benefit from such an architecture.  The idea of spiltting
>> out the CPU code from the device code may even be attractive to KVM.
> 
> The idea of having a QEMU library has always been resisted upstream. It
> takes the project in a very different direction. As QEMU maintainer I
> don't know if such a thing would actually be good for the QEMU
> community.

We revisited the original disaggregation thread (Wei originally proposed the patches), and what we proposed at the time was a sort of half-way house that was very Xen-specific and not really of much use to any QEMU downstream other than Xen. Even then, opinions amongst QEMU maintainers were divided: some were in favour, some were not. But we would definitely need to make a good case, do some convincing upfront, address the concerns of the QEMU community, and work with the QEMU maintainers from the get-go. As you rightly point out, such an approach does change some of the fundamental assumptions within QEMU, and we wouldn't want to do this if there are no benefits to QEMU. I think it is worthwhile trying this again. You may have some further insights, which would be quite valuable. 

Regards
Lars


* Re: QEMU XenServer/XenProject Working group meeting 29th September 2016
  2016-10-20 17:37       ` Lars Kurth
@ 2016-10-20 18:53         ` Stefano Stabellini
  0 siblings, 0 replies; 74+ messages in thread
From: Stefano Stabellini @ 2016-10-20 18:53 UTC (permalink / raw)
  To: Lars Kurth; +Cc: Stefano Stabellini, Jennifer Herbert, xen-devel

On Thu, 20 Oct 2016, Lars Kurth wrote:
> > On 18 Oct 2016, at 20:54, Stefano Stabellini <sstabellini@kernel.org> wrote:
> > 
> > I think this kind of calls should be announced on xen-devel before they
> > happen, to give a chance to other people to participate (I cannot
> > promise I would have participated but it is the principle that counts).
> > 
> > If I missed the announcement, I apologize.
> 
> Stefano, the meeting started off as an internal meeting to brainstorm and share experiences and challenges we have with QEMU amongst different Citrix teams with a view to get a wider dialog started. Maybe we are at the stage where it makes sense to open it up. 

No worries, I didn't mean to pick on you guys, as I wrote I am not sure
I would actually have participated, but I think the meeting would work
better for you and the Xen community if it was open. In fact I think
that we don't have enough open meetings in the Xen community in general:
for your information Julien and I are going to start organizing one for
Xen on ARM soon, with the intention of making it a regular monthly
meeting.


> > On Fri, 14 Oct 2016, Jennifer Herbert wrote:
> >> XenStore
> >> --------
> >> 
> >> For the non-pv part of QEMU, XenStore is only used in two places.
> >> There is the DM state, and the physmap mechanism.  Although there is a
> >> vague plan for replacing the physmap mechanism, it is some way off.
> >> 
> >> The DM state key is used for knowing when the qemu process is running,
> >> etcetera.  QMP would seem to be an option to replace it - however there
> >> is no (nice) way to wait on a socket until it has been opened.  One
> >> solution might be to use Xenstore to let you know the QMP sockets
> >> were available, before QEMU drops privileges, and then QMP could be
> >> used to know QEMU is in the running state.
> >> 
> >> To avoid the need to use xs-restrict, you would need to both replace
> >> physmap and rework the qemu startup procedure.  The use of xs-restrict
> >> would be more expedient, and does not look like it needs that much work.
> >> 
> >> Discussion was had over how secure it would be to allow a guest access
> >> to these Xenstore keys - it was concluded that a guest could mostly
> >> only mess itself up.  If a guest attempted to prevent itself from being
> >> migrated, the toolstack would time it out, and could kill it.
> >> 
> >> There followed a discussion on the Xenbus protocol, and the additions
> >> needed.  The aim is merely to restrict the permissions for the command
> >> to those of the guest whose domID you provide.  It was proposed that
> >> it uses the header as is, with its 16 bytes, with the command
> >> 'one-time-restrict', and then the payload would have two additional
> >> fields at the start.  These two fields would correspond to the domid
> >> to restrict as, and the real command.  Transaction ID and tags would
> >> be taken from the real header.
> >> 
> >> Although inter-domain xs-restrict is not specifically needed for this
> >> project, it is thought it might be a blocking item for upstream
> >> acceptance.  It is thought these changes would not require that much
> >> work to implement, and may be useful in other use cases.  Only a few
> >> changes to QEMU would be needed, and libxl should be able to track
> >> QEMU versions.  Ian Jackson volunteered to look at this, with David
> >> helping with the kernel bits.  Ian won't have time to look at this
> >> until after Xen 4.8 is released.
> >> 
> >> There was discussion about what may fail once privileges are taken
> >> away, which would include CDs and PCI passthrough.  It is thought the
> >> full list can only be known by trying.  Not everything needs to work
> >> for acceptance upstream, such as PCI passthrough.  If such an
> >> incompatible feature is needed, restrictions can be turned off.  These
> >> problems can be fixed in a later phase, with CDs likely being at the
> >> top of the list.
> > 
> > One thing to note is that xs-restrict is unimplemented in cxenstored.
> > 
> > 
> >> disaggregation
> >> =============
> >> 
> >> A disaggregation proposal which had previously been posted to a QEMU
> >> forum was discussed.  It was not previously accepted by all. The big
> >> question was how to separate the device models from the machine, with
> >> a particular point of contention being around PIIX and the idea of
> >> starting a QEMU instance without one.
> > 
> > Right. In particular I tend to agree with the other QEMU maintainers
> > when they say: why ask for a PIIX3 compatible machine, when actually you
> > don't want to be PIIX3 compatible?
> > 
> > 
> >> The general desire from us is
> >> we want to have a specific device emulated and nothing else.
> > 
> > This is really not possible with QEMU, because QEMU is a machine
> > emulator, not a device emulator. BTW who wants this? I mean, why is this
> > part of the QEMU depriv discussion? It is not necessary. I think what we
> > want for QEMU depriv is to be able to build a QEMU PV machine with just
> > the PV backends in it, which is attainable with the current
> > architecture. I know there are use cases for having an emulator of just
> > one device, but I don't think they should be confused with the more
> > important underlying issue here, which is QEMU running with full
> > privileges.
> > 
> > 
> >> It is
> >> suggested you would have a software interface between each device that
> >> looked like a software version of PCI.  The PIIX device could be
> >> attached to the CPU via this pseudo-PCI interface.  This would fit in
> >> well with how the IOREQ server and IOMMU work.  Although this sounds
> >> like a large architectural change is wanted, it's suggested that
> >> actually we're just asking them to take a different stability and
> >> pluggability posture on the interfaces they already have.
> >> 
> >> This architectural issue is the cause behind lots of little
> >> annoyances, which have been going on for years.  Xen is having to make
> >> up lots of strange stuff to keep QEMU happy, and there is confusion
> >> over memory ownership.  Fixing the architecture should make our lives
> >> much easier.  These architectural issues are also making things
> >> difficult for Intel, who are trying to work around the issue with Xen
> >> changes, which may just worsen the problem.  This means this is
> >> effectively blocking them.
> >> 
> >> It is proposed that instead of having a QEMU binary, what is really
> >> wanted is a QEMU library.  With a library you could easily take the
> >> bits needed, create your own main loop and link them to whatever
> >> interface, IOREQ services or IPC mechanism is needed.  There would no
> >> longer be a need for the IOREQ server to be in QEMU, which it is
> >> thought should be an attractive idea for the QEMU maintainers.  It is
> >> also thought that other projects, such as the Clear Containers people,
> >> would also benefit from such an architecture.  The idea of splitting
> >> out the CPU code from the device code may even be attractive to KVM.
> > 
> > The idea of having a QEMU library has always been resisted upstream. It
> > takes the project in a very different direction. As QEMU maintainer I
> > don't know if such a thing would actually be good for the QEMU
> > community.
> 
> We revisited the original disaggregation thread (Wei originally proposed the patches) and what we proposed at the time was a sort of a half-way house that was very Xen specific and not really of much use to any QEMU downstream other than Xen. Even then, opinions amongst QEMU maintainers were divided: some were in favour, some were not. But we would definitely need to make a good case, do some convincing upfront, address the concerns of the QEMU community, and work with the QEMU maintainers from the get-go. As you rightly point out, such an approach does change some of the fundamental assumptions within QEMU, and we wouldn't want to do this if there are no benefits to QEMU. I think it is worthwhile trying this again. You may have some further insights, which would be quite valuable.

OK, these are my 2 cents: for political, licensing and technical reasons
I don't think that pushing for libqemu is a worthwhile goal. But I think
that something along the lines of what Wei suggested originally could
still work, though I agree that we need to make it less Xen specific.

I think the way to generalize it is to disentangle CPU emulation,
architecture emulation (x86/ARM/etc.) and board emulation
(VersatilePB/CubieBoard/etc.). QEMU should be able to emulate a board
with no CPUs. Or the same board with an ARM or a PowerPC CPU. For
example QEMU should definitely be able to emulate a VersatilePB without
a CPU or with a non-ARM CPU. At that point we could ask QEMU to emulate
a PIIX3 board with no CPUs or a Xen PV board with no CPUs. This idea
comes from past conversations with the QEMU guys.


* Re: QEMU XenServer/XenProject Working group meeting 10th February 2017
  2016-10-14 18:01   ` QEMU XenServer/XenProject Working group meeting 29th September 2016 Jennifer Herbert
  2016-10-18 19:54     ` Stefano Stabellini
@ 2017-02-28 18:18     ` Jennifer Herbert
  2017-06-05 13:48       ` QEMU XenServer/XenProject Working group meeting 10th May 2017 Jennifer Herbert
  1 sibling, 1 reply; 74+ messages in thread
From: Jennifer Herbert @ 2017-02-28 18:18 UTC (permalink / raw)
  To: xen-devel

QEMU XenServer/XenProject Working group meeting 10th February 2017
==================================================================

Attendance
----------

Andrew Cooper
Ian Jackson
Paul Durrant
Jennifer Herbert
George Dunlap
Igor Druzhinin
Sergey Dyasli


First topic: DMop
-----------------

After an introduction, Paul started by describing his recent progress
with DMOP.  He has posted a privcmd series, including the DMOP
hypervisor parts.  This adds the dmop ioctl and the restriction ioctl.
The restrict ioctl restricts the dmop access to that of the given
domain, and prevents further access via privcmd.

Paul has also written all the code for libdevicemodel - 4 patches.  This
code is Linux specific, and will detect if it's running on an OS it
doesn't support (either old Linux or Windows), or where the privcmd
driver doesn't support the dm ioctl.
He next intends to add a compatibility layer to libxenctrl, which
reroutes old xc calls to the new way.

This would make the calls switch transparently, but you'd know at the
time you made the restriction call whether it succeeded, and can use
local policy to determine whether to continue.
Following this, QEMU needs to be ported onto this new library.

Next was a lengthy discussion of QEMU's mapcache, and of how to secure
mmap calls.  It was eventually concluded that Paul could just extend his
current patches to also include the mmap calls.

It was asked if using this library would mean having two fds, but this
was said to be only true if you continued to keep xenctrl open – which
you shouldn't, since the new restricted ioctl fd should be able to do
everything you might need the old one for.

Paul detailed how the privcmd driver now has a domid associated with 
each fd, which defaults to 'invalid'.  The restrict ioctl sets this 
domid, and from then on, every call from that fd is checked against that 
domain.  The restrict ioctl won't allow the domid to be changed again 
later.
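
The per-fd restriction scheme Paul describes can be modelled in a few
lines.  This is a Python sketch of the logic only - the real
implementation is a Linux privcmd ioctl in C, and the method names here
are illustrative, not the actual ioctl numbers:

```python
DOMID_INVALID = 0x7FF4  # Xen's DOMID_INVALID; the "not yet restricted" default


class PrivcmdFd:
    """Model of a privcmd file descriptor with a one-shot domid restriction."""

    def __init__(self):
        self.domid = DOMID_INVALID  # defaults to invalid, i.e. unrestricted

    def restrict(self, domid):
        # The restrict ioctl sets the domid once; it cannot be changed later.
        if self.domid != DOMID_INVALID:
            raise PermissionError("fd already restricted")
        self.domid = domid

    def dmop(self, target_domid, op):
        # Every call on this fd is checked against the stored domid.
        # An unrestricted fd (still DOMID_INVALID) retains full access.
        if self.domid != DOMID_INVALID and self.domid != target_domid:
            raise PermissionError("fd restricted to another domain")
        return ("dmop", target_domid, op)
```

Once `restrict()` has been called, the fd can never widen its access
again, which is the property that makes dropping privileges safe.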

It was suggested other calls should be checked, but Paul didn't think
there were any others not already discussed.


Next topic: How to prevent QEMU running as root
-----------------------------------------------

There is a patch to make QEMU switch away from the root user.  This is
expected to make a number of things stop working:

* CD-ROM insert - Eject is expected to work, if you could insert one.
* Emulated hotplug - unsure.
* PCI passthrough - This is deemed out of scope, since PCI passthrough
is inherently unsafe anyway.  It should give a useful error message
suggesting turning off de-priv.  This could be made to work in future.

It is thought that some of these problems could be solved by passing fds 
over QMP.  Furthermore, it was thought that DavidV. had looked, and 
found you could already do this.  If so, only toolstack changes would be 
needed.  Making emulated network hotplug or IDE hotplug work is probably 
not important.

NOTE: Need to check QMP and its applicability to this problem.
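
QMP does have a 'getfd' command that receives a file descriptor as
SCM_RIGHTS ancillary data on the QMP Unix socket.  The sketch below
shows the mechanics in Python; the helper names and the 'fdname'
bookkeeping on QEMU's side are illustrative - the open question in the
meeting (whether this covers CD insert and hotplug) is not settled by
this sketch:

```python
import array
import json
import socket

def send_fd_with_qmp_getfd(qmp_sock, fd, fdname):
    """Send a QMP 'getfd' command with a file descriptor attached.

    The privileged toolstack opens the file (e.g. a CD image) itself and
    hands only the fd to the deprivileged QEMU, as SCM_RIGHTS ancillary
    data on the QMP Unix socket; QEMU then refers to it by 'fdname' in
    later commands.
    """
    cmd = json.dumps({"execute": "getfd", "arguments": {"fdname": fdname}})
    qmp_sock.sendmsg(
        [cmd.encode()],
        [(socket.SOL_SOCKET, socket.SCM_RIGHTS, array.array("i", [fd]))])

def recv_msg_and_fd(sock, bufsize=4096):
    """Receive one message plus (at most) one fd -- the receiving side."""
    fds = array.array("i")
    msg, ancdata, _, _ = sock.recvmsg(bufsize, socket.CMSG_LEN(fds.itemsize))
    for level, ctype, data in ancdata:
        if level == socket.SOL_SOCKET and ctype == socket.SCM_RIGHTS:
            # Trim any padding to a whole number of C ints before decoding.
            fds.frombytes(data[:len(data) - (len(data) % fds.itemsize)])
    return msg, (fds[0] if fds else None)
```

The receiving process gets a fresh descriptor referring to the same open
file, so the deprivileged side never needs the rights to open the file
itself.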


Next Topic: Xenstore
--------------------

Juergen has a proposal for xenstored, to have something similar to the
dm-op ioctl.  Each command can have a prefix which requests that it be
executed with a specified domain security context.

It's also necessary to run multiple QEMUs, such that the PV back-ends
can be run in a separate instance from the emulators.  In doing this,
you also need to separate out the xenstore paths, such that each QEMU
uses its own xenstore branch.
Part of the reason is that the backend paths you are using to control
the emulating, deprivileged QEMU have to be writeable by the guest,
because that QEMU is going to have to write to them.

The emulator patch ID stuff is already upstream – and can be invoked.
But the toolstack has to be able to cope.  Ian wrote a libxl patch
series (20~30 patches) 18 months ago, but that has bit-rotted.  This
series was said to be surprisingly long.

Andrew is concerned that giving a guest permission to write to the 
xenstore paths used by its QEMU may cause a significant security issue, 
while others think it would merely offer a new way for it to crash 
itself.  Andrew to inspect Ian's patch series to evaluate this risk.

There are only a handful of xenstore paths needed by the emulating QEMU
– a number which will only drop as QMP use increases.
Another thing to consider is that you'll have to be able to talk to
multiple QEMUs via QMP – knowing which is which!

There was a suggestion that you could differentiate between the PV QEMU
and the emulating QEMU by how you're communicating with it – but this
wouldn't really be possible, as QEMU upstream won't accept anything
using xenstore if it might be useful to more than just the Xen
community.


XenServer is currently working towards moving to QEMU upstream, with
its core functionality currently working using temporary Python shims
translating the API calls.  If depriv is upstreamed in time, XenServer
won't have to transition the APIs twice.

Given this, it was decided it would be easiest for libxl to be tested
with the changes first, to see how well QEMU works as non-root.

Next Topic:  Drop Caps
----------------------

Andrew points out that there is more to running without root privileges
than just changing your UID.  He recalls how a past colleague had needed
to use drop-caps, and moved things to an anonymous namespace, to prevent
things like being able to open TCP sockets.

Such operations are Linux specific, but as long as they are enabled by
an option, this is OK.  It was concluded these were incremental benefits
which could be done later, after we got the basics working.  This issue
is also not Xen specific.



Action Items
------------

Paul to carry on with Xen device model work.

Ian to re-factor LibXL after libxl split.

Ian to negotiate xenstore restrictions

Ian to check QMP can take an fd and use it for CD insert and emulated
hotplug.

Andrew to look over Ian's patch queue to see how bad giving a guest the
extra xenstore permissions would be.

Jennifer to continue leading XenServer's changeover from QEMU trad to
QEMU upstream.  XenServer is making slow progress here, but it is not
the highest priority.



* QEMU XenServer/XenProject Working group meeting 10th May 2017
  2017-02-28 18:18     ` QEMU XenServer/XenProject Working group meeting 10th February 2017 Jennifer Herbert
@ 2017-06-05 13:48       ` Jennifer Herbert
  0 siblings, 0 replies; 74+ messages in thread
From: Jennifer Herbert @ 2017-06-05 13:48 UTC (permalink / raw)
  To: xen-devel

QEMU XenServer/XenProject Working group meeting 5th May 2017
=============================================================

Attendees:
* Paul Durrant
* Andrew Cooper
* Ian Jackson
* Jenny Herbert
* Igor Druzhinin
* Simon Crow
* Marcus Granado

Reviewed previous action points

* Paul to carry on with Xen device model work. - Done.
* Ian - No progress with refactoring the libxl split, XenStore
   restrictions, or using an FD for QMP CD insert and emulated hotplug,
   due to the feature freeze for 4.9.
* Andrew to look over Ian's patch series to see how bad the extra
   XenStore permissions would be.  - Not done.
* Jenny – XenServer is continuing to make slow progress.  We have a
   XenCenter patch that means it can talk to QEMU trad, working on some
   bugs.  Various other small bits of work done.

Andrew to double check dm-op sub ops for continuation support.

XenServer uses XenStore to communicate the VNC QEMU clipboard to the
guest and back.  It was concluded this isn't nice, as it has caused
various security problems in the past.  The new plan is to implement
this using a shared page.

Next, XenStore restriction was discussed, including an approach
involving a shared ring buffer: removing the previous xs-restrict and
instead implementing an xs-restriction prefix command within the xenbus
command.  Slightly fiddly, but a small number of parts.  It's doable and
not controversial.
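
The prefix-command idea (described in more detail in the September 2016
minutes: keep the 16-byte xenbus header as-is, and put two extra fields
at the front of the payload - the domid to restrict as, and the real
command type) can be sketched as a wire-format round trip.  The command
number 128 and the little-endian packing are illustrative assumptions,
not allocated values:

```python
import struct

# xenbus wire header: type, req_id, tx_id, len -- 16 bytes, as in xsd_sockmsg.
HEADER = struct.Struct("<IIII")
XS_RESTRICT_PREFIX = 128  # illustrative command number, not an allocated one

def wrap_restricted(domid, real_type, req_id, tx_id, payload):
    """Wrap a xenstore request so xenstored executes it with the given
    domain's security context: the header is kept as-is, and the payload
    gains two leading fields -- the domid to restrict as, and the real
    command type."""
    body = struct.pack("<II", domid, real_type) + payload
    return HEADER.pack(XS_RESTRICT_PREFIX, req_id, tx_id, len(body)) + body

def unwrap_restricted(msg):
    """xenstored's side: recover the domid and the real request."""
    cmd, req_id, tx_id, length = HEADER.unpack_from(msg)
    assert cmd == XS_RESTRICT_PREFIX and length == len(msg) - HEADER.size
    domid, real_type = struct.unpack_from("<II", msg, HEADER.size)
    return domid, real_type, req_id, tx_id, msg[HEADER.size + 8:]
```

Transaction ID and request ID travel in the outer header, matching the
earlier proposal that they be taken from the real header.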

Andrew had a different plan: to remove any use of XenStore from QEMU –
specifically for the HVM case.  There is currently little use in the
HVM case: two uses should be trivial to implement in QMP, and the other
is physmap, which Igor has been working on.

It's agreed the physmap keys need removing, and that removing the last
few uses of XenStore would be a good idea.  However, if dealing with
the physmap issue were to take a long time, xs-restrict could be used
as an intermediate plan, which would allow this project to move on
while physmap or any other complications are being sorted out.

Andrew is not convinced that allowing a guest to fiddle with the
physmap keys (as would have to be permitted under the xs-restrict
scheme) would be safe.  Certainly the guest could destroy itself, which
in itself is permissible, but as it changes the internal virtual memory
layout for QEMU, there might be some exploit hidden there – this would
need to be checked.

The conversation turns to the physmap keys, being the biggest XenStore
concern.  Andrew suggests that the correct way to fix it in a
compatible way would be for the foreign grant map call to take in a
pointer at which the range would be mmapped, allowing the range to be
mapped exactly where it pointed.  A new library call to map foreign
memory could be written, but the complication of needing compatibility
for QEMU with older versions of Xen was raised.

Ian suggested that this wasn't really a problem, since we could do the
old thing with the old versions.  Libxl already tests for the version
of QEMU, and so if it finds it too old, it does the physmap keys thing;
otherwise, it can use QEMU-depriv, including the new physmap
mechanism.
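
The compatibility decision Ian describes reduces to a simple gate.  A
minimal Python sketch, where the capability flag is a stand-in for
libxl's actual QEMU version check:

```python
def physmap_mechanism(qemu_has_fixed_address_map):
    """libxl's version gate, as described in the discussion: old QEMUs get
    the old XenStore physmap-keys mechanism and stay privileged; new ones
    run deprivileged with the new in-QEMU physmap mechanism, which needs
    no key substitute since QEMU already has the information."""
    if qemu_has_fixed_address_map:
        return {"depriv": True, "write_physmap_keys": False}
    return {"depriv": False, "write_physmap_keys": True}
```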

The new physmap mechanism would not require a substitute for the
physmap keys, as QEMU already has all the information it needs.  The
keys were entirely to allow two parts of startup logic in QEMU to be
reversed.  Andrew says that Igor has a solution but doesn't think it's
upstreamable.

Ian suggests that, given the size of the xs-restrict patch queue, if we
can fix the physmap issue relatively easily, we should, and then he
could drop most of the xs-restrict patch queue.  Ian offers to help
with the libxl part of the physmap work.

Igor explains an approach he tried hit a block in QXL, which
initialised certain registers and wrote them into various memory
locations.  He tried in each place to insert an 'if Xen and
runstate == resuming, don't touch' check, but the fix is very intrusive
in terms of QXL code, and so he didn't try to upstream it.

The idea of using helper functions was discussed.  Other helper
functions have been used before for similar problems.  A function could
be created to access vram, instead of just writing to a pointer.  All
the access parts of the QXL code could be redirected through this
helper function, and that would likely be upstreamable.  In particular,
it's been suggested before that helper functions could be used for
range checking.  They could be added for range checking, and then also
modified with the 'if Xen and resuming' clauses.
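
As a sketch of that helper-function idea (Python standing in for
QEMU's C code; the class and flag names are illustrative, not QEMU's
actual API), a vram accessor that both range-checks and honours the
'if Xen and resuming' rule might look like:

```python
class VramAccessor:
    """Accessor standing between QXL code and its vram buffer: it
    range-checks every write, and skips writes while Xen is resuming so
    the restored guest memory is not clobbered by QXL's register
    re-initialisation."""

    def __init__(self, vram, xen_enabled=False, resuming=False):
        self.vram = vram          # backing store, e.g. a bytearray
        self.xen_enabled = xen_enabled
        self.resuming = resuming

    def write_u8(self, offset, value):
        # Range check: previously a direct pointer write, unchecked.
        if not 0 <= offset < len(self.vram):
            raise IndexError("vram write out of range")
        # 'if Xen and runstate == resuming': don't touch vram.
        if self.xen_enabled and self.resuming:
            return
        self.vram[offset] = value
```

Because every QXL vram access goes through one function, both concerns
(range checking and the Xen resume special case) live in a single,
reviewable place instead of being scattered across the device model.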

Igor explains how another approach he had been looking at was to change
the order of QEMU startup and move the memory map, actually mapping the
memory where QEMU expects it to be mapped.  Here again, there was the
issue of compatibility: the compat layer is still needed to work with
old versions of libxc.  It was discussed how QEMU would have to be able
to work with both variants.  It could be decided at compile time, as at
that point you know whether you have the new function call; if the new
function call is not available, it would default to the old behaviour.
A goal of Igor's was to remove the physmap XenStore keys code
altogether, but for the depriv project, the important bit was that they
were not used.  It was also noted that the compat code can eventually
be retired, when it's felt no one is using it any more.

In the XenServer case, using QEMU-trad, XenServer hasn't needed to
create the physmap keys, and would rather not start.  A benefit for
libxl is it could get rid of a magic record in the migration stream.

The new foreign memory mapping library call would still be needed.
This is too late for 4.9, but that's not a problem, and XenServer can
backport.  RFCs can be posted sooner.

The conversation returns to XenStore.  The patch queue can be
substantially cut down, which is a good thing.  The other uses of
XenStore are thought to be the 'start log dirty' and 'shutdown' keys.
It was thought likely that QEMU might already have QMP commands to do
these things, but if not, they are likely to be very happy to allow
them to be added.  After this, XenStore can be moved into the
PV-specific part of the code.

Jenny suggested ring0 could put an item on their backlog to look at the
log dirty and shutdown code.  They should also do a scan for other
uses.  The XenStore clipboard problem is entirely a XenServer problem,
and they can implement the new clipboard mechanism in their own time.

Ian is still on the hook for the CD and libxl work.  He will still have
to do the two-QEMUs code, and all the code to do with starting and
stopping two QEMUs - but it is much less painful.

Jenny has already created a confluence page linking to previously
posted meeting notes, and it was suggested this could be extended to
include action points, links to created code or links to related
discussions.

Action Items
------------

* Andrew to double-check dm-op sub-ops for continuation support.
* Igor to implement and post patches to create the new mem map function,
   and to implement the new physmap mechanism in QEMU.
* Ian to look at using an FD for QMP CD insert and emulated hotplug,
   as well as splitting the two QEMUs.
* Jenny to create internal Jira tickets looking at removing other uses
   of XenStore in QEMU, as well as updating the Confluence page.



end of thread, other threads:[~2017-06-05 13:48 UTC | newest]

Thread overview: 74+ messages
2016-07-28 17:01 XenProject/XenServer QEMU working group, Friday 8th July, 2016, 15:00 Jennifer Herbert
2016-08-01 11:32 ` Device model operation hypercall (DMOP, re qemu depriv) Ian Jackson
2016-08-01 12:41   ` Jan Beulich
2016-08-02 11:38     ` Wei Liu
2016-08-02 11:58       ` Jan Beulich
2016-08-02 13:02         ` David Vrabel
2016-08-02 13:29           ` Jan Beulich
2016-08-03 10:29       ` Ian Jackson
2016-08-03 12:03         ` Jan Beulich
2016-08-03 13:37           ` Ian Jackson
2016-08-03 14:16             ` Jan Beulich
2016-08-03 14:21               ` George Dunlap
2016-08-03 16:10                 ` Ian Jackson
2016-08-03 16:18                   ` Jan Beulich
2016-08-04 11:21                     ` Ian Jackson
2016-08-04 13:24                       ` Jan Beulich
2016-08-05 16:28                         ` Ian Jackson
2016-08-08 11:18                           ` Jan Beulich
2016-08-08 13:46                             ` Ian Jackson
2016-08-08 14:07                               ` Jan Beulich
2016-08-26 11:38                                 ` Ian Jackson
2016-08-26 12:58                                   ` Jan Beulich
2016-08-26 14:35                                     ` Ian Jackson
2016-08-26 15:13                                       ` Jan Beulich
2016-08-30 11:02                                         ` Ian Jackson
2016-08-30 21:47                                           ` Stefano Stabellini
2016-09-02 14:08                                           ` Wei Liu
2016-08-09 10:29                               ` Jan Beulich
2016-08-09 10:48                                 ` Ian Jackson
2016-08-09 11:30                                   ` Jan Beulich
2016-08-12  9:44                                     ` George Dunlap
2016-08-12 11:50                                       ` Jan Beulich
2016-08-15  9:39                                         ` George Dunlap
2016-08-15 10:19                                           ` Jan Beulich
2016-08-15 10:47                                             ` George Dunlap
2016-08-15 11:20                                               ` Jan Beulich
2016-08-15 12:07                                                 ` Ian Jackson
2016-08-15 14:20                                                   ` Jan Beulich
2016-08-15 14:57                                                 ` George Dunlap
2016-08-15 15:22                                                   ` Jan Beulich
2016-08-15 14:50                                 ` David Vrabel
2016-08-15 15:24                                   ` Jan Beulich
2016-08-26 11:29                                     ` Ian Jackson
2016-08-26 12:58                                       ` Jan Beulich
2016-08-02 11:37   ` Wei Liu
2016-08-02 11:42     ` George Dunlap
2016-08-02 12:34       ` Wei Liu
2016-09-09 15:16   ` Jennifer Herbert
2016-09-09 15:34     ` David Vrabel
2016-09-12 13:47     ` George Dunlap
2016-09-12 14:32     ` Jan Beulich
2016-09-13 10:37       ` George Dunlap
2016-09-13 11:53         ` Jan Beulich
2016-09-13 16:07       ` David Vrabel
2016-09-14  9:51         ` Jan Beulich
2016-09-21 11:21           ` Ian Jackson
2016-09-21 11:28             ` George Dunlap
2016-09-21 11:58               ` Jan Beulich
2016-09-21 11:55             ` Jan Beulich
2016-09-21 12:23               ` Device model operation hypercall (DMOP, re qemu depriv) [and 1 more messages] Ian Jackson
2016-09-21 12:48                 ` Jan Beulich
2016-09-21 13:24                   ` Ian Jackson
2016-09-21 13:56                     ` Jan Beulich
2016-09-21 15:06                       ` Ian Jackson
2016-09-21 17:09                       ` George Dunlap
2016-09-22  8:47                         ` Jan Beulich
2016-09-09 16:18 ` XenProject/XenServer QEMU working group minutes, 30th August 2016 Jennifer Herbert
2016-09-12  7:16   ` Juergen Gross
2016-10-14 18:01   ` QEMU XenServer/XenProject Working group meeting 29th September 2016 Jennifer Herbert
2016-10-18 19:54     ` Stefano Stabellini
2016-10-20 17:37       ` Lars Kurth
2016-10-20 18:53         ` Stefano Stabellini
2017-02-28 18:18     ` QEMU XenServer/XenProject Working group meeting 10th February 2017 Jennifer Herbert
2017-06-05 13:48       ` QEMU XenServer/XenProject Working group meeting 10th May 2017 Jennifer Herbert
