xen-devel.lists.xenproject.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] x86/ept: limit calls to memory_type_changed()
@ 2022-09-22 16:05 Roger Pau Monne
  2022-09-22 19:21 ` Jan Beulich
  2022-09-26 18:03 ` Andrew Cooper
  0 siblings, 2 replies; 12+ messages in thread
From: Roger Pau Monne @ 2022-09-22 16:05 UTC (permalink / raw)
  To: xen-devel
  Cc: Roger Pau Monne, Jan Beulich, Andrew Cooper, Wei Liu,
	George Dunlap, Julien Grall, Stefano Stabellini

memory_type_changed() is currently only implemented for Intel EPT, and
results in the invalidation of EMT attributes on all the entries in
the EPT page tables.  Such invalidation causes EPT_MISCONFIG vmexits
when the guest tries to access any gfns for the first time, which
results in the recalculation of the EMT for the accessed page.  The
vmexit and the recalculations are expensive, and as such should be
avoided when possible.

Remove the call to memory_type_changed() from
XEN_DOMCTL_memory_mapping: there are no modifications of the
iomem_caps ranges anymore that could alter the return of
cache_flush_permitted() from that domctl.

Calls to memory_type_changed() resulting from changes to the domain
iomem_caps or ioport_caps ranges are only relevant for EMT
calculations if the IOMMU is not enabled, and the call has resulted in
a change to the return value of cache_flush_permitted().

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
I feel it's a bit weird to have calls to memory_type_changed() in
common domctl code - for once the domctl that trigger the call doesn't
change memory types, just adds or removes ranges from iomem_caps
(which in turn affects the behaviour of epte_get_entry_emt()).
---
 xen/arch/x86/domctl.c | 18 ++++++++++++++++--
 xen/common/domctl.c   | 11 ++++++++---
 2 files changed, 24 insertions(+), 5 deletions(-)

diff --git a/xen/arch/x86/domctl.c b/xen/arch/x86/domctl.c
index 020df615bd..f1150dffa5 100644
--- a/xen/arch/x86/domctl.c
+++ b/xen/arch/x86/domctl.c
@@ -222,6 +222,7 @@ long arch_do_domctl(
         unsigned int fp = domctl->u.ioport_permission.first_port;
         unsigned int np = domctl->u.ioport_permission.nr_ports;
         int allow = domctl->u.ioport_permission.allow_access;
+        bool flush_permitted = cache_flush_permitted(d);
 
         if ( (fp + np) <= fp || (fp + np) > MAX_IOPORTS )
             ret = -EINVAL;
@@ -232,7 +233,13 @@ long arch_do_domctl(
             ret = ioports_permit_access(d, fp, fp + np - 1);
         else
             ret = ioports_deny_access(d, fp, fp + np - 1);
-        if ( !ret )
+        if ( !ret && !is_iommu_enabled(d) &&
+             flush_permitted != cache_flush_permitted(d) )
+            /*
+             * Only flush if the output of cache_flush_permitted() changes and
+             * IOMMU is not enabled for the domain, otherwise it makes no
+             * difference for EMT calculation purposes.
+             */
             memory_type_changed(d);
         break;
     }
@@ -586,6 +593,7 @@ long arch_do_domctl(
         struct hvm_domain *hvm;
         struct g2m_ioport *g2m_ioport;
         int found = 0;
+        bool flush_permitted = cache_flush_permitted(d);
 
         ret = -EOPNOTSUPP;
         if ( !is_hvm_domain(d) )
@@ -666,7 +674,13 @@ long arch_do_domctl(
                        "ioport_map: error %ld denying dom%d access to [%x,%x]\n",
                        ret, d->domain_id, fmp, fmp + np - 1);
         }
-        if ( !ret )
+        if ( !ret && !is_iommu_enabled(d) &&
+             flush_permitted != cache_flush_permitted(d) )
+            /*
+             * Only flush if the output of cache_flush_permitted() changes and
+             * IOMMU is not enabled for the domain, otherwise it makes no
+             * difference for EMT calculation purposes.
+             */
             memory_type_changed(d);
         break;
     }
diff --git a/xen/common/domctl.c b/xen/common/domctl.c
index 452266710a..1f2f2dfcc2 100644
--- a/xen/common/domctl.c
+++ b/xen/common/domctl.c
@@ -703,6 +703,7 @@ long do_domctl(XEN_GUEST_HANDLE_PARAM(xen_domctl_t) u_domctl)
         unsigned long mfn = op->u.iomem_permission.first_mfn;
         unsigned long nr_mfns = op->u.iomem_permission.nr_mfns;
         int allow = op->u.iomem_permission.allow_access;
+        bool flush_permitted = cache_flush_permitted(d);
 
         ret = -EINVAL;
         if ( (mfn + nr_mfns - 1) < mfn ) /* wrap? */
@@ -716,7 +717,13 @@ long do_domctl(XEN_GUEST_HANDLE_PARAM(xen_domctl_t) u_domctl)
             ret = iomem_permit_access(d, mfn, mfn + nr_mfns - 1);
         else
             ret = iomem_deny_access(d, mfn, mfn + nr_mfns - 1);
-        if ( !ret )
+        if ( !ret && !is_iommu_enabled(d) &&
+             flush_permitted != cache_flush_permitted(d) )
+            /*
+             * Only flush if the output of cache_flush_permitted() changes and
+             * IOMMU is not enabled for the domain, otherwise it makes no
+             * difference for effective cache attribute calculation purposes.
+             */
             memory_type_changed(d);
         break;
     }
@@ -778,8 +785,6 @@ long do_domctl(XEN_GUEST_HANDLE_PARAM(xen_domctl_t) u_domctl)
                        "memory_map: error %ld removing dom%d access to [%lx,%lx]\n",
                        ret, d->domain_id, mfn, mfn_end);
         }
-        /* Do this unconditionally to cover errors on above failure paths. */
-        memory_type_changed(d);
         break;
     }
 
-- 
2.37.3



^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH] x86/ept: limit calls to memory_type_changed()
  2022-09-22 16:05 [PATCH] x86/ept: limit calls to memory_type_changed() Roger Pau Monne
@ 2022-09-22 19:21 ` Jan Beulich
  2022-09-23  8:35   ` Roger Pau Monné
  2022-09-26 18:03 ` Andrew Cooper
  1 sibling, 1 reply; 12+ messages in thread
From: Jan Beulich @ 2022-09-22 19:21 UTC (permalink / raw)
  To: Roger Pau Monne
  Cc: Andrew Cooper, Wei Liu, George Dunlap, Julien Grall,
	Stefano Stabellini, xen-devel

On 22.09.2022 18:05, Roger Pau Monne wrote:
> memory_type_changed() is currently only implemented for Intel EPT, and
> results in the invalidation of EMT attributes on all the entries in
> the EPT page tables.  Such invalidation causes EPT_MISCONFIG vmexits
> when the guest tries to access any gfns for the first time, which
> results in the recalculation of the EMT for the accessed page.  The
> vmexit and the recalculations are expensive, and as such should be
> avoided when possible.
> 
> Remove the call to memory_type_changed() from
> XEN_DOMCTL_memory_mapping: there are no modifications of the
> iomem_caps ranges anymore that could alter the return of
> cache_flush_permitted() from that domctl.

I certainly agree - this was an oversight when the two aspects were
split. One might argue this is a (performance) fix to the earlier
commit, and hence might want to go on its own with a Fixes: tag.

> Calls to memory_type_changed() resulting from changes to the domain
> iomem_caps or ioport_caps ranges are only relevant for EMT
> calculations if the IOMMU is not enabled, and the call has resulted in
> a change to the return value of cache_flush_permitted().

I'm less certain here: These shouldn't be frequent operations, so
their impact on the guest should be limited?

And if we were to restrict the calls, I think we need to clearly
tie together the various places which need updating together in
case e.g. the condition in epte_get_entry_emt() is changed.
Minimally by way of comments, but maybe by way of a small helper
function (for which I can't seem to be able to think of a good
name) sitting next to epte_get_entry_emt().

> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> ---
> I feel it's a bit weird to have calls to memory_type_changed() in
> common domctl code - for once the domctl that trigger the call doesn't
> change memory types, just adds or removes ranges from iomem_caps
> (which in turn affects the behaviour of epte_get_entry_emt()).

Do you have a better suggestion?

Jan


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] x86/ept: limit calls to memory_type_changed()
  2022-09-22 19:21 ` Jan Beulich
@ 2022-09-23  8:35   ` Roger Pau Monné
  2022-09-26  7:33     ` Jan Beulich
  0 siblings, 1 reply; 12+ messages in thread
From: Roger Pau Monné @ 2022-09-23  8:35 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Andrew Cooper, Wei Liu, George Dunlap, Julien Grall,
	Stefano Stabellini, xen-devel

On Thu, Sep 22, 2022 at 09:21:59PM +0200, Jan Beulich wrote:
> On 22.09.2022 18:05, Roger Pau Monne wrote:
> > memory_type_changed() is currently only implemented for Intel EPT, and
> > results in the invalidation of EMT attributes on all the entries in
> > the EPT page tables.  Such invalidation causes EPT_MISCONFIG vmexits
> > when the guest tries to access any gfns for the first time, which
> > results in the recalculation of the EMT for the accessed page.  The
> > vmexit and the recalculations are expensive, and as such should be
> > avoided when possible.
> > 
> > Remove the call to memory_type_changed() from
> > XEN_DOMCTL_memory_mapping: there are no modifications of the
> > iomem_caps ranges anymore that could alter the return of
> > cache_flush_permitted() from that domctl.
> 
> I certainly agree - this was an oversight when the two aspects were
> split. One might argue this is a (performance) fix to the earlier
> commit, and hence might want to go on its own with a Fixes: tag.

Was wondering myself, didn't add the 'Fixes:' tag because of the extra
content.

> > Calls to memory_type_changed() resulting from changes to the domain
> > iomem_caps or ioport_caps ranges are only relevant for EMT
> > calculations if the IOMMU is not enabled, and the call has resulted in
> > a change to the return value of cache_flush_permitted().
> 
> I'm less certain here: These shouldn't be frequent operations, so
> their impact on the guest should be limited?

Citrix has an use case for vGPU where IOMMU regions are added and
removed during guest runtime.  Such functionality makes uses of both
XEN_DOMCTL_iomem_permission and XEN_DOMCTL_memory_mapping.

While the memory_type_changed() call in XEN_DOMCTL_memory_mapping
seems to be the most problematic performance wise, I though it was
nice to try to avoid memory_type_changed() as much as possible, as
those tax the guest quite heavily with EPT_MISCONFIG faults and the
recalculation logic.

> And if we were to restrict the calls, I think we need to clearly
> tie together the various places which need updating together in
> case e.g. the condition in epte_get_entry_emt() is changed.
> Minimally by way of comments, but maybe by way of a small helper
> function (for which I can't seem to be able to think of a good
> name) sitting next to epte_get_entry_emt().

Such helper function is also kind of problematic, as it would have to
live in p2m-ept.c but be used in domctl.c and x86/domctl.c?  It would
have to go through the p2m_domain indirection structure.

Do you have any suggestions about how the function should look like?
I'm afraid the fact it needs the previous cache_flush_permitted()
value makes it kind of weird to encapsulate.

I've attempted to add comments to make it clear why the new checks are
added, but I would also need to add a comment to epte_get_entry_emt()
to notice changes in the condition need to be propagated to call sites
of memory_type_changed().

> > Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> > ---
> > I feel it's a bit weird to have calls to memory_type_changed() in
> > common domctl code - for once the domctl that trigger the call doesn't
> > change memory types, just adds or removes ranges from iomem_caps
> > (which in turn affects the behaviour of epte_get_entry_emt()).
> 
> Do you have a better suggestion?

No, not really, because we need the return value of
cache_flush_permitted() before and after the changes, so it's not as
easy as introducing a single helper sadly.

Thanks, Roger.


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] x86/ept: limit calls to memory_type_changed()
  2022-09-23  8:35   ` Roger Pau Monné
@ 2022-09-26  7:33     ` Jan Beulich
  2022-09-26 14:50       ` Roger Pau Monné
  0 siblings, 1 reply; 12+ messages in thread
From: Jan Beulich @ 2022-09-26  7:33 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: Andrew Cooper, Wei Liu, George Dunlap, Julien Grall,
	Stefano Stabellini, xen-devel

On 23.09.2022 10:35, Roger Pau Monné wrote:
> On Thu, Sep 22, 2022 at 09:21:59PM +0200, Jan Beulich wrote:
>> On 22.09.2022 18:05, Roger Pau Monne wrote:
>>> memory_type_changed() is currently only implemented for Intel EPT, and
>>> results in the invalidation of EMT attributes on all the entries in
>>> the EPT page tables.  Such invalidation causes EPT_MISCONFIG vmexits
>>> when the guest tries to access any gfns for the first time, which
>>> results in the recalculation of the EMT for the accessed page.  The
>>> vmexit and the recalculations are expensive, and as such should be
>>> avoided when possible.
>>>
>>> Remove the call to memory_type_changed() from
>>> XEN_DOMCTL_memory_mapping: there are no modifications of the
>>> iomem_caps ranges anymore that could alter the return of
>>> cache_flush_permitted() from that domctl.
>>
>> I certainly agree - this was an oversight when the two aspects were
>> split. One might argue this is a (performance) fix to the earlier
>> commit, and hence might want to go on its own with a Fixes: tag.
> 
> Was wondering myself, didn't add the 'Fixes:' tag because of the extra
> content.
> 
>>> Calls to memory_type_changed() resulting from changes to the domain
>>> iomem_caps or ioport_caps ranges are only relevant for EMT
>>> calculations if the IOMMU is not enabled, and the call has resulted in
>>> a change to the return value of cache_flush_permitted().
>>
>> I'm less certain here: These shouldn't be frequent operations, so
>> their impact on the guest should be limited?
> 
> Citrix has an use case for vGPU where IOMMU regions are added and
> removed during guest runtime.  Such functionality makes uses of both
> XEN_DOMCTL_iomem_permission and XEN_DOMCTL_memory_mapping.

I see. Maybe this would want saying in the description, to express
that there's little expected benefit for upstream.

> While the memory_type_changed() call in XEN_DOMCTL_memory_mapping
> seems to be the most problematic performance wise, I though it was
> nice to try to avoid memory_type_changed() as much as possible, as
> those tax the guest quite heavily with EPT_MISCONFIG faults and the
> recalculation logic.

Trying to avoid this is certainly desirable, I agree. But we need
to make sure that it's not "easy" to break things by touching one
place but leaving others alone which really would need keeping in
sync. Therefore I'd see such added logic as acceptable only if the
risk towards future changes is sufficiently low.

>> And if we were to restrict the calls, I think we need to clearly
>> tie together the various places which need updating together in
>> case e.g. the condition in epte_get_entry_emt() is changed.
>> Minimally by way of comments, but maybe by way of a small helper
>> function (for which I can't seem to be able to think of a good
>> name) sitting next to epte_get_entry_emt().
> 
> Such helper function is also kind of problematic, as it would have to
> live in p2m-ept.c but be used in domctl.c and x86/domctl.c?  It would
> have to go through the p2m_domain indirection structure.

It would need abstraction at the arch level as well as for !HVM configs
on x86. I'm not sure the indirection layer would actually be needed, as
the contents of the function - despite wanting placing in p2m-ept.c -
isn't really vendor dependent. (If AMD/SVM gained a need for a similar
helper, things would nee re-evaluating.)

> Do you have any suggestions about how the function should look like?
> I'm afraid the fact it needs the previous cache_flush_permitted()
> value makes it kind of weird to encapsulate.

Indeed.

> I've attempted to add comments to make it clear why the new checks are
> added, but I would also need to add a comment to epte_get_entry_emt()
> to notice changes in the condition need to be propagated to call sites
> of memory_type_changed().

Right - it may suffice to have one more extensive comment, but _all_
involved parties will need to have at least a cross reference such
that one can easily find all pieces of code needing to be kept in sync.

Jan


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] x86/ept: limit calls to memory_type_changed()
  2022-09-26  7:33     ` Jan Beulich
@ 2022-09-26 14:50       ` Roger Pau Monné
  2022-09-26 15:25         ` Roger Pau Monné
  0 siblings, 1 reply; 12+ messages in thread
From: Roger Pau Monné @ 2022-09-26 14:50 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Andrew Cooper, Wei Liu, George Dunlap, Julien Grall,
	Stefano Stabellini, xen-devel

On Mon, Sep 26, 2022 at 09:33:10AM +0200, Jan Beulich wrote:
> On 23.09.2022 10:35, Roger Pau Monné wrote:
> > On Thu, Sep 22, 2022 at 09:21:59PM +0200, Jan Beulich wrote:
> >> On 22.09.2022 18:05, Roger Pau Monne wrote:
> >>> memory_type_changed() is currently only implemented for Intel EPT, and
> >>> results in the invalidation of EMT attributes on all the entries in
> >>> the EPT page tables.  Such invalidation causes EPT_MISCONFIG vmexits
> >>> when the guest tries to access any gfns for the first time, which
> >>> results in the recalculation of the EMT for the accessed page.  The
> >>> vmexit and the recalculations are expensive, and as such should be
> >>> avoided when possible.
> >>>
> >>> Remove the call to memory_type_changed() from
> >>> XEN_DOMCTL_memory_mapping: there are no modifications of the
> >>> iomem_caps ranges anymore that could alter the return of
> >>> cache_flush_permitted() from that domctl.
> >>
> >> I certainly agree - this was an oversight when the two aspects were
> >> split. One might argue this is a (performance) fix to the earlier
> >> commit, and hence might want to go on its own with a Fixes: tag.
> > 
> > Was wondering myself, didn't add the 'Fixes:' tag because of the extra
> > content.
> > 
> >>> Calls to memory_type_changed() resulting from changes to the domain
> >>> iomem_caps or ioport_caps ranges are only relevant for EMT
> >>> calculations if the IOMMU is not enabled, and the call has resulted in
> >>> a change to the return value of cache_flush_permitted().
> >>
> >> I'm less certain here: These shouldn't be frequent operations, so
> >> their impact on the guest should be limited?
> > 
> > Citrix has an use case for vGPU where IOMMU regions are added and
> > removed during guest runtime.  Such functionality makes uses of both
> > XEN_DOMCTL_iomem_permission and XEN_DOMCTL_memory_mapping.
> 
> I see. Maybe this would want saying in the description, to express
> that there's little expected benefit for upstream.

I guess any OS that moves BARs around will also trigger such code
paths, but that might not be very common.  I can add something to the
description.

> > While the memory_type_changed() call in XEN_DOMCTL_memory_mapping
> > seems to be the most problematic performance wise, I though it was
> > nice to try to avoid memory_type_changed() as much as possible, as
> > those tax the guest quite heavily with EPT_MISCONFIG faults and the
> > recalculation logic.
> 
> Trying to avoid this is certainly desirable, I agree. But we need
> to make sure that it's not "easy" to break things by touching one
> place but leaving others alone which really would need keeping in
> sync. Therefore I'd see such added logic as acceptable only if the
> risk towards future changes is sufficiently low.
> 
> >> And if we were to restrict the calls, I think we need to clearly
> >> tie together the various places which need updating together in
> >> case e.g. the condition in epte_get_entry_emt() is changed.
> >> Minimally by way of comments, but maybe by way of a small helper
> >> function (for which I can't seem to be able to think of a good
> >> name) sitting next to epte_get_entry_emt().
> > 
> > Such helper function is also kind of problematic, as it would have to
> > live in p2m-ept.c but be used in domctl.c and x86/domctl.c?  It would
> > have to go through the p2m_domain indirection structure.
> 
> It would need abstraction at the arch level as well as for !HVM configs
> on x86. I'm not sure the indirection layer would actually be needed, as
> the contents of the function - despite wanting placing in p2m-ept.c -
> isn't really vendor dependent. (If AMD/SVM gained a need for a similar
> helper, things would nee re-evaluating.)

Maybe it would be better to add the calls to memory_type_changed()
directly in iomem_{permit,deny}_access() and
ioports_{permit,deny}_access itself?

That would also allow to remove the noop Arm memory_type_changed()
halper.

Thanks, Roger.


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] x86/ept: limit calls to memory_type_changed()
  2022-09-26 14:50       ` Roger Pau Monné
@ 2022-09-26 15:25         ` Roger Pau Monné
  2022-09-26 15:36           ` Jan Beulich
  0 siblings, 1 reply; 12+ messages in thread
From: Roger Pau Monné @ 2022-09-26 15:25 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: Jan Beulich, Andrew Cooper, Wei Liu, George Dunlap, Julien Grall,
	Stefano Stabellini, xen-devel

On Mon, Sep 26, 2022 at 04:50:22PM +0200, Roger Pau Monné wrote:
> On Mon, Sep 26, 2022 at 09:33:10AM +0200, Jan Beulich wrote:
> > On 23.09.2022 10:35, Roger Pau Monné wrote:
> > > On Thu, Sep 22, 2022 at 09:21:59PM +0200, Jan Beulich wrote:
> > >> On 22.09.2022 18:05, Roger Pau Monne wrote:
> > >>> memory_type_changed() is currently only implemented for Intel EPT, and
> > >>> results in the invalidation of EMT attributes on all the entries in
> > >>> the EPT page tables.  Such invalidation causes EPT_MISCONFIG vmexits
> > >>> when the guest tries to access any gfns for the first time, which
> > >>> results in the recalculation of the EMT for the accessed page.  The
> > >>> vmexit and the recalculations are expensive, and as such should be
> > >>> avoided when possible.
> > >>>
> > >>> Remove the call to memory_type_changed() from
> > >>> XEN_DOMCTL_memory_mapping: there are no modifications of the
> > >>> iomem_caps ranges anymore that could alter the return of
> > >>> cache_flush_permitted() from that domctl.
> > >>
> > >> I certainly agree - this was an oversight when the two aspects were
> > >> split. One might argue this is a (performance) fix to the earlier
> > >> commit, and hence might want to go on its own with a Fixes: tag.
> > > 
> > > Was wondering myself, didn't add the 'Fixes:' tag because of the extra
> > > content.
> > > 
> > >>> Calls to memory_type_changed() resulting from changes to the domain
> > >>> iomem_caps or ioport_caps ranges are only relevant for EMT
> > >>> calculations if the IOMMU is not enabled, and the call has resulted in
> > >>> a change to the return value of cache_flush_permitted().
> > >>
> > >> I'm less certain here: These shouldn't be frequent operations, so
> > >> their impact on the guest should be limited?
> > > 
> > > Citrix has an use case for vGPU where IOMMU regions are added and
> > > removed during guest runtime.  Such functionality makes uses of both
> > > XEN_DOMCTL_iomem_permission and XEN_DOMCTL_memory_mapping.
> > 
> > I see. Maybe this would want saying in the description, to express
> > that there's little expected benefit for upstream.
> 
> I guess any OS that moves BARs around will also trigger such code
> paths, but that might not be very common.  I can add something to the
> description.
> 
> > > While the memory_type_changed() call in XEN_DOMCTL_memory_mapping
> > > seems to be the most problematic performance wise, I though it was
> > > nice to try to avoid memory_type_changed() as much as possible, as
> > > those tax the guest quite heavily with EPT_MISCONFIG faults and the
> > > recalculation logic.
> > 
> > Trying to avoid this is certainly desirable, I agree. But we need
> > to make sure that it's not "easy" to break things by touching one
> > place but leaving others alone which really would need keeping in
> > sync. Therefore I'd see such added logic as acceptable only if the
> > risk towards future changes is sufficiently low.
> > 
> > >> And if we were to restrict the calls, I think we need to clearly
> > >> tie together the various places which need updating together in
> > >> case e.g. the condition in epte_get_entry_emt() is changed.
> > >> Minimally by way of comments, but maybe by way of a small helper
> > >> function (for which I can't seem to be able to think of a good
> > >> name) sitting next to epte_get_entry_emt().
> > > 
> > > Such helper function is also kind of problematic, as it would have to
> > > live in p2m-ept.c but be used in domctl.c and x86/domctl.c?  It would
> > > have to go through the p2m_domain indirection structure.
> > 
> > It would need abstraction at the arch level as well as for !HVM configs
> > on x86. I'm not sure the indirection layer would actually be needed, as
> > the contents of the function - despite wanting placing in p2m-ept.c -
> > isn't really vendor dependent. (If AMD/SVM gained a need for a similar
> > helper, things would nee re-evaluating.)
> 
> Maybe it would be better to add the calls to memory_type_changed()
> directly in iomem_{permit,deny}_access() and
> ioports_{permit,deny}_access itself?
> 
> That would also allow to remove the noop Arm memory_type_changed()
> halper.

Correction: the Arm memory_type_changed() needs to stay, as
iomem_{permit,deny}_access() is common code.

Regards, Roger.


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] x86/ept: limit calls to memory_type_changed()
  2022-09-26 15:25         ` Roger Pau Monné
@ 2022-09-26 15:36           ` Jan Beulich
  2022-09-26 15:58             ` Roger Pau Monné
  0 siblings, 1 reply; 12+ messages in thread
From: Jan Beulich @ 2022-09-26 15:36 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: Andrew Cooper, Wei Liu, George Dunlap, Julien Grall,
	Stefano Stabellini, xen-devel

On 26.09.2022 17:25, Roger Pau Monné wrote:
> On Mon, Sep 26, 2022 at 04:50:22PM +0200, Roger Pau Monné wrote:
>> On Mon, Sep 26, 2022 at 09:33:10AM +0200, Jan Beulich wrote:
>>> On 23.09.2022 10:35, Roger Pau Monné wrote:
>>>> On Thu, Sep 22, 2022 at 09:21:59PM +0200, Jan Beulich wrote:
>>>>> On 22.09.2022 18:05, Roger Pau Monne wrote:
>>>>> And if we were to restrict the calls, I think we need to clearly
>>>>> tie together the various places which need updating together in
>>>>> case e.g. the condition in epte_get_entry_emt() is changed.
>>>>> Minimally by way of comments, but maybe by way of a small helper
>>>>> function (for which I can't seem to be able to think of a good
>>>>> name) sitting next to epte_get_entry_emt().
>>>>
>>>> Such helper function is also kind of problematic, as it would have to
>>>> live in p2m-ept.c but be used in domctl.c and x86/domctl.c?  It would
>>>> have to go through the p2m_domain indirection structure.
>>>
>>> It would need abstraction at the arch level as well as for !HVM configs
>>> on x86. I'm not sure the indirection layer would actually be needed, as
>>> the contents of the function - despite wanting placing in p2m-ept.c -
>>> isn't really vendor dependent. (If AMD/SVM gained a need for a similar
>>> helper, things would nee re-evaluating.)
>>
>> Maybe it would be better to add the calls to memory_type_changed()
>> directly in iomem_{permit,deny}_access() and
>> ioports_{permit,deny}_access itself?

I'm of two minds - on one hand that would nicely take the call "out of
sight", but otoh this would feel like a layering violation. Yet then
maybe it's a layering violation no matter where that call lives.

>> That would also allow to remove the noop Arm memory_type_changed()
>> halper.
> 
> Correction: the Arm memory_type_changed() needs to stay, as
> iomem_{permit,deny}_access() is common code.

Right, or we'd need some other arch abstraction. (I wonder whether
long term Arm can actually get away without this. Even on the AMD side
of x86 I don't think it's quite right that adding/removing of MMIO
ranges has no effect on the memory type of accesses.)

Jan


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] x86/ept: limit calls to memory_type_changed()
  2022-09-26 15:36           ` Jan Beulich
@ 2022-09-26 15:58             ` Roger Pau Monné
  2022-09-27  6:35               ` Jan Beulich
  0 siblings, 1 reply; 12+ messages in thread
From: Roger Pau Monné @ 2022-09-26 15:58 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Andrew Cooper, Wei Liu, George Dunlap, Julien Grall,
	Stefano Stabellini, xen-devel

On Mon, Sep 26, 2022 at 05:36:41PM +0200, Jan Beulich wrote:
> On 26.09.2022 17:25, Roger Pau Monné wrote:
> > On Mon, Sep 26, 2022 at 04:50:22PM +0200, Roger Pau Monné wrote:
> >> On Mon, Sep 26, 2022 at 09:33:10AM +0200, Jan Beulich wrote:
> >>> On 23.09.2022 10:35, Roger Pau Monné wrote:
> >>>> On Thu, Sep 22, 2022 at 09:21:59PM +0200, Jan Beulich wrote:
> >>>>> On 22.09.2022 18:05, Roger Pau Monne wrote:
> >>>>> And if we were to restrict the calls, I think we need to clearly
> >>>>> tie together the various places which need updating together in
> >>>>> case e.g. the condition in epte_get_entry_emt() is changed.
> >>>>> Minimally by way of comments, but maybe by way of a small helper
> >>>>> function (for which I can't seem to be able to think of a good
> >>>>> name) sitting next to epte_get_entry_emt().
> >>>>
> >>>> Such helper function is also kind of problematic, as it would have to
> >>>> live in p2m-ept.c but be used in domctl.c and x86/domctl.c?  It would
> >>>> have to go through the p2m_domain indirection structure.
> >>>
> >>> It would need abstraction at the arch level as well as for !HVM configs
> >>> on x86. I'm not sure the indirection layer would actually be needed, as
> >>> the contents of the function - despite wanting placing in p2m-ept.c -
> >>> isn't really vendor dependent. (If AMD/SVM gained a need for a similar
> >>> helper, things would nee re-evaluating.)
> >>
> >> Maybe it would be better to add the calls to memory_type_changed()
> >> directly in iomem_{permit,deny}_access() and
> >> ioports_{permit,deny}_access itself?
> 
> I'm of two minds - on one hand that would nicely take the call "out of
> sight", but otoh this would feel like a layering violation. Yet then
> maybe it's a layering violation no matter where that call lives.

Kind of, I think it's slightly better than having the callers take
care of calling memory_type_changed(), and prevents new users of
{iomem,ioports}_{permit,deny}_access() missing the calls to
memory_type_changed().

Let me post what I have with this approach.

> >> That would also allow to remove the noop Arm memory_type_changed()
> >> halper.
> > 
> > Correction: the Arm memory_type_changed() needs to stay, as
> > iomem_{permit,deny}_access() is common code.
> 
> Right, or we'd need some other arch abstraction. (I wonder whether
> long term Arm can actually get away without this. Even on the AMD side
> of x86 I don't think it's quite right that adding/removing of MMIO
> ranges has no effect on the memory type of accesses.)

IIRC there's no way for the hypervisor to infer cache attributes on
AMD SVM for NPT entries, but maybe I'm missing something.  Guest MTRRs
settings are completely ignored for AMD guests.  I'm not able ATM
however to find in the AMD PM how effective cache attributes are
calculated when using NPT however.  I would guess host MTRR + guest
PAT?

Thanks, Roger.


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] x86/ept: limit calls to memory_type_changed()
  2022-09-22 16:05 [PATCH] x86/ept: limit calls to memory_type_changed() Roger Pau Monne
  2022-09-22 19:21 ` Jan Beulich
@ 2022-09-26 18:03 ` Andrew Cooper
  2022-09-27  9:33   ` Roger Pau Monné
  1 sibling, 1 reply; 12+ messages in thread
From: Andrew Cooper @ 2022-09-26 18:03 UTC (permalink / raw)
  To: Roger Pau Monne, xen-devel
  Cc: Jan Beulich, Wei Liu, George Dunlap, Julien Grall, Stefano Stabellini

On 22/09/2022 17:05, Roger Pau Monne wrote:
> memory_type_changed() is currently only implemented for Intel EPT, and
> results in the invalidation of EMT attributes on all the entries in
> the EPT page tables.  Such invalidation causes EPT_MISCONFIG vmexits
> when the guest tries to access any gfns for the first time, which
> results in the recalculation of the EMT for the accessed page.  The
> vmexit and the recalculations are expensive, and as such should be
> avoided when possible.
>
> Remove the call to memory_type_changed() from
> XEN_DOMCTL_memory_mapping: there are no modifications of the
> iomem_caps ranges anymore that could alter the return of
> cache_flush_permitted() from that domctl.
>
> Calls to memory_type_changed() resulting from changes to the domain
> iomem_caps or ioport_caps ranges are only relevant for EMT
> calculations if the IOMMU is not enabled, and the call has resulted in
> a change to the return value of cache_flush_permitted().

This, and the perf problem Citrix have found, is caused by a more
fundamental bug which I identified during XSA-402.

Xen is written with assumption that cacheability other than WB is
dependent on having devices.  While this is perhaps true of current
configurations available, it is a layering violation, and will cease
being true in order to support encrypted RAM (and by extension,
encrypted VMs).

At the moment, we know the IOMMU-ness of a domain right from the outset,
but the cacheability permits are dynamic, based on the non-emptyness of
the domain's device list, ioport list, and various others.

All the memory_type_changed() calls here are to cover the fact that the
original design was buggy by not having the cacheability-ness part of
domain create in the first place.

The appropriate fix, but definitely 4.18 work at this point, is to have
a new CDF flag which permits the use of non-WB cacheability.

For testing purposes alone, turning it on on an otherwise "plain VM" is
useful (its how I actually debugged XSA-402, and the only sane way to go
about investigating the MTRR per disasters for VGPU VMs[1]), but for
regular usecases, it wants cross-checking with the IOMMU flag (and
encrypted VM flag in the future), and for all dynamic list checks to
turn into a simple 'd->config & CDF_full_cacheability'.

This way, we delete all calls to memory_type_changed() which are trying
to cover the various dynamic lists becoming empty/non-empty, and we
remove several ordering-of-hypercalls bugs where non-cacheable mappings
can't actually be created on a VM declared to have an IOMMU until a
device has actually been assigned to start with.

~Andrew

[1] MTRR handling is also buggy with reduced cacheability, causing some
areas of RAM to be used UC; notably the grant table.  This manifests as
PV device perf being worse than qemu-emulated device perf, only when a
GPU is added to a VM[2].  Instead of fixing this properly, it was hacked
around by forcing IPAT=1 for Xenheap pages, which only "fixed" the
problem on Intel (AMD has no equivalent mechanism), and needs reverting
and fixing properly (i.e. get the vMTRR layout working correctly) to
support VMs with encrypted RAM.

[2] There's a second bug with memory_type_changed() in that it causes
dreadful system performance during VM migration, which is something to
do with the interaction of restoring vMTRRs for a VM that has a device
but isn't running yet.  This still needs investigating, and I suspect
it's got a similar root cause.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] x86/ept: limit calls to memory_type_changed()
  2022-09-26 15:58             ` Roger Pau Monné
@ 2022-09-27  6:35               ` Jan Beulich
  2022-09-27  8:40                 ` Roger Pau Monné
  0 siblings, 1 reply; 12+ messages in thread
From: Jan Beulich @ 2022-09-27  6:35 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: Andrew Cooper, Wei Liu, George Dunlap, Julien Grall,
	Stefano Stabellini, xen-devel

On 26.09.2022 17:58, Roger Pau Monné wrote:
> On Mon, Sep 26, 2022 at 05:36:41PM +0200, Jan Beulich wrote:
>> On 26.09.2022 17:25, Roger Pau Monné wrote:
>>> Correction: the Arm memory_type_changed() needs to stay, as
>>> iomem_{permit,deny}_access() is common code.
>>
>> Right, or we'd need some other arch abstraction. (I wonder whether
>> long term Arm can actually get away without this. Even on the AMD side
>> of x86 I don't think it's quite right that adding/removing of MMIO
>> ranges has no effect on the memory type of accesses.)
> 
> IIRC there's no way for the hypervisor to infer cache attributes on
> AMD SVM for NPT entries, but maybe I'm missing something.  Guest MTRRs
> settings are completely ignored for AMD guests.

Right, as documented: "Note that there is no hardware support for guest
MTRRs; the VMM can simulate their effect by altering the memory types
in the nested page tables." That's something we imo should do, but which
I don't think we actually do (see p2m_type_to_flags()). We respect the
PAT bit when splitting large pages, but I don't think we ever set the
bit when making new / updated entries.

>  I'm not able ATM
> however to find in the AMD PM how effective cache attributes are
> calculated when using NPT however.  I would guess host MTRR + guest
> PAT?

First guest and host PAT are combined, then the result is combined with
(host) MTRR. See the tables in the "Nested Paging" sub-section "Combining
Memory Types, MTRRs". Of course things are quite a bit more limited (but
also simpler) in shadow mode.

Jan


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] x86/ept: limit calls to memory_type_changed()
  2022-09-27  6:35               ` Jan Beulich
@ 2022-09-27  8:40                 ` Roger Pau Monné
  0 siblings, 0 replies; 12+ messages in thread
From: Roger Pau Monné @ 2022-09-27  8:40 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Andrew Cooper, Wei Liu, George Dunlap, Julien Grall,
	Stefano Stabellini, xen-devel

On Tue, Sep 27, 2022 at 08:35:20AM +0200, Jan Beulich wrote:
> On 26.09.2022 17:58, Roger Pau Monné wrote:
> > On Mon, Sep 26, 2022 at 05:36:41PM +0200, Jan Beulich wrote:
> >> On 26.09.2022 17:25, Roger Pau Monné wrote:
> >>> Correction: the Arm memory_type_changed() needs to stay, as
> >>> iomem_{permit,deny}_access() is common code.
> >>
> >> Right, or we'd need some other arch abstraction. (I wonder whether
> >> long term Arm can actually get away without this. Even on the AMD side
> >> of x86 I don't think it's quite right that adding/removing of MMIO
> >> ranges has no effect on the memory type of accesses.)
> > 
> > IIRC there's no way for the hypervisor to infer cache attributes on
> > AMD SVM for NPT entries, but maybe I'm missing something.  Guest MTRRs
> > settings are completely ignored for AMD guests.
> 
> Right, as documented: "Note that there is no hardware support for guest
> MTRRs; the VMM can simulate their effect by altering the memory types
> in the nested page tables." That's something we imo should do, but which
> I don't think we actually do (see p2m_type_to_flags()). We respect the
> PAT bit when splitting large pages, but I don't think we ever set the
> bit when making new / updated entries.
> 
> >  I'm not able ATM
> > however to find in the AMD PM how effective cache attributes are
> > calculated when using NPT however.  I would guess host MTRR + guest
> > PAT?
> 
> First guest and host PAT are combined, then the result is combined with
> (host) MTRR. See the tables in the "Nested Paging" sub-section "Combining
> Memory Types, MTRRs". Of course things are quite a bit more limited (but
> also simpler) in shadow mode.

Thanks, so we could indeed do something similar as to what we do for
Intel and set a cache attribute in the nested page tables, at which
point we would need epte_get_entry_emt() to be not EPT specific
anymore.

I've created:

https://gitlab.com/xen-project/xen/-/issues/88

To have some reminder of this pending work, or else I would forget.

Thanks, Roger.


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] x86/ept: limit calls to memory_type_changed()
  2022-09-26 18:03 ` Andrew Cooper
@ 2022-09-27  9:33   ` Roger Pau Monné
  0 siblings, 0 replies; 12+ messages in thread
From: Roger Pau Monné @ 2022-09-27  9:33 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: xen-devel, Jan Beulich, Wei Liu, George Dunlap, Julien Grall,
	Stefano Stabellini

On Mon, Sep 26, 2022 at 06:03:24PM +0000, Andrew Cooper wrote:
> On 22/09/2022 17:05, Roger Pau Monne wrote:
> > memory_type_changed() is currently only implemented for Intel EPT, and
> > results in the invalidation of EMT attributes on all the entries in
> > the EPT page tables.  Such invalidation causes EPT_MISCONFIG vmexits
> > when the guest tries to access any gfns for the first time, which
> > results in the recalculation of the EMT for the accessed page.  The
> > vmexit and the recalculations are expensive, and as such should be
> > avoided when possible.
> >
> > Remove the call to memory_type_changed() from
> > XEN_DOMCTL_memory_mapping: there are no modifications of the
> > iomem_caps ranges anymore that could alter the return of
> > cache_flush_permitted() from that domctl.
> >
> > Calls to memory_type_changed() resulting from changes to the domain
> > iomem_caps or ioport_caps ranges are only relevant for EMT
> > calculations if the IOMMU is not enabled, and the call has resulted in
> > a change to the return value of cache_flush_permitted().
> 
> This, and the perf problem Citrix have found, is caused by a more
> fundamental bug which I identified during XSA-402.
> 
> Xen is written with assumption that cacheability other than WB is
> dependent on having devices.  While this is perhaps true of current
> configurations available, it is a layering violation, and will cease
> being true in order to support encrypted RAM (and by extension,
> encrypted VMs).

I assumed this was done as a performance improvement (or hack).

> At the moment, we know the IOMMU-ness of a domain right from the outset,
> but the cacheability permits are dynamic, based on the non-emptyness of
> the domain's device list, ioport list, and various others.

Well, as long as there's an IOMMU assigned cacheability will be
calculated taking into account the guest MTRR values, and won't
be forced to WB, regardless of the emptiness of the device or IO
ports/memory capability lists.

Just setting `passthrough=1` on the guest config file without any
devices actually passed through should prevent the forcing WB.

The use case of allowing io memory or ports to be assigned to an HVM
guest without an IOMMU has always confused me, but I guess if it's
there it's because it's used by someone.

> All the memory_type_changed() calls here are to cover the fact that the
> original design was buggy by not having the cacheability-ness part of
> domain create in the first place.
> 
> The appropriate fix, but definitely 4.18 work at this point, is to have
> a new CDF flag which permits the use of non-WB cacheability.
> 
> For testing purposes alone, turning it on on an otherwise "plain VM" is
> useful (its how I actually debugged XSA-402, and the only sane way to go
> about investigating the MTRR per disasters for VGPU VMs[1]), 

Wasn't it enough to set `passthrough=1` in order to enable
cacheability for debugging purposes?  (not that I oppose to adding the
flag, just curious why enabling the IOMMU wasn't enough).

> but for
> regular usecases, it wants cross-checking with the IOMMU flag (and
> encrypted VM flag in the future), and for all dynamic list checks to
> turn into a simple 'd->config & CDF_full_cacheability'.
> 
> This way, we delete all calls to memory_type_changed() which are trying
> to cover the various dynamic lists becoming empty/non-empty, and we
> remove several ordering-of-hypercalls bugs where non-cacheable mappings
> can't actually be created on a VM declared to have an IOMMU until a
> device has actually been assigned to start with.

It should be possible with current code to create non-cacheable
mappings on a VM as long as the IOMMU is enabled, regardless of
whether no devices are assigned to the VM.

> ~Andrew
> 
> [1] MTRR handling is also buggy with reduced cacheability, causing some
> areas of RAM to be used UC; notably the grant table.  This manifests as
> PV device perf being worse than qemu-emulated device perf, only when a
> GPU is added to a VM[2].  Instead of fixing this properly, it was hacked
> around by forcing IPAT=1 for Xenheap pages, which only "fixed" the
> problem on Intel (AMD has no equivalent mechanism), and needs reverting
> and fixing properly (i.e. get the vMTRR layout working correctly) to
> support VMs with encrypted RAM.

My understanding of the original problem was slightly different: we
place the grant table in a BAR region of the xenpci device, and hence
the gMTRRs are set to UC by the guest.  The workaround in Xen is to
cope with existing Windows guest drivers not setting the gMTRR values
for the grant table frames to WB.

Even if we calculate all the cache attributes correctly we would still
need the 'hack'.

> 
> [2] There's a second bug with memory_type_changed() in that it causes
> dreadful system performance during VM migration, which is something to
> do with the interaction of restoring vMTRRs for a VM that has a device
> but isn't running yet.  This still needs investigating, and I suspect
> it's got a similar root cause.

XenServer seems to have a custom patch for this:

https://github.com/xenserver/xen.pg/blob/XS-8.3.x/patches/0001-x86-HVM-Avoid-cache-flush-operations-during-hvm_load.patch

But we could likely avoid the flush completely if the VM hasn't been
started yet.

Thanks, Roger.


^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2022-09-27  9:34 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-09-22 16:05 [PATCH] x86/ept: limit calls to memory_type_changed() Roger Pau Monne
2022-09-22 19:21 ` Jan Beulich
2022-09-23  8:35   ` Roger Pau Monné
2022-09-26  7:33     ` Jan Beulich
2022-09-26 14:50       ` Roger Pau Monné
2022-09-26 15:25         ` Roger Pau Monné
2022-09-26 15:36           ` Jan Beulich
2022-09-26 15:58             ` Roger Pau Monné
2022-09-27  6:35               ` Jan Beulich
2022-09-27  8:40                 ` Roger Pau Monné
2022-09-26 18:03 ` Andrew Cooper
2022-09-27  9:33   ` Roger Pau Monné

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).