xen-devel.lists.xenproject.org archive mirror
 help / color / mirror / Atom feed
* Design session "grant v3"
@ 2022-09-22 13:42 Marek Marczykowski-Górecki
  2022-09-22 18:43 ` Jan Beulich
  0 siblings, 1 reply; 7+ messages in thread
From: Marek Marczykowski-Górecki @ 2022-09-22 13:42 UTC (permalink / raw)
  To: xen-devel

[-- Attachment #1: Type: text/plain, Size: 2311 bytes --]

Jürgen: today two grants formats, v1 supports only up to 16TB addresses
        v2 solves 16TB issue, introduces several more features^Wbugs
        v2 is 16 bytes per entry, v1 is 8 bytes per entry, v2 more complicated interface to the hypervisor
        virtio could use per-device grant table, currently virtio iommu device, slow interface
        v3 could be a grants tree (like iommu page tables), not flat array, separate trees for each grantee
        could support sharing large pages too
        easier to have more grants, continuous grant numbers etc
        two options to distingush trees (from HV PoV):
        - sharing guest ensure distinct grant ids between (multiple) trees
        - hv tells guest index under tree got registered
        v3 can be addition to v1/v2, old used for simpler cases where tree is an overkill
        hypervisor needs extra memory to keep refcounts - resource allocation discussion
        hv could have TLB to speedup mapping
        issue with v1/v2 - granter cannot revoke pages from uncooperating backend
        tree could have special page for revoking grants (redirect to that page)
        special domids, local to the guest, toolstack restaring backend could request to keep the same virtual domid
Marek:  that requires stateless (or recoverable) protocol, reusing domid currently causes issues
Andrei: how revoking could work
Jürgen: there needs to be hypercall, replacing and invalidating mapping (scan page tables?), possibly adjusting IOMMU etc; may fail, problematic for PV

Yann:   can backend refuse revoking?
Jürgen: it shouldn't be this way, but revoke could be controlled by feature flag; revoke could pass scratch page per revoke call (more flexible control)

Marek:  what about unmap notification?
Jürgen: revoke could even be async; ring page for unmap notifications

Marek:  downgrading mappings (rw -> ro)
Jürgen: must be careful, to not allow crashing backend

Jürgen: we should consider interface to mapping large pages ("map this area as a large page if backend shared it as large page")

Edwin:  what happens when shattering that large page?
Jürgen: on live migration pages are rebuilt anyway, can reconstruct large pages


-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Design session "grant v3"
  2022-09-22 13:42 Design session "grant v3" Marek Marczykowski-Górecki
@ 2022-09-22 18:43 ` Jan Beulich
  2022-09-23  9:31   ` Juergen Gross
  0 siblings, 1 reply; 7+ messages in thread
From: Jan Beulich @ 2022-09-22 18:43 UTC (permalink / raw)
  To: Juergen Gross; +Cc: Marek Marczykowski-Górecki, xen-devel

On 22.09.2022 15:42, Marek Marczykowski-Górecki wrote:
> Jürgen: today two grants formats, v1 supports only up to 16TB addresses
>         v2 solves 16TB issue, introduces several more features^Wbugs
>         v2 is 16 bytes per entry, v1 is 8 bytes per entry, v2 more complicated interface to the hypervisor
>         virtio could use per-device grant table, currently virtio iommu device, slow interface
>         v3 could be a grants tree (like iommu page tables), not flat array, separate trees for each grantee
>         could support sharing large pages too
>         easier to have more grants, continuous grant numbers etc
>         two options to distingush trees (from HV PoV):
>         - sharing guest ensure distinct grant ids between (multiple) trees
>         - hv tells guest index under tree got registered
>         v3 can be addition to v1/v2, old used for simpler cases where tree is an overkill
>         hypervisor needs extra memory to keep refcounts - resource allocation discussion

How would refcounts be different from today? Perhaps I don't have a clear
enough picture yet how you envision the tree-like structure(s) to be used.

>         hv could have TLB to speedup mapping
>         issue with v1/v2 - granter cannot revoke pages from uncooperating backend
>         tree could have special page for revoking grants (redirect to that page)
>         special domids, local to the guest, toolstack restaring backend could request to keep the same virtual domid
> Marek:  that requires stateless (or recoverable) protocol, reusing domid currently causes issues
> Andrei: how revoking could work
> Jürgen: there needs to be hypercall, replacing and invalidating mapping (scan page tables?), possibly adjusting IOMMU etc; may fail, problematic for PV

Why would this be problematic for PV only? In principle any
number of mappings of a grant are possible also for PVH/HVM. So
all of them would need finding and replacing. Because of the
multiple mappings, the M2P is of no use here.

While thinking about this I started wondering in how far things
are actually working correctly right now for backends in PVH/HVM:
Any mapping of a grant is handed to p2m_add_page(), which insists
on there being exactly one mapping of any particular MFN, unless
the page is a foreign one. But how does that allow a domain to
map its own grants, e.g. when block-attaching a device locally in
Dom0? Afaict the grant-map would succeed, but the page would be
unmapped from its original GFN.

> Yann:   can backend refuse revoking?
> Jürgen: it shouldn't be this way, but revoke could be controlled by feature flag; revoke could pass scratch page per revoke call (more flexible control)

A single scratch page comes with the risk of data corruption, as all
I/O would be directed there. A sink page (for memory writes) would
likely be okay, but device writes (memory reads) can't be done from
a surrogate page.

> Marek:  what about unmap notification?
> Jürgen: revoke could even be async; ring page for unmap notifications
> 
> Marek:  downgrading mappings (rw -> ro)
> Jürgen: must be careful, to not allow crashing backend
> 
> Jürgen: we should consider interface to mapping large pages ("map this area as a large page if backend shared it as large page")

s/backend/frontend/ I guess?

> Edwin:  what happens when shattering that large page?
> Jürgen: on live migration pages are rebuilt anyway, can reconstruct large pages

If only we did already rebuild large pages ...

Jan


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Design session "grant v3"
  2022-09-22 18:43 ` Jan Beulich
@ 2022-09-23  9:31   ` Juergen Gross
  2022-09-26  6:57     ` Jan Beulich
  0 siblings, 1 reply; 7+ messages in thread
From: Juergen Gross @ 2022-09-23  9:31 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Marek Marczykowski-Górecki, xen-devel


[-- Attachment #1.1.1: Type: text/plain, Size: 4969 bytes --]

On 22.09.22 20:43, Jan Beulich wrote:
> On 22.09.2022 15:42, Marek Marczykowski-Górecki wrote:
>> Jürgen: today two grants formats, v1 supports only up to 16TB addresses
>>          v2 solves 16TB issue, introduces several more features^Wbugs
>>          v2 is 16 bytes per entry, v1 is 8 bytes per entry, v2 more complicated interface to the hypervisor
>>          virtio could use per-device grant table, currently virtio iommu device, slow interface
>>          v3 could be a grants tree (like iommu page tables), not flat array, separate trees for each grantee
>>          could support sharing large pages too
>>          easier to have more grants, continuous grant numbers etc
>>          two options to distingush trees (from HV PoV):
>>          - sharing guest ensure distinct grant ids between (multiple) trees
>>          - hv tells guest index under tree got registered
>>          v3 can be addition to v1/v2, old used for simpler cases where tree is an overkill
>>          hypervisor needs extra memory to keep refcounts - resource allocation discussion
> 
> How would refcounts be different from today? Perhaps I don't have a clear
> enough picture yet how you envision the tree-like structure(s) to be used.

What was meant here are the additional resources the hypervisor will need for
higher grant counts of a guest. With the tree approach the number of grant
frames will be basically controlled by the guest and imposing a limit like
today wouldn't work very well (especially with the current default of only
64 grant frames).

> 
>>          hv could have TLB to speedup mapping
>>          issue with v1/v2 - granter cannot revoke pages from uncooperating backend
>>          tree could have special page for revoking grants (redirect to that page)
>>          special domids, local to the guest, toolstack restaring backend could request to keep the same virtual domid
>> Marek:  that requires stateless (or recoverable) protocol, reusing domid currently causes issues
>> Andrei: how revoking could work
>> Jürgen: there needs to be hypercall, replacing and invalidating mapping (scan page tables?), possibly adjusting IOMMU etc; may fail, problematic for PV
> 
> Why would this be problematic for PV only? In principle any
> number of mappings of a grant are possible also for PVH/HVM. So
> all of them would need finding and replacing. Because of the
> multiple mappings, the M2P is of no use here.

It is an additional layer in the PV case: even when mapping a foreign
page to only a single local PFN there could be multiple PTEs referencing
it.

I didn't think of the problem doing multiple mappings of the same grant.
I will look into that.

> While thinking about this I started wondering in how far things
> are actually working correctly right now for backends in PVH/HVM:
> Any mapping of a grant is handed to p2m_add_page(), which insists
> on there being exactly one mapping of any particular MFN, unless
> the page is a foreign one. But how does that allow a domain to
> map its own grants, e.g. when block-attaching a device locally in
> Dom0? Afaict the grant-map would succeed, but the page would be
> unmapped from its original GFN.
> 
>> Yann:   can backend refuse revoking?
>> Jürgen: it shouldn't be this way, but revoke could be controlled by feature flag; revoke could pass scratch page per revoke call (more flexible control)
> 
> A single scratch page comes with the risk of data corruption, as all
> I/O would be directed there. A sink page (for memory writes) would
> likely be okay, but device writes (memory reads) can't be done from
> a surrogate page.

I don't see that problem.

In case the grant is revoked due to a malicious/buggy backend, you can't
trust the I/O data anyway.

And in case the frontend is revoking the grant because the frontend is
malicious, this isn't an issue either.

> 
>> Marek:  what about unmap notification?
>> Jürgen: revoke could even be async; ring page for unmap notifications
>>
>> Marek:  downgrading mappings (rw -> ro)
>> Jürgen: must be careful, to not allow crashing backend
>>
>> Jürgen: we should consider interface to mapping large pages ("map this area as a large page if backend shared it as large page")
> 
> s/backend/frontend/ I guess?

Yes.

But large pages have another downside: The backend needs to know it is a large
page, otherwise it might get confused. So while this sounds like a nice idea, it
is cumbersome in practice. But maybe someone is coming up with a nice idea how
to solve that.

> 
>> Edwin:  what happens when shattering that large page?
>> Jürgen: on live migration pages are rebuilt anyway, can reconstruct large pages
> 
> If only we did already rebuild large pages ...

Indeed. But OTOH shattering shouldn't be a problem at least for PVH/HVM guests,
as we are speaking of gfns here. And PV guests don't have large pages anyway.


Juergen

[-- Attachment #1.1.2: OpenPGP public key --]
[-- Type: application/pgp-keys, Size: 3149 bytes --]

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 495 bytes --]

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Design session "grant v3"
  2022-09-23  9:31   ` Juergen Gross
@ 2022-09-26  6:57     ` Jan Beulich
  2022-09-26  7:04       ` Juergen Gross
  0 siblings, 1 reply; 7+ messages in thread
From: Jan Beulich @ 2022-09-26  6:57 UTC (permalink / raw)
  To: Juergen Gross; +Cc: Marek Marczykowski-Górecki, xen-devel

On 23.09.2022 11:31, Juergen Gross wrote:
> On 22.09.22 20:43, Jan Beulich wrote:
>> On 22.09.2022 15:42, Marek Marczykowski-Górecki wrote:
>>> Yann:   can backend refuse revoking?
>>> Jürgen: it shouldn't be this way, but revoke could be controlled by feature flag; revoke could pass scratch page per revoke call (more flexible control)
>>
>> A single scratch page comes with the risk of data corruption, as all
>> I/O would be directed there. A sink page (for memory writes) would
>> likely be okay, but device writes (memory reads) can't be done from
>> a surrogate page.
> 
> I don't see that problem.
> 
> In case the grant is revoked due to a malicious/buggy backend, you can't
> trust the I/O data anyway.

I agree for the malicious case, but I'm less certain when is come to
buggy backends. Some bugs (like not unmapping a grant) aren't putting
the data at risk.

>>> Jürgen: we should consider interface to mapping large pages ("map this area as a large page if backend shared it as large page")
>>
>> s/backend/frontend/ I guess?
> 
> Yes.
> 
> But large pages have another downside: The backend needs to know it is a large
> page, otherwise it might get confused. So while this sounds like a nice idea, it
> is cumbersome in practice. But maybe someone is coming up with a nice idea how
> to solve that.

Couldn't that simply be a new GTF_superpage flag, with the size
encoded along the lines of AMD IOMMUs encode superpages (setting all
but the top-most of the bits not used for the actual frame address)
in the address part of the entry?

Jan


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Design session "grant v3"
  2022-09-26  6:57     ` Jan Beulich
@ 2022-09-26  7:04       ` Juergen Gross
  2022-09-26  7:23         ` Jan Beulich
  0 siblings, 1 reply; 7+ messages in thread
From: Juergen Gross @ 2022-09-26  7:04 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Marek Marczykowski-Górecki, xen-devel


[-- Attachment #1.1.1: Type: text/plain, Size: 2048 bytes --]

On 26.09.22 08:57, Jan Beulich wrote:
> On 23.09.2022 11:31, Juergen Gross wrote:
>> On 22.09.22 20:43, Jan Beulich wrote:
>>> On 22.09.2022 15:42, Marek Marczykowski-Górecki wrote:
>>>> Yann:   can backend refuse revoking?
>>>> Jürgen: it shouldn't be this way, but revoke could be controlled by feature flag; revoke could pass scratch page per revoke call (more flexible control)
>>>
>>> A single scratch page comes with the risk of data corruption, as all
>>> I/O would be directed there. A sink page (for memory writes) would
>>> likely be okay, but device writes (memory reads) can't be done from
>>> a surrogate page.
>>
>> I don't see that problem.
>>
>> In case the grant is revoked due to a malicious/buggy backend, you can't
>> trust the I/O data anyway.
> 
> I agree for the malicious case, but I'm less certain when is come to
> buggy backends. Some bugs (like not unmapping a grant) aren't putting
> the data at risk.

In case the data page can't be used for anything else, what would be the
point of revoking the grant? The page would leak in both cases (revoking
or not).

> 
>>>> Jürgen: we should consider interface to mapping large pages ("map this area as a large page if backend shared it as large page")
>>>
>>> s/backend/frontend/ I guess?
>>
>> Yes.
>>
>> But large pages have another downside: The backend needs to know it is a large
>> page, otherwise it might get confused. So while this sounds like a nice idea, it
>> is cumbersome in practice. But maybe someone is coming up with a nice idea how
>> to solve that.
> 
> Couldn't that simply be a new GTF_superpage flag, with the size
> encoded along the lines of AMD IOMMUs encode superpages (setting all
> but the top-most of the bits not used for the actual frame address)
> in the address part of the entry?

Of course that would be possible, but using the feature would be limited
to backends having been modified to test that new flag. In the end both
sides would need to negotiate the feature usability.


Juergen

[-- Attachment #1.1.2: OpenPGP public key --]
[-- Type: application/pgp-keys, Size: 3149 bytes --]

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 495 bytes --]

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Design session "grant v3"
  2022-09-26  7:04       ` Juergen Gross
@ 2022-09-26  7:23         ` Jan Beulich
  2022-09-26  9:19           ` Juergen Gross
  0 siblings, 1 reply; 7+ messages in thread
From: Jan Beulich @ 2022-09-26  7:23 UTC (permalink / raw)
  To: Juergen Gross; +Cc: Marek Marczykowski-Górecki, xen-devel

On 26.09.2022 09:04, Juergen Gross wrote:
> On 26.09.22 08:57, Jan Beulich wrote:
>> On 23.09.2022 11:31, Juergen Gross wrote:
>>> On 22.09.22 20:43, Jan Beulich wrote:
>>>> On 22.09.2022 15:42, Marek Marczykowski-Górecki wrote:
>>>>> Yann:   can backend refuse revoking?
>>>>> Jürgen: it shouldn't be this way, but revoke could be controlled by feature flag; revoke could pass scratch page per revoke call (more flexible control)
>>>>
>>>> A single scratch page comes with the risk of data corruption, as all
>>>> I/O would be directed there. A sink page (for memory writes) would
>>>> likely be okay, but device writes (memory reads) can't be done from
>>>> a surrogate page.
>>>
>>> I don't see that problem.
>>>
>>> In case the grant is revoked due to a malicious/buggy backend, you can't
>>> trust the I/O data anyway.
>>
>> I agree for the malicious case, but I'm less certain when is come to
>> buggy backends. Some bugs (like not unmapping a grant) aren't putting
>> the data at risk.
> 
> In case the data page can't be used for anything else, what would be the
> point of revoking the grant? The page would leak in both cases (revoking
> or not).

Sure, but don't you agree it would be better for the guest to have a way
to cleanly shut down in case it notices a misbehaving backend, rather
than having its data corrupted in the process? Of course a guest won't
be able to tell malicious from buggy, but what to do in such a case
ought to be a guest policy, not behavior forced upon it from the outside.
Then again I guess "pass scratch page per revoke call" is meant to cover
that already, i.e. leaving it to the guest how to actually deal with a
failed revoke.

>>>>> Jürgen: we should consider interface to mapping large pages ("map this area as a large page if backend shared it as large page")
>>>>
>>>> s/backend/frontend/ I guess?
>>>
>>> Yes.
>>>
>>> But large pages have another downside: The backend needs to know it is a large
>>> page, otherwise it might get confused. So while this sounds like a nice idea, it
>>> is cumbersome in practice. But maybe someone is coming up with a nice idea how
>>> to solve that.
>>
>> Couldn't that simply be a new GTF_superpage flag, with the size
>> encoded along the lines of AMD IOMMUs encode superpages (setting all
>> but the top-most of the bits not used for the actual frame address)
>> in the address part of the entry?
> 
> Of course that would be possible, but using the feature would be limited
> to backends having been modified to test that new flag. In the end both
> sides would need to negotiate the feature usability.

Isn't it to be expected that this might need negotiating? Strictly speaking
it might not need to be: The backend's map request (for a sufficiently
large number of grants all in one go) could be checked for being all
contiguous in the applicably address spaces. That wouldn't require the
frontend to advertise anything. But an unaware frontend wouldn't be very
likely to produce suitable I/O requests in the first place, I suspect.

Jan


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Design session "grant v3"
  2022-09-26  7:23         ` Jan Beulich
@ 2022-09-26  9:19           ` Juergen Gross
  0 siblings, 0 replies; 7+ messages in thread
From: Juergen Gross @ 2022-09-26  9:19 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Marek Marczykowski-Górecki, xen-devel


[-- Attachment #1.1.1: Type: text/plain, Size: 3434 bytes --]

On 26.09.22 09:23, Jan Beulich wrote:
> On 26.09.2022 09:04, Juergen Gross wrote:
>> On 26.09.22 08:57, Jan Beulich wrote:
>>> On 23.09.2022 11:31, Juergen Gross wrote:
>>>> On 22.09.22 20:43, Jan Beulich wrote:
>>>>> On 22.09.2022 15:42, Marek Marczykowski-Górecki wrote:
>>>>>> Yann:   can backend refuse revoking?
>>>>>> Jürgen: it shouldn't be this way, but revoke could be controlled by feature flag; revoke could pass scratch page per revoke call (more flexible control)
>>>>>
>>>>> A single scratch page comes with the risk of data corruption, as all
>>>>> I/O would be directed there. A sink page (for memory writes) would
>>>>> likely be okay, but device writes (memory reads) can't be done from
>>>>> a surrogate page.
>>>>
>>>> I don't see that problem.
>>>>
>>>> In case the grant is revoked due to a malicious/buggy backend, you can't
>>>> trust the I/O data anyway.
>>>
>>> I agree for the malicious case, but I'm less certain when is come to
>>> buggy backends. Some bugs (like not unmapping a grant) aren't putting
>>> the data at risk.
>>
>> In case the data page can't be used for anything else, what would be the
>> point of revoking the grant? The page would leak in both cases (revoking
>> or not).
> 
> Sure, but don't you agree it would be better for the guest to have a way
> to cleanly shut down in case it notices a misbehaving backend, rather
> than having its data corrupted in the process? Of course a guest won't
> be able to tell malicious from buggy, but what to do in such a case
> ought to be a guest policy, not behavior forced upon it from the outside.

It could (based on its policy) either revoke or not.

> Then again I guess "pass scratch page per revoke call" is meant to cover
> that already, i.e. leaving it to the guest how to actually deal with a
> failed revoke.

Correct.

> 
>>>>>> Jürgen: we should consider interface to mapping large pages ("map this area as a large page if backend shared it as large page")
>>>>>
>>>>> s/backend/frontend/ I guess?
>>>>
>>>> Yes.
>>>>
>>>> But large pages have another downside: The backend needs to know it is a large
>>>> page, otherwise it might get confused. So while this sounds like a nice idea, it
>>>> is cumbersome in practice. But maybe someone is coming up with a nice idea how
>>>> to solve that.
>>>
>>> Couldn't that simply be a new GTF_superpage flag, with the size
>>> encoded along the lines of AMD IOMMUs encode superpages (setting all
>>> but the top-most of the bits not used for the actual frame address)
>>> in the address part of the entry?
>>
>> Of course that would be possible, but using the feature would be limited
>> to backends having been modified to test that new flag. In the end both
>> sides would need to negotiate the feature usability.
> 
> Isn't it to be expected that this might need negotiating? Strictly speaking
> it might not need to be: The backend's map request (for a sufficiently
> large number of grants all in one go) could be checked for being all
> contiguous in the applicably address spaces. That wouldn't require the
> frontend to advertise anything. But an unaware frontend wouldn't be very
> likely to produce suitable I/O requests in the first place, I suspect.

Yes. And an unaware backend wouldn't be very likely to map 512 grants in one
go for making use of the large page without intending to do so.


Juergen

[-- Attachment #1.1.2: OpenPGP public key --]
[-- Type: application/pgp-keys, Size: 3149 bytes --]

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 495 bytes --]

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2022-09-26  9:19 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-09-22 13:42 Design session "grant v3" Marek Marczykowski-Górecki
2022-09-22 18:43 ` Jan Beulich
2022-09-23  9:31   ` Juergen Gross
2022-09-26  6:57     ` Jan Beulich
2022-09-26  7:04       ` Juergen Gross
2022-09-26  7:23         ` Jan Beulich
2022-09-26  9:19           ` Juergen Gross

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).