* Proposal for physical address based hypercalls
@ 2022-09-28 10:38 Jan Beulich
  2022-09-28 10:58 ` Andrew Cooper
                   ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread
From: Jan Beulich @ 2022-09-28 10:38 UTC (permalink / raw)
  To: xen-devel

For quite some time we've been talking about replacing the present virtual
address based hypercall interface with one using physical addresses.  This is in
particular a prerequisite to being able to support guests with encrypted
memory, as for such guests we cannot perform the page table walks necessary to
translate virtual to (guest-)physical addresses.  But using (guest) physical
addresses is also expected to help performance of non-PV guests (i.e. all Arm
ones plus HVM/PVH on x86), because of the no longer necessary address
translation.

Clearly to be able to run existing guests, we need to continue to support the
present virtual address based interface.  Previously it was suggested to change
the model on a per-domain basis, perhaps by a domain creation control.  This
has two major shortcomings:
 - Entire guest OSes would need to switch over to the new model all in one go.
   This could be particularly problematic for in-guest interfaces like Linux'es
   privcmd driver, which is passed hypercall arguments from user space.  Such
   arguments necessarily use virtual addresses, and hence the kernel would need to learn
   of all hypercalls legitimately coming in, in order to translate the buffer
   addresses.  Reaching sufficient coverage there might take some time.
 - All base components within an individual guest instance which might run in
   succession (firmware, boot loader, kernel, kexec) would need to agree on the
   hypercall ABI to use.

As an alternative I'd like to propose the introduction of a bit (or multiple
ones, see below) augmenting the hypercall number, to control the flavor of the
buffers used for every individual hypercall.  This would likely involve the
introduction of a new hypercall page (or multiple ones if more than one bit is
to be used), to retain the present abstraction where it is the hypervisor which
actually fills these pages.  For multicalls the wrapping multicall itself would
be controlled independently of the constituent hypercalls.
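
(Purely to illustrate the encoding idea - the flag name, its bit position, and
the hypercall2() wrapper below are invented for this sketch, not existing ABI:)

  /* One spare bit in the hypercall number selects the buffer flavor. */
  #define HYPERCALL_FLAT_GPA  (1UL << 31)  /* buffers hold guest physical addresses */

  static long memory_op_gpa(unsigned int cmd, void *arg)
  {
      /* Same hypercall number space; an unmodified caller simply never
       * sets the bit and keeps passing virtual addresses as today. */
      return hypercall2(__HYPERVISOR_memory_op | HYPERCALL_FLAT_GPA,
                        cmd, virt_to_phys(arg));
  }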

A model involving just a single bit to indicate "flat" buffers has limitations
when it comes to large buffers passed to a hypercall.  Since in many cases
hypercalls (currently) allowing for rather large buffers wouldn't normally be
used with buffers significantly larger than a single page (several of the
mem-ops for example), special casing the (presumably) few hypercalls which have
an actual need for large buffers might be an option.

Another approach would be to build in a scatter/gather model for buffers right
away.  Jürgen suggests that the low two address bits could be used as a
"descriptor" here.  Alternatively, since buffer sizes should always be known,
using a multi-bit augmentation to the hypercall number could also be a viable
model, distinguishing between e.g. all-linear buffers, all-single-S/G-level
ones, and size-dependent selection of zero or more S/G levels.  This would
affect all buffers used by a single hypercall.  With the level of indirection
needed derivable from buffer size, in the last of the variants small buffers
could still have their addresses provided directly while only larger buffers
would be described by e.g. a list of GFNs or a list of (address,length) tuples,
using multiple levels if even that list would still end up large.
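
(A sketch of the low-address-bits descriptor idea, with invented encodings -
nothing here is meant as a concrete proposal yet:)

  /* Low two bits of an otherwise page-aligned buffer address: */
  #define HCBUF_DIRECT    0x0UL  /* address is the (small) buffer itself */
  #define HCBUF_GFN_LIST  0x1UL  /* address points at an array of GFNs   */
  #define HCBUF_SG_LIST   0x2UL  /* address points at (addr,len) tuples  */
  #define HCBUF_MODE_MASK 0x3UL

  static inline unsigned long hcbuf_mode(unsigned long gpa)
  {
      return gpa & HCBUF_MODE_MASK;
  }

  static inline unsigned long hcbuf_addr(unsigned long gpa)
  {
      return gpa & ~HCBUF_MODE_MASK;
  }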

Of course any one of the models could be selected as the only one to use (in
addition to the existing virtual address based one), allowing us to stick to a
single bit augmenting the hypercall number.

Note that a dynamic model (indirection levels derived from buffer size) would
be quite impactful, as the overall buffer size would need passing to the
copying helpers alongside the size of the data which actually is to be copied.
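
(Sketch of what that would mean for the helpers - today's form next to a
hypothetical extended one; the second name is made up for illustration:)

  /* today: */
  copy_from_guest_offset(dst, hnd, off, nr);
  /* dynamic model: the total buffer size tells the helper how many
   * levels of indirection the handle actually uses: */
  copy_from_guest_sized(dst, hnd, off, nr, total_buf_size);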

How to express S/G lists will want to take into account existing uses.  For
example, an array of (address,length) tuples would be quite inefficient to use
with operations like copy_from_guest_offset().  Perhaps this would want to be
an array of xen_ulong_t, with the first slot holding the offset into the first
page and all further slots holding GFNs (albeit that would still require two
[generally] discontiguous reads from the array for a single
copy_from_guest_offset()).  Otoh, since calling code will need changing anyway
to use this new model, we might also require that such indirectly specified
buffers are page-aligned.
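
(For clarity, the layout being discussed would look roughly like the below;
the helper names are made up and merely show why the two discontiguous reads
mentioned above would be needed:)

  /* desc[0]   = offset of the buffer start within its first page,
   * desc[1..] = GFNs of the consecutive pages backing the buffer. */
  static unsigned long desc_gfn(const xen_ulong_t *desc, size_t byte_off)
  {
      return desc[1 + (desc[0] + byte_off) / PAGE_SIZE];
  }

  static size_t desc_page_offset(const xen_ulong_t *desc, size_t byte_off)
  {
      return (desc[0] + byte_off) & (PAGE_SIZE - 1);
  }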

Virtual addresses will continue to be used in certain places.  Such addresses
aren't normally expressed via handles, e.g. callback or exception handling entry
points.

Jan



* Re: Proposal for physical address based hypercalls
  2022-09-28 10:38 Proposal for physical address based hypercalls Jan Beulich
@ 2022-09-28 10:58 ` Andrew Cooper
  2022-09-28 12:06   ` Jan Beulich
  2022-09-28 13:32 ` dpsmith.dev
  2022-10-04  9:38 ` Julien Grall
  2 siblings, 1 reply; 14+ messages in thread
From: Andrew Cooper @ 2022-09-28 10:58 UTC (permalink / raw)
  To: Jan Beulich, xen-devel

On 28/09/2022 11:38, Jan Beulich wrote:
> As an alternative I'd like to propose the introduction of a bit (or multiple
> ones, see below) augmenting the hypercall number, to control the flavor of the
> buffers used for every individual hypercall.  This would likely involve the
> introduction of a new hypercall page (or multiple ones if more than one bit is
> to be used), to retain the present abstraction where it is the hypervisor which
> actually fills these pages.

There are other concerns which need to be accounted for.

Encrypted VMs cannot use a hypercall page; they don't trust the
hypervisor in the first place, and the hypercall page is (specifically)
code injection.  So the sensible new ABI cannot depend on a hypercall table.

Also, rewriting the hypercall page on migrate turns out not to have been
the most clever idea, and only works right now because the instructions
are the same length in the variations for each mode.

Also continuations need to change to avoid userspace liveness problems,
and existing hypercalls that we do have need splitting between things
which are actually privileged operations (within the guest context) and
things which are logical control operations, so the kernel can expose
the latter to userspace without retaining the gaping root hole which is
/dev/xen/privcmd, and a blocker to doing UEFI Secureboot.

So yes, starting some new clean(er) interface from hypercall 64 is the
plan, but it very much does not want to be a simple mirror of the
existing 0-63 with a differing calling convention.

~Andrew


* Re: Proposal for physical address based hypercalls
  2022-09-28 10:58 ` Andrew Cooper
@ 2022-09-28 12:06   ` Jan Beulich
  2022-09-28 13:03     ` Juergen Gross
  0 siblings, 1 reply; 14+ messages in thread
From: Jan Beulich @ 2022-09-28 12:06 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: xen-devel

On 28.09.2022 12:58, Andrew Cooper wrote:
> On 28/09/2022 11:38, Jan Beulich wrote:
>> As an alternative I'd like to propose the introduction of a bit (or multiple
>> ones, see below) augmenting the hypercall number, to control the flavor of the
>> buffers used for every individual hypercall.  This would likely involve the
>> introduction of a new hypercall page (or multiple ones if more than one bit is
>> to be used), to retain the present abstraction where it is the hypervisor which
>> actually fills these pages.
> 
> There are other concerns which need to be accounted for.
> 
> Encrypted VMs cannot use a hypercall page; they don't trust the
> hypervisor in the first place, and the hypercall page is (specifically)
> code injection.  So the sensible new ABI cannot depend on a hypercall table.

I don't think there's a dependency, and I think there never really has been.
We've been advocating for its use, but we've not enforced that anywhere, I
don't think.

> Also, rewriting the hypercall page on migrate turns out not to have been
> the most clever idea, and only works right now because the instructions
> are the same length in the variations for each mode.
> 
> Also continuations need to change to avoid userspace liveness problems,
> and existing hypercalls that we do have need splitting between things
> which are actually privileged operations (within the guest context) and
> things which are logical control operations, so the kernel can expose
> the latter to userspace without retaining the gaping root hole which is
> /dev/xen/privcmd, and a blocker to doing UEFI Secureboot.
> 
> So yes, starting some new clean(er) interface from hypercall 64 is the
> plan, but it very much does not want to be a simple mirror of the
> existing 0-63 with a differing calling convention.

All of these look like orthogonal problems to me. That's likely all
relevant for, as I think you've been calling it, ABI v2, but shouldn't
hinder our switching to a physical address based hypercall model.
Otherwise I'm afraid we'll never make any progress in that direction.

Jan



* Re: Proposal for physical address based hypercalls
  2022-09-28 12:06   ` Jan Beulich
@ 2022-09-28 13:03     ` Juergen Gross
  2022-09-29 10:16       ` Wei Chen
  2022-09-29 11:32       ` Jan Beulich
  0 siblings, 2 replies; 14+ messages in thread
From: Juergen Gross @ 2022-09-28 13:03 UTC (permalink / raw)
  To: Jan Beulich, Andrew Cooper; +Cc: xen-devel



On 28.09.22 14:06, Jan Beulich wrote:
> On 28.09.2022 12:58, Andrew Cooper wrote:
>> On 28/09/2022 11:38, Jan Beulich wrote:
>>> As an alternative I'd like to propose the introduction of a bit (or multiple
>>> ones, see below) augmenting the hypercall number, to control the flavor of the
>>> buffers used for every individual hypercall.  This would likely involve the
>>> introduction of a new hypercall page (or multiple ones if more than one bit is
>>> to be used), to retain the present abstraction where it is the hypervisor which
>>> actually fills these pages.
>>
>> There are other concerns which need to be accounted for.
>>
>> Encrypted VMs cannot use a hypercall page; they don't trust the
>> hypervisor in the first place, and the hypercall page is (specifically)
>> code injection.  So the sensible new ABI cannot depend on a hypercall table.
> 
> I don't think there's a dependency, and I think there never really has been.
> We've been advocating for its use, but we've not enforced that anywhere, I
> don't think.
> 
>> Also, rewriting the hypercall page on migrate turns out not to have been
>> the most clever idea, and only works right now because the instructions
>> are the same length in the variations for each mode.
>>
>> Also continuations need to change to avoid userspace liveness problems,
>> and existing hypercalls that we do have need splitting between things
>> which are actually privileged operations (within the guest context) and
>> things which are logical control operations, so the kernel can expose
>> the latter to userspace without retaining the gaping root hole which is
>> /dev/xen/privcmd, and a blocker to doing UEFI Secureboot.
>>
>> So yes, starting some new clean(er) interface from hypercall 64 is the
>> plan, but it very much does not want to be a simple mirror of the
>> existing 0-63 with a differing calling convention.
> 
> All of these look like orthogonal problems to me. That's likely all
> relevant for, as I think you've been calling it, ABI v2, but shouldn't
> hinder our switching to a physical address based hypercall model.
> Otherwise I'm afraid we'll never make any progress in that direction.

What about an alternative model that allows most of the current hypercalls to
be used unmodified?

We could add a new hypercall for registering hypercall buffers via
virtual address, physical address, and size of the buffers (kind of a
software TLB). The buffer table would want to be physically addressed
by the hypercall, of course.

It might be interesting to have this table per vcpu (it should be
allowed to use the same table for multiple vcpus) in order to speed
up finding translation entries of percpu buffers.

Any hypercall buffer being addressed virtually could first be looked up
in the SW-TLB. This wouldn't require any changes for most
of the hypercall interfaces. Only special cases with very large buffers
might need indirect variants (like Jan said: via GFN lists, which could
be passed in registered buffers).
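
(To make the idea a little more concrete, a rough sketch of what a
registration could look like - the structure, the sub-op and
HYPERVISOR_hcbuf_op() are all made-up names, not an existing interface:)

  struct xen_hcbuf_reg {
      uint64_t vaddr;  /* guest virtual start of the buffer  */
      uint64_t gaddr;  /* guest physical start of the buffer */
      uint64_t size;   /* length in bytes                    */
  };

  static int register_hcall_buf(void *buf, size_t size)
  {
      struct xen_hcbuf_reg reg = {
          .vaddr = (uint64_t)(unsigned long)buf,
          .gaddr = virt_to_phys(buf),
          .size  = size,
      };

      /* Registered once (e.g. at boot) per vcpu; subsequent hypercalls
       * keep passing plain virtual addresses, which Xen resolves via
       * this SW-TLB instead of walking the guest page tables. */
      return HYPERVISOR_hcbuf_op(HCBUF_OP_register, virt_to_phys(&reg));
  }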

Encrypted guests would probably want to use static percpu buffers in
order to avoid switching the encryption state of the buffers all the
time.

An unencrypted PVH/HVM domain (e.g. PVH dom0) could just define one
giant buffer with the domain's memory size via the physical memory
mapping of the kernel. All kmalloc() addresses would be in that region.

A buffer address not found would need to be translated like today (and
fail for an encrypted guest).

Thoughts?


Juergen


* Re: Proposal for physical address based hypercalls
  2022-09-28 10:38 Proposal for physical address based hypercalls Jan Beulich
  2022-09-28 10:58 ` Andrew Cooper
@ 2022-09-28 13:32 ` dpsmith.dev
  2022-09-29  8:16   ` Jan Beulich
  2022-10-04  9:38 ` Julien Grall
  2 siblings, 1 reply; 14+ messages in thread
From: dpsmith.dev @ 2022-09-28 13:32 UTC (permalink / raw)
  To: Jan Beulich, xen-devel

On 9/28/22 06:38, Jan Beulich wrote:
> For quite some time we've been talking about replacing the present virtual
> address based hypercall interface with one using physical addresses.  This is in
> particular a prerequisite to being able to support guests with encrypted
> memory, as for such guests we cannot perform the page table walks necessary to
> translate virtual to (guest-)physical addresses.  But using (guest) physical
> addresses is also expected to help performance of non-PV guests (i.e. all Arm
> ones plus HVM/PVH on x86), because of the no longer necessary address
> translation.

Greetings Jan,

I think there are multiple issues in play here, but the two major ones 
are 1.) eliminating the use of guest virtual addresses and 2.) handling 
the change in the security model for hypercalls from encrypted VMs. As 
Andy was pointing out, attempting to address (1) in a backwards 
compatible approach will likely not arrive at a solution that can 
address issue (2). IMHO, the only result from teaching the existing ABI 
to speak GPAs instead of VAs will be to break current and new kernels of 
the habit of using VAs. Beyond that I do not see how it will do anything 
to prepare current OS kernels for running as encrypted VMs, at least for 
AMD since that is the specification I have been focused on studying the 
last couple of months.

As for ABIv2, I understand and can appreciate Andy's desired approach. 
Recently, especially with the hardware changes being introduced by SEV, 
I would like to propose a naive and more radical approach. Currently 
hypercalls function in a more ioctl-like style. I would like to suggest 
that a packet style interface similar to netlink be considered. There 
are many benefits to adopting this type of interface that could be 
covered in a larger RFC if there was any sense of willingness to 
consider it. As a glimpse, a few benefits would be that arbitrary 
buffers, continuations/asynchronous calls, and multi-call are all 
natural consequences. It would also allow advanced extensions, such as 
an optional PF_RING-like interface for zero-copy messaging from guest 
user-space to hypervisor. While a packet interface could easily co-exist 
with the existing ioctl-style interface, it would be a paradigm shift 
from the past, though I feel ABIv2 was already going to be such a shift. 
Anyway, just my 2¢.

V/r,
DPS



* Re: Proposal for physical address based hypercalls
  2022-09-28 13:32 ` dpsmith.dev
@ 2022-09-29  8:16   ` Jan Beulich
  2022-09-29 12:53     ` Daniel P. Smith
  0 siblings, 1 reply; 14+ messages in thread
From: Jan Beulich @ 2022-09-29  8:16 UTC (permalink / raw)
  To: dpsmith.dev; +Cc: xen-devel

On 28.09.2022 15:32, dpsmith.dev wrote:
> On 9/28/22 06:38, Jan Beulich wrote:
>> For quite some time we've been talking about replacing the present virtual
>> address based hypercall interface with one using physical addresses.  This is in
>> particular a prerequisite to being able to support guests with encrypted
>> memory, as for such guests we cannot perform the page table walks necessary to
>> translate virtual to (guest-)physical addresses.  But using (guest) physical
>> addresses is also expected to help performance of non-PV guests (i.e. all Arm
>> ones plus HVM/PVH on x86), because of the no longer necessary address
>> translation.
> 
> Greetings Jan,
> 
> I think there are multiple issues in play here, but the two major ones 
> are 1.) eliminating the use of guest virtual addresses and 2.) handling 
> the change in the security model for hypercalls from encrypted VMs. As 
> Andy was pointing out, attempting to address (1) in a backwards 
> compatible approach will likely not arrive at a solution that can 
> address issue (2).

It may not be sufficient, but it is (can be) a prereq.

> IMHO, the only result from teaching the existing ABI 
> to speak GPAs instead of VAs will be to break current and new kernels of 
> the habit of using VAs. Beyond that I do not see how it will do anything 
> to prepare current OS kernels for running as encrypted VMs, at least for 
> AMD since that is the specification I have been focused on studying the 
> last couple of months.

Plus we'd have code in the hypervisor then which deals with physical
address based hypercall buffers. One less prereq to take care of for
the (huge) rest of the work needed.

> As for ABIv2, I understand and can appreciate Andy's desired approach. 
> Recently, especially with the hardware changes being introduced by SEV, 
> I would like to have considered a naive and more radical approach. 
> Currently hypercalls function using a more ioctl style. I would like to 
> suggest that a packet style interface similar to netlink be considered. 
> There are many benefits to adopting this type of interface that could be 
> covered in a larger RFC if there was any sense of willingness to 
> consider it. As a glimpse, a few benefits would be that arbitrary 
> buffers, continuations/asynchronous calls, and multi-call are all 
> natural consequence. It would also allow advanced extensions, such as an 
> optional PF_RING-like interface for zero-copy messaging from guest 
> user-space to hypervisor. While a packet interface could easily co-exist 
> with the existing ioctl-style interface, it would be a paradigm shift 
> from the past, though I feel ABIv2 was already going to be such a shift. 
> Anyway, just my 2¢.

I'm sorry for my ignorance, but I have no knowledge of how netlink
works.

Jan



* Re: Proposal for physical address based hypercalls
  2022-09-28 13:03     ` Juergen Gross
@ 2022-09-29 10:16       ` Wei Chen
  2022-09-29 11:32       ` Jan Beulich
  1 sibling, 0 replies; 14+ messages in thread
From: Wei Chen @ 2022-09-29 10:16 UTC (permalink / raw)
  To: Juergen Gross, Jan Beulich, Andrew Cooper; +Cc: xen-devel

Hi Juergen,

On 2022/9/28 21:03, Juergen Gross wrote:
> On 28.09.22 14:06, Jan Beulich wrote:
>> On 28.09.2022 12:58, Andrew Cooper wrote:
>>> On 28/09/2022 11:38, Jan Beulich wrote:

> 
> What about an alternative model allowing to use most of the current
> hypercalls unmodified?
> 
> We could add a new hypercall for registering hypercall buffers via
> virtual address, physical address, and size of the buffers (kind of a
> software TLB). The buffer table would want to be physically addressed
> by the hypercall, of course.
> 
> It might be interesting to have this table per vcpu (it should be
> allowed to use the same table for multiple vcpus) in order to speed
> up finding translation entries of percpu buffers.
> 
> Any hypercall buffer being addressed virtually could first tried to
> be found via the SW-TLB. This wouldn't require any changes for most
> of the hypercall interfaces. Only special cases with very large buffers
> might need indirect variants (like Jan said: via GFN lists, which could
> be passed in registered buffers).
> 
> Encrypted guests would probably want to use static percpu buffers in
> order to avoid switching the encryption state of the buffers all the
> time.
> 

I agree with this one. When we were working on Arm Realm, we were also 
concerned about how hypercall buffers are shared between Xen and a 
realm VM. Dynamically switching memory between the protected and 
unprotected state (accessible to both the VM and Xen, but not usable 
for code execution) can be very expensive, and such uncertainty also 
makes it easy to introduce security problems. We had thought about 
explicitly reserving a section of unprotected memory in the realm VM 
for hypercall buffers, but that would mean the Linux Xen drivers need 
to be modified. It's great to see the community starting to work on a 
design for this.


Cheers,
Wei Chen

> An unencrypted PVH/HVM domain (e.g. PVH dom0) could just define one
> giant buffer with the domain's memory size via the physical memory
> mapping of the kernel. All kmalloc() addresses would be in that region.
> 
> A buffer address not found would need to be translated like today (and
> fail for an encrypted guest).
> 
> Thoughts?
> 
> 
> Juergen



* Re: Proposal for physical address based hypercalls
  2022-09-28 13:03     ` Juergen Gross
  2022-09-29 10:16       ` Wei Chen
@ 2022-09-29 11:32       ` Jan Beulich
  2022-09-29 12:26         ` Juergen Gross
  1 sibling, 1 reply; 14+ messages in thread
From: Jan Beulich @ 2022-09-29 11:32 UTC (permalink / raw)
  To: Juergen Gross; +Cc: xen-devel, Andrew Cooper

On 28.09.2022 15:03, Juergen Gross wrote:
> On 28.09.22 14:06, Jan Beulich wrote:
>> On 28.09.2022 12:58, Andrew Cooper wrote:
>>> On 28/09/2022 11:38, Jan Beulich wrote:
>>>> As an alternative I'd like to propose the introduction of a bit (or multiple
>>>> ones, see below) augmenting the hypercall number, to control the flavor of the
>>>> buffers used for every individual hypercall.  This would likely involve the
>>>> introduction of a new hypercall page (or multiple ones if more than one bit is
>>>> to be used), to retain the present abstraction where it is the hypervisor which
>>>> actually fills these pages.
>>>
>>> There are other concerns which need to be accounted for.
>>>
>>> Encrypted VMs cannot use a hypercall page; they don't trust the
>>> hypervisor in the first place, and the hypercall page is (specifically)
>>> code injection.  So the sensible new ABI cannot depend on a hypercall table.
>>
>> I don't think there's a dependency, and I think there never really has been.
>> We've been advocating for its use, but we've not enforced that anywhere, I
>> don't think.
>>
>>> Also, rewriting the hypercall page on migrate turns out not to have been
>>> the most clever idea, and only works right now because the instructions
>>> are the same length in the variations for each mode.
>>>
>>> Also continuations need to change to avoid userspace liveness problems,
>>> and existing hypercalls that we do have need splitting between things
>>> which are actually privileged operations (within the guest context) and
>>> things which are logical control operations, so the kernel can expose
>>> the latter to userspace without retaining the gaping root hole which is
>>> /dev/xen/privcmd, and a blocker to doing UEFI Secureboot.
>>>
>>> So yes, starting some new clean(er) interface from hypercall 64 is the
>>> plan, but it very much does not want to be a simple mirror of the
>>> existing 0-63 with a differing calling convention.
>>
>> All of these look like orthogonal problems to me. That's likely all
>> relevant for, as I think you've been calling it, ABI v2, but shouldn't
>> hinder our switching to a physical address based hypercall model.
>> Otherwise I'm afraid we'll never make any progress in that direction.
> 
> What about an alternative model allowing to use most of the current
> hypercalls unmodified?
> 
> We could add a new hypercall for registering hypercall buffers via
> virtual address, physical address, and size of the buffers (kind of a
> software TLB).

Why not?

> The buffer table would want to be physically addressed
> by the hypercall, of course.

I'm not convinced of this, as it would break uniformity of the hypercall
interfaces. IOW in the hypervisor we then wouldn't be able to use
copy_from_guest() to retrieve the contents. Perhaps this simply shouldn't
be a table, but a hypercall not involving any buffers (i.e. every
discontiguous piece would need registering separately). I expect such a
software TLB wouldn't have many entries, so needing to use a couple of
hypercalls shouldn't be a major issue.

> It might be interesting to have this table per vcpu (it should be
> allowed to use the same table for multiple vcpus) in order to speed
> up finding translation entries of percpu buffers.

Yes. Perhaps insertion and purging could simply be two new VCPUOP_*.

As a prereq I think we'd need to sort the cross-vCPU accessing of guest
data, coincidentally pointed out in a post-commit-message remark in
https://lists.xen.org/archives/html/xen-devel/2022-09/msg01761.html. The
subject vCPU isn't available in copy_to_user_hvm(), which is where I'd
expect the TLB lookup to occur (while assuming handles point at globally
mapped space _might_ be okay, using the wrong vCPU's TLB surely isn't).
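
(Roughly where such a lookup might sit - everything below is invented for
illustration; today's copy_to_user_hvm() has no such logic:)

  static void *hcbuf_lookup(const struct vcpu *v, unsigned long va,
                            unsigned int len)
  {
      const struct hcbuf_entry *e;

      /* Must be the subject vCPU's table - consulting another vCPU's
       * registrations would be wrong, hence the prereq above. */
      for ( e = v->hcbuf_tlb; e; e = e->next )
          if ( va >= e->vaddr && len <= e->size &&
               va - e->vaddr <= e->size - len )
              return hcbuf_map(e->gaddr + (va - e->vaddr), len);

      return NULL;  /* fall back to a page walk; fail for encrypted guests */
  }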

> Any hypercall buffer being addressed virtually could first tried to
> be found via the SW-TLB. This wouldn't require any changes for most
> of the hypercall interfaces. Only special cases with very large buffers
> might need indirect variants (like Jan said: via GFN lists, which could
> be passed in registered buffers).
> 
> Encrypted guests would probably want to use static percpu buffers in
> order to avoid switching the encryption state of the buffers all the
> time.
> 
> An unencrypted PVH/HVM domain (e.g. PVH dom0) could just define one
> giant buffer with the domain's memory size via the physical memory
> mapping of the kernel. All kmalloc() addresses would be in that region.

That's Linux-centric. I'm not convinced all OSes maintain a directmap.
Without such, switching to this model might end up quite intrusive on
the OS side.

Thinking of Linux, we'd need a 2nd range covering the data part of the
kernel image.

Further this still wouldn't (afaics) pave a reasonable route towards
dealing with privcmd-invoked hypercalls.

Finally - to what extent are we concerned about PV guests using linear
addresses for hypercall buffers? I ask because I don't think the model
lends itself to being used for the PV guest interfaces as well.

Jan

> A buffer address not found would need to be translated like today (and
> fail for an encrypted guest).
> 
> Thoughts?
> 
> 
> Juergen




* Re: Proposal for physical address based hypercalls
  2022-09-29 11:32       ` Jan Beulich
@ 2022-09-29 12:26         ` Juergen Gross
  2022-09-29 12:58           ` Jan Beulich
  0 siblings, 1 reply; 14+ messages in thread
From: Juergen Gross @ 2022-09-29 12:26 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Andrew Cooper



On 29.09.22 13:32, Jan Beulich wrote:
> On 28.09.2022 15:03, Juergen Gross wrote:
>> On 28.09.22 14:06, Jan Beulich wrote:
>>> On 28.09.2022 12:58, Andrew Cooper wrote:
>>>> On 28/09/2022 11:38, Jan Beulich wrote:
>>>>> As an alternative I'd like to propose the introduction of a bit (or multiple
>>>>> ones, see below) augmenting the hypercall number, to control the flavor of the
>>>>> buffers used for every individual hypercall.  This would likely involve the
>>>>> introduction of a new hypercall page (or multiple ones if more than one bit is
>>>>> to be used), to retain the present abstraction where it is the hypervisor which
>>>>> actually fills these pages.
>>>>
>>>> There are other concerns which need to be accounted for.
>>>>
>>>> Encrypted VMs cannot use a hypercall page; they don't trust the
>>>> hypervisor in the first place, and the hypercall page is (specifically)
>>>> code injection.  So the sensible new ABI cannot depend on a hypercall table.
>>>
>>> I don't think there's a dependency, and I think there never really has been.
>>> We've been advocating for its use, but we've not enforced that anywhere, I
>>> don't think.
>>>
>>>> Also, rewriting the hypercall page on migrate turns out not to have been
>>>> the most clever idea, and only works right now because the instructions
>>>> are the same length in the variations for each mode.
>>>>
>>>> Also continuations need to change to avoid userspace liveness problems,
>>>> and existing hypercalls that we do have need splitting between things
>>>> which are actually privileged operations (within the guest context) and
>>>> things which are logical control operations, so the kernel can expose
>>>> the latter to userspace without retaining the gaping root hole which is
>>>> /dev/xen/privcmd, and a blocker to doing UEFI Secureboot.
>>>>
>>>> So yes, starting some new clean(er) interface from hypercall 64 is the
>>>> plan, but it very much does not want to be a simple mirror of the
>>>> existing 0-63 with a differing calling convention.
>>>
>>> All of these look like orthogonal problems to me. That's likely all
>>> relevant for, as I think you've been calling it, ABI v2, but shouldn't
>>> hinder our switching to a physical address based hypercall model.
>>> Otherwise I'm afraid we'll never make any progress in that direction.
>>
>> What about an alternative model allowing to use most of the current
>> hypercalls unmodified?
>>
>> We could add a new hypercall for registering hypercall buffers via
>> virtual address, physical address, and size of the buffers (kind of a
>> software TLB).
> 
> Why not?
> 
>> The buffer table would want to be physically addressed
>> by the hypercall, of course.
> 
> I'm not convinced of this, as it would break uniformity of the hypercall
> interfaces. IOW in the hypervisor we then wouldn't be able to use
> copy_from_guest() to retrieve the contents. Perhaps this simply shouldn't
> be a table, but a hypercall not involving any buffers (i.e. every
> discontiguous piece would need registering separately). I expect such a
> software TLB wouldn't have many entries, so needing to use a couple of
> hypercalls shouldn't be a major issue.

Fine with me.

> 
>> It might be interesting to have this table per vcpu (it should be
>> allowed to use the same table for multiple vcpus) in order to speed
>> up finding translation entries of percpu buffers.
> 
> Yes. Perhaps insertion and purging could simply be two new VCPUOP_*.

Again fine with me.

> As a prereq I think we'd need to sort the cross-vCPU accessing of guest
> data, coincidentally pointed out in a post-commit-message remark in
> https://lists.xen.org/archives/html/xen-devel/2022-09/msg01761.html. The
> subject vCPU isn't available in copy_to_user_hvm(), which is where I'd
> expect the TLB lookup to occur (while assuming handles point at globally
> mapped space _might_ be okay, using the wrong vCPU's TLB surely isn't).

Any per-vcpu buffer should only be used by the respective vcpu.

>> Any hypercall buffer being addressed virtually could first tried to
>> be found via the SW-TLB. This wouldn't require any changes for most
>> of the hypercall interfaces. Only special cases with very large buffers
>> might need indirect variants (like Jan said: via GFN lists, which could
>> be passed in registered buffers).
>>
>> Encrypted guests would probably want to use static percpu buffers in
>> order to avoid switching the encryption state of the buffers all the
>> time.
>>
>> An unencrypted PVH/HVM domain (e.g. PVH dom0) could just define one
>> giant buffer with the domain's memory size via the physical memory
>> mapping of the kernel. All kmalloc() addresses would be in that region.
> 
> That's Linux-centric. I'm not convinced all OSes maintain a directmap.
> Without such, switching to this model might end up quite intrusive on
> the OS side.

This model is especially interesting for dom0. The majority of installations
are running a Linux dom0 AFAIK, so having an easy way to speed this case up
is a big plus.

> Thinking of Linux, we'd need a 2nd range covering the data part of the
> kernel image.

Probably, yes.

> Further this still wouldn't (afaics) pave a reasonable route towards
> dealing with privcmd-invoked hypercalls.

Today the hypercall buffers are all allocated via the privcmd driver. It
should be fairly easy to add an ioctl to get the buffer's kernel address
instead of using the user address.
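
(A sketch of what such an ioctl could look like - the name, number and
structure are invented here, this is not an existing privcmd interface:)

  struct privcmd_hcbuf_kaddr {
      uint64_t uaddr;  /* in:  user address from the existing buffer mmap */
      uint64_t len;    /* in:  length of the buffer                       */
      uint64_t kaddr;  /* out: corresponding kernel (direct map) address  */
  };
  #define IOCTL_PRIVCMD_HCBUF_KADDR \
      _IOC(_IOC_NONE, 'P', 0x10, sizeof(struct privcmd_hcbuf_kaddr))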

Multi-page buffers might be problematic, though, so either we need to
have special variants for hypercalls with such buffers, or we just fall
back to using virtual addresses for the cases where no guest
physically contiguous buffer could be allocated (doesn't apply to
encrypted guests, of course, as those need to have large enough buffers
anyway).

> Finally - in how far are we concerned of PV guests using linear
> addresses for hypercall buffers? I ask because I don't think the model
> lends itself to use also for the PV guest interfaces.

Good question.

As long as we support PV guests we can't drop support for linear addresses
IMO. So the question is whether we are fine with PV guests not using the
pre-registered buffers, or if we want to introduce an interface for PV
guests using GFNs instead of MFNs.

Juergen

>> A buffer address not found would need to be translated like today (and
>> fail for an encrypted guest).
>>
>> Thoughts?
>>
>>
>> Juergen
> 



* Re: Proposal for physical address based hypercalls
  2022-09-29  8:16   ` Jan Beulich
@ 2022-09-29 12:53     ` Daniel P. Smith
  0 siblings, 0 replies; 14+ messages in thread
From: Daniel P. Smith @ 2022-09-29 12:53 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel

On 9/29/22 04:16, Jan Beulich wrote:
> On 28.09.2022 15:32, dpsmith.dev wrote:
>> On 9/28/22 06:38, Jan Beulich wrote:
>>> For quite some time we've been talking about replacing the present virtual
>>> address based hypercall interface with one using physical addresses.  This is in
>>> particular a prerequisite to being able to support guests with encrypted
>>> memory, as for such guests we cannot perform the page table walks necessary to
>>> translate virtual to (guest-)physical addresses.  But using (guest) physical
>>> addresses is also expected to help performance of non-PV guests (i.e. all Arm
>>> ones plus HVM/PVH on x86), because of the no longer necessary address
>>> translation.
>>
>> Greetings Jan,
>>
>> I think there are multiple issues in play here, but the two major ones
>> are 1.) eliminating the use of guest virtual addresses and 2.) handling
>> the change in the security model for hypercalls from encrypted VMs. As
>> Andy was pointing out, attempting to address (1) in a backwards
>> compatible approach will likely not arrive at a solution that can
>> address issue (2).
> 
> It may not be sufficient, but it is (can be) a prereq.

As I stated below, it will start setting the precedent for using GPAs. 
The concern is two-fold: how much benefit can actually be achieved from 
an API/ABI that cannot be used in the final solution, and, by focusing 
effort on an unusable API/ABI, how much will that reduce effort/focus on 
crafting an API/ABI that can be used?

>> IMHO, the only result from teaching the existing ABI
>> to speak GPAs instead of VAs will be to break current and new kernels of
>> the habit of using VAs. Beyond that I do not see how it will do anything
>> to prepare current OS kernels for running as encrypted VMs, at least for
>> AMD since that is the specification I have been focused on studying the
>> last couple of months.
> 
> Plus we'd have code in the hypervisor then which deals with physical
> address based hypercall buffers. One less prereq to take care of for
> the (huge) rest of the work needed.

A question I would have is why not just RFC a GPA buffer helper 
framework for the hypervisor, since it will get used by the new ABI, and 
not spend effort retrofitting the current ABI. Some follow-on questions 
I would also ask are: moving forward, would new revisions of guests 
using the existing ABI be expected to move to GPAs, and how long do 
people see the existing ABI continuing in new guest revisions after the 
new ABI is adopted?

>> As for ABIv2, I understand and can appreciate Andy's desired approach.
>> Recently, especially with the hardware changes being introduced by SEV,
>> I would like to have considered a naive and more radical approach.
>> Currently hypercalls function using a more ioctl style. I would like to
>> suggest that a packet style interface similar to netlink be considered.
>> There are many benefits to adopting this type of interface that could be
>> covered in a larger RFC if there was any sense of willingness to
>> consider it. As a glimpse, a few benefits would be that arbitrary
>> buffers, continuations/asynchronous calls, and multi-call are all
>> natural consequence. It would also allow advanced extensions, such as an
>> optional PF_RING-like interface for zero-copy messaging from guest
>> user-space to hypervisor. While a packet interface could easily co-exist
>> with the existing ioctl-style interface, it would be a paradigm shift
>> from the past, though I feel ABIv2 was already going to be such a shift.
>> Anyway, just my 2¢.
> 
> I'm sorry for my ignorance, but I have no knowledge of how netlink
> works.

Understood, and you are not the first. A very quick, and very loose, 
comparison is that currently hypercalls are managed as an ioctl-style 
remote call with a per-version defined payload. This proposal would move 
to a packet dispatch where the packet is a free-form TLV that allows 
unknown elements/parameters to be present. This enables a newer 
toolstack, without requiring a constantly moving compatibility layer, to 
send a packet to an older hypervisor, which can reject unknown elements 
while hypercalls silently ignore unknown parameters. Similarly, an older 
toolstack will be able to send packets to a new hypervisor. And as I 
stated above, this approach naturally enables continuations/async 
operations and multi-call invocations. It is a significant departure, 
and thus would require substantial design and implementation work, but 
there is an opportunity here to do this work.
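
(A loose illustration of the kind of framing meant here - the names and layout 
are invented, not a concrete proposal:)

  struct xen_pkt_hdr {
      uint16_t op;      /* requested operation                        */
      uint16_t flags;   /* e.g. asynchronous / wants completion event */
      uint32_t len;     /* total length of the elements that follow   */
      uint64_t seq;     /* lets the guest match asynchronous replies  */
  };

  struct xen_pkt_elem {
      uint16_t tag;     /* parameter identifier                       */
      uint16_t len;     /* value length; unknown tags can be skipped  */
      uint8_t  value[]; /* payload, padded to the next 8-byte boundary */
  };

A dispatcher walking the elements by (tag, len) can simply skip tags it does 
not understand, which is where the forward/backward compatibility would come 
from.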

V/r,
DPS



* Re: Proposal for physical address based hypercalls
  2022-09-29 12:26         ` Juergen Gross
@ 2022-09-29 12:58           ` Jan Beulich
  2022-09-29 13:03             ` Juergen Gross
  0 siblings, 1 reply; 14+ messages in thread
From: Jan Beulich @ 2022-09-29 12:58 UTC (permalink / raw)
  To: Juergen Gross; +Cc: xen-devel, Andrew Cooper

On 29.09.2022 14:26, Juergen Gross wrote:
> On 29.09.22 13:32, Jan Beulich wrote:
>> Finally - in how far are we concerned of PV guests using linear
>> addresses for hypercall buffers? I ask because I don't think the model
>> lends itself to use also for the PV guest interfaces.
> 
> Good question.
> 
> As long as we support PV guests we can't drop support for linear addresses
> IMO. So the question is whether we are fine with PV guests not using the
> pre-registered buffers, or if we want to introduce an interface for PV
> guests using GFNs instead of MFNs.

GFN == MFN for PV, and using PFN space (being entirely controlled by the
guest) doesn't look attractive either. Plus any form of translation we'd
need to do for PV would involve getting and putting page references (for
writes also type references), along the lines of what is already
happening for HVM. Since "put" may involve freeing a page, which in turn
requires locks to be taken, we'd need to carefully check that no such
translation can occur from an inappropriate call chain.

Jan



* Re: Proposal for physical address based hypercalls
  2022-09-29 12:58           ` Jan Beulich
@ 2022-09-29 13:03             ` Juergen Gross
  0 siblings, 0 replies; 14+ messages in thread
From: Juergen Gross @ 2022-09-29 13:03 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Andrew Cooper



On 29.09.22 14:58, Jan Beulich wrote:
> On 29.09.2022 14:26, Juergen Gross wrote:
>> On 29.09.22 13:32, Jan Beulich wrote:
>>> Finally - in how far are we concerned of PV guests using linear
>>> addresses for hypercall buffers? I ask because I don't think the model
>>> lends itself to use also for the PV guest interfaces.
>>
>> Good question.
>>
>> As long as we support PV guests we can't drop support for linear addresses
>> IMO. So the question is whether we are fine with PV guests not using the
>> pre-registered buffers, or if we want to introduce an interface for PV
>> guests using GFNs instead of MFNs.
> 
> GFN == MFN for PV, and using PFN space (being entirely controlled by the

Sigh. I meant to write PFNs, of course.

> guest) doesn't look attractive either. Plus any form of translation we'd
> need to do for PV would involve getting and putting page references (for
> writes also type references), along the lines of what is already
> happening for HVM. Since "put" may involve freeing a page, which in turn
> require locks to be taken, we'd need to carefully check that no such
> translation can occur from an inappropriate call chain.

Sounds like a good reason to continue using linear addresses then.


Juergen


* Re: Proposal for physical address based hypercalls
  2022-09-28 10:38 Proposal for physical address based hypercalls Jan Beulich
  2022-09-28 10:58 ` Andrew Cooper
  2022-09-28 13:32 ` dpsmith.dev
@ 2022-10-04  9:38 ` Julien Grall
  2022-10-04  9:46   ` Jan Beulich
  2 siblings, 1 reply; 14+ messages in thread
From: Julien Grall @ 2022-10-04  9:38 UTC (permalink / raw)
  To: Jan Beulich, xen-devel

Hi Jan,

On 28/09/2022 11:38, Jan Beulich wrote:
> For quite some time we've been talking about replacing the present virtual
> address based hypercall interface with one using physical addresses.  This is in
> particular a prerequisite to being able to support guests with encrypted
> memory, as for such guests we cannot perform the page table walks necessary to
> translate virtual to (guest-)physical addresses.  But using (guest) physical
> addresses is also expected to help performance of non-PV guests (i.e. all Arm
> ones plus HVM/PVH on x86), because of the no longer necessary address
> translation.

I am not sure this is going to be a gain in performance on Arm. In most 
cases we are using the HW to translate the guest virtual address to a 
host physical address, but there is no instruction to translate a guest 
physical address to a host physical address, so we would have to do the 
translation in software.

That said, there are other reasons on Arm (and possibly x86) to get rid 
of the virtual address. At the moment, we are requiring the VA to be 
always valid. This is quite fragile as we can't fully control how the 
kernel is touching its page-table (remember that on Arm we need to use 
break-before-make to do any shattering).

I have actually seen some failures during the translation on Arm32 in 
the past, but I never fully investigated them because they were hard to 
reproduce as they rarely happened.

> 
> Clearly to be able to run existing guests, we need to continue to support the
> present virtual address based interface.  Previously it was suggested to change
> the model on a per-domain basis, perhaps by a domain creation control.  This
> has two major shortcomings:
>   - Entire guest OSes would need to switch over to the new model all in one go.
>     This could be particularly problematic for in-guest interfaces like Linux'es
>     privcmd driver, which is passed hypercall argument from user space.  Such
>     necessarily use virtual addresses, and hence the kernel would need to learn
>     of all hypercalls legitimately coming in, in order to translate the buffer
>     addresses.  Reaching sufficient coverage there might take some time.
>   - All base components within an individual guest instance which might run in
>     succession (firmware, boot loader, kernel, kexec) would need to agree on the
>     hypercall ABI to use.
> 
> As an alternative I'd like to propose the introduction of a bit (or multiple
> ones, see below) augmenting the hypercall number, to control the flavor of the
> buffers used for every individual hypercall.  This would likely involve the
> introduction of a new hypercall page (or multiple ones if more than one bit is
> to be used), to retain the present abstraction where it is the hypervisor which
> actually fills these pages.  For multicalls the wrapping multicall itself would
> be controlled independently of the constituent hypercalls.
> 
> A model involving just a single bit to indicate "flat" buffers has limitations
> when it comes to large buffers passed to a hypercall.  Since in many cases
> hypercalls (currently) allowing for rather large buffers wouldn't normally be
> used with buffers significantly larger than a single page (several of the
> mem-ops for example), special casing the (presumably) few hypercalls which have
> an actual need for large buffers might be an option.
> 
> Another approach would be to build in a scatter/gather model for buffers right
> away.  Jürgen suggests that the low two address bits could be used as a
> "descriptor" here.

IIUC, with this approach we would still need to have a bit in the 
hypercall number to indicate this is not a virtual address. Is that correct?

Cheers,

-- 
Julien Grall



* Re: Proposal for physical address based hypercalls
  2022-10-04  9:38 ` Julien Grall
@ 2022-10-04  9:46   ` Jan Beulich
  0 siblings, 0 replies; 14+ messages in thread
From: Jan Beulich @ 2022-10-04  9:46 UTC (permalink / raw)
  To: Julien Grall; +Cc: xen-devel

On 04.10.2022 11:38, Julien Grall wrote:
> On 28/09/2022 11:38, Jan Beulich wrote:
>> Another approach would be to build in a scatter/gather model for buffers right
>> away.  Jürgen suggests that the low two address bits could be used as a
>> "descriptor" here.
> 
> IIUC, with this approach we would still need to have a bit in the 
> hypercall number to indicate this is not a virtual address. Is that correct?

Yes.

Jan

