* [RFC] KVM API extensions for SVE
@ 2017-11-21 13:49 ` Dave Martin
From: Dave Martin @ 2017-11-21 13:49 UTC (permalink / raw)
  To: kvmarm; +Cc: Marc Zyngier, linux-arm-kernel

Hi all,

SVE adds some new registers, and their size depends on the hardware and
on runtime sysreg settings.

Before coding something, I'd like to get people's views on my current
approach here.

--8<--

New vcpu feature flag:
/*
 * userspace can support regs up to at least 2048 bits in size via ioctl,
 * and will honour the size field in the reg ID
 */
#define KVM_ARM_VCPU_LARGE_REGS	4
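
For illustration, the userspace side of opting in might look roughly like
this (a sketch only -- error handling trimmed, and I'm assuming the flag
just goes in features[0] like the existing vcpu feature bits):

#include <sys/ioctl.h>
#include <linux/kvm.h>

/*
 * Sketch: opt in to large regs when initialising each vcpu.
 * KVM_ARM_PREFERRED_TARGET and KVM_ARM_VCPU_INIT are existing UAPI;
 * KVM_ARM_VCPU_LARGE_REGS is the flag proposed above.
 */
static int vcpu_init_with_large_regs(int vm_fd, int vcpu_fd)
{
	struct kvm_vcpu_init init;

	if (ioctl(vm_fd, KVM_ARM_PREFERRED_TARGET, &init))
		return -1;

	init.features[0] |= 1u << KVM_ARM_VCPU_LARGE_REGS;

	return ioctl(vcpu_fd, KVM_ARM_VCPU_INIT, &init);
}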

Should we just error out if userspace fails to set this on a system that
supports SVE, or is that too brutal?  If we do treat that as an error,
then we can unconditionally enable SVE for the guest when the host
supports it -- which cuts down on unnecessary implementation complexity.

Alternatively, we would need logic to disable SVE if userspace is too
old, i.e., doesn't set this flag.  Then we might need to enforce that
the flag is set the same on every vcpu -- but from the kernel's PoV
it probably doesn't matter.


/*
 * For the SVE regs, we add some new reg IDs.
 * Zn are presented in 2048-bit slices; Pn, FFR are presented in 256-bit
 * slices.  This is sufficient for only a single slice to be required
 * per register for SVE, but ensures expansion room in case future arch
 * versions grow the maximum size.
 */
#define KVM_REG_SIZE_U2048 (ULL(8) << KVM_REG_SIZE_MASK)
#define KVM_REG_ARM64_SVE_Z(n, i) /* Zn[2048 * (i + 1) - 1 : 2048 * i] */ \
	((0x0014 << KVM_REG_ARM_COPROC_SHIFT) | KVM_REG_SIZE_U2048 |	\
		((n) << 5) | (i))
#define KVM_REG_ARM64_SVE_P(n, i) /* Pn[256 * (i + 1) - 1 : 256 * i] */	\
	((0x0014 << KVM_REG_ARM_COPROC_SHIFT) | KVM_REG_SIZE_U256 |	\
		(((n) + 32) << 5) | (i))
#define KVM_REG_ARM64_SVE_FFR(i) /* FFR[256 * (i + 1) - 1 : 256 * i] */	\
	 KVM_REG_ARM64_SVE_P(16, i)

For j in [0,3], KVM_REG_ARM64_SVE_Z(n, 0) bits [32(j + 1) - 1 : 32 * j]
	alias KVM_REG_ARM_CORE_REG(fp_regs.vregs[n]) + j
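
To make the usage concrete, reading slice 0 of Zn through the existing
one-reg interface would then be something like this (sketch only; the
256-byte buffer matches the 2048-bit slice size):

#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/*
 * Sketch: fetch Zn bits 2047:0 into a 256-byte buffer.  struct kvm_one_reg
 * and KVM_GET_ONE_REG are existing UAPI; the reg ID macro is the one
 * proposed above.
 */
static int get_sve_z_slice0(int vcpu_fd, unsigned int n, uint8_t buf[256])
{
	struct kvm_one_reg reg = {
		.id   = KVM_REG_ARM64_SVE_Z(n, 0),
		.addr = (uintptr_t)buf,
	};

	return ioctl(vcpu_fd, KVM_GET_ONE_REG, &reg);
}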


Bits above the max vector length could be
 * don't care (or not copied at all) on read; ignored on write
 * zero on read; ignored on write
 * zero on read; must be zero on write

Bits between the current and max vector length are trickier to specify:
the "current" vector length for ioctl access is ill-defined, because
we would need to specify ordering dependencies between Zn/Pn/FFR access
and access to ZCR_EL1.

So, it may be simpler to expose the full maximum supported vector size
unconditionally through ioctl, and pack/unpack as necessary.

Currently, data is packed in the vcpu struct in a vector length
dependent format, since this seems optimal for low-level save/restore,
so there will be potential data loss / zero padding when converting.
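
Concretely, packing down to the in-kernel format would be doing roughly
this (a sketch: vl is the guest's vector length in bytes, and the packed
layout is simplified to a flat byte array):

#include <stdint.h>
#include <string.h>

/*
 * Sketch: storing the fixed 2048-bit ioctl view into the vector-length-
 * dependent format.  Bytes beyond vl are simply dropped, which is why the
 * surplus bits can vanish in the example below.
 */
static void sve_z_pack(uint8_t *packed, const uint8_t view[256], unsigned int vl)
{
	memcpy(packed, view, vl);	/* bits above the vector length are lost */
}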

This may cause some unexpected effects.  For example:

KVM_SET_ONE_REG(ZCR_EL1, 0) 
/* Guest's current vector length will be 128 bits when started */

KVM_SET_ONE_REG(Z0, (uint256_t)1 << 128)
KVM_GET_ONE_REG(Z0) /* yields (uint256_t)1 << 128 */

KVM_RUN /* reg data packed down to 128-bit in vcpu struct */

KVM_GET_ONE_REG(Z0) /* yields 0 even if guest doesn't use SVE */


Since the guest should be treated mostly as a black box, I'm not sure
how big a deal this is.  The guest _might_ have explicitly set those
bits to 0... who are we to say?

Can anyone think of a scenario where effects like this would matter?

Cheers
---Dave

* Re: [RFC] KVM API extensions for SVE
  2017-11-21 13:49 ` Dave Martin
@ 2017-11-22 19:52   ` Christoffer Dall
From: Christoffer Dall @ 2017-11-22 19:52 UTC (permalink / raw)
  To: Dave Martin; +Cc: Marc Zyngier, linux-arm-kernel, kvmarm

Hi Dave,

On Tue, Nov 21, 2017 at 01:49:16PM +0000, Dave Martin wrote:
> Hi all,
> 
> SVE adds some new registers, and their size depends on the hardware and
> on runtime sysreg settings.
> 
> Before coding something, I'd like to get people's views on my current
> approach here.
> 
> --8<--
> 
> New vcpu feature flag:
> /*
>  * userspace can support regs up to at least 2048 bits in size via ioctl,
>  * and will honour the size field in the reg ID
>  */
> #define KVM_ARM_VCPU_LARGE_REGS	4
> 
> Should we just error out if userspace fails to set this on a system that
> supports SVE, or is that too brutal?  

That would break older userspace on newer kernels on certain hardware,
which would certainly not be acceptable.  Or did I misunderstand?

> If we do treat that as an error,
> then we can unconditionally enable SVE for the guest when the host
> supports it -- which cuts down on unnecessary implementation complexity.

I think it makes sense to be able to disable SVE, especially if there's
high overhead related to context switching the state.  Since you say the
implementation complexity is unnecessary, I may be missing some larger
point?

> 
> Alternatively, we would need logic to disable SVE if userspace is too
> old, i.e., doesn't set this flag.  Then we might need to enforce that
> the flag is set the same on every vcpu -- but from the kernel's PoV
> it probably doesn't matter.

Not sure I understand why it doesn't matter from the kernel's PoV.

I think SVE should be disabled by default (as it is now) and then we
should expose a capability (potentially simply via the vcpu attributes
being present), and let userspace enable SVE and set a vector length.

It makes sense that userspace needs to know about SVE for VMs to use it,
doesn't it?

I assume SVE systems will have SVE on all CPUs in an SMP system, or am I
being way too optimistic about the ARM ecosystem here?  Just like we
don't model big.LITTLE, I think we should enforce in the kernel, that
userspace configures all VCPUs with the same SVE properties.

> 
> 
> /*
>  * For the SVE regs, we add some new reg IDs.
>  * Zn are presented in 2048-bit slices; Pn, FFR are presented in 256-bit
>  * slices.  This is sufficient for only a single slice to be required
>  * per register for SVE, but ensures expansion room in case future arch
>  * versions grow the maximum size.
>  */

I don't understand the last part of this comment?

> #define KVM_REG_SIZE_U2048 (ULL(8) << KVM_REG_SIZE_MASK)

Shift left by KVM_REG_SIZE_MASK?  I'm confused.

> #define KVM_REG_ARM64_SVE_Z(n, i) /* Zn[2048 * (i + 1) - 1 : 2048 * i] */ \
> 	((0x0014 << KVM_REG_ARM_COPROC_SHIFT) | KVM_REG_SIZE_U2048 |	\
> 		((n) << 5) | (i))
> #define KVM_REG_ARM64_SVE_P(n, i) /* Pn[256 * (i + 1) - 1 : 256 * i] */	\
> 	((0x0014 << KVM_REG_ARM_COPROC_SHIFT) | KVM_REG_SIZE_U256 |	\
> 		(((n) + 32) << 5) | (i))
> #define KVM_REG_ARM64_SVE_FFR(i) /* FFR[256 * (i + 1) - 1 : 256 * i] */	\
> 	 KVM_REG_ARM64_SVE_P(16, i)
> 
> For j in [0,3], KVM_REG_ARM64_SVE_Z(n, 0) bits [32(j + 1) - 1 : 32 * j]
> 	alias KVM_REG_ARM_CORE_REG(fp_regs.vregs[n]) + j
> 

This is hard to read and understand the way presented here.  I would
suggest you formulate this suggestion in the form of an RFC patch to
Documentation/virtual/kvm/api.txt plus the header definitions.

(I'm not sure where to look to decode the "<< 5" and the
" + 32) << 5)" stuff above.)

> 
> Bits above the max vector length could be

Which bits are these and where are they, and why do we have them?

Is the max vector length the max architecturally (current version)
defined length, or what is chosen for this VM?

>  * don't care (or not copied at all) on read; ignored on write
>  * zero on read; ignored on write
>  * zero on read; must be zero on write
> 
> Bits between the current and max vector length are trickier to specify:
> the "current" vector length for ioctl access is ill-defined, because
> we would need to specify ordering dependencies between Zn/Pn/FFR access
> and access to ZCR_EL1.

I think you want userspace to be able to read/write these values in any
order compared to configuring SVE for the VM, and then fix up whatever
needs masking etc. in the kernel later, if possible.  Ordering
requirements to userspace accesses have shown to be hard to enforce and
get right in the past.

What happens on hardware if you give a certain vector length to EL0, EL0
writes a value of the full length, and then EL1 restricts the length to
something smaller, and subsequently expands it again?  Is the original
full value visible or are some bits potentially lost?  IOW, can't we
rely on what the architecture says here?

> 
> So, it may be simpler to expose the full maximum supported vector size
> unconditionally through ioctl, and pack/unpack as necessary.

yes, I think this was what I tried to say.

> 
> Currently, data is packed in the vcpu struct in a vector length
> dependent format, since this seems optimal for low-level save/restore,
> so there will be potential data loss / zero padding when converting.
> 
> This may cause some unexpected effects.  For example:
> 
> KVM_SET_ONE_REG(ZCR_EL1, 0) 
> /* Guest's current vector length will be 128 bits when started */
> 
> KVM_SET_ONE_REG(Z0, (uint256_t)1 << 128)
> KVM_GET_ONE_REG(Z0) /* yields (uint256_t)1 << 128 */
> 
> KVM_RUN /* reg data packed down to 128-bit in vcpu struct */
> 
> KVM_GET_ONE_REG(Z0) /* yields 0 even if guest doesn't use SVE */
> 

I really don't know how to parse this or what the point here is?  Sorry.

> 
> Since the guest should be treated mostly as a black box, I'm not sure
> how big a deal this is.  The guest _might_ have explicitly set those
> bits to 0... who are we to say?

How big a deal what is?  I'm lost.

> 
> Can anyone think of a scenario where effects like this would matter?
> 

I think we need more information.

Thanks,
-Christoffer

* Re: [RFC] KVM API extensions for SVE
  2017-11-22 19:52   ` Christoffer Dall
@ 2017-11-23 18:40     ` Dave Martin
From: Dave Martin @ 2017-11-23 18:40 UTC (permalink / raw)
  To: Christoffer Dall; +Cc: Marc Zyngier, kvmarm, linux-arm-kernel

On Wed, Nov 22, 2017 at 08:52:30PM +0100, Christoffer Dall wrote:
> Hi Dave,
> 
> On Tue, Nov 21, 2017 at 01:49:16PM +0000, Dave Martin wrote:
> > Hi all,
> > 
> > SVE adds some new registers, and their size depends on the hardware and
> > on runtime sysreg settings.
> > 
> > Before coding something, I'd like to get people's views on my current
> > approach here.
> > 
> > --8<--
> > 
> > New vcpu feature flag:
> > /*
> >  * userspace can support regs up to at least 2048 bits in size via ioctl,
> >  * and will honour the size field in the reg ID
> >  */
> > #define KVM_ARM_VCPU_LARGE_REGS	4
> > 
> > Should we just error out if userspace fails to set this on a system that
> > supports SVE, or is that too brutal?  
> 
> That would break older userspace on newer kernels on certain hardware,
> which would certainly not be acceptable.  Or did I misunderstand?

Yes, which is probably bad.

I'm still trying to gauge the policy regarding backwards compatibility.


So, I guess what you're saying is that we should disable SVE for the guest
but still run it in this case.


This creates another issue: if SVE is supported by the host kernel
but not enabled for the guest, do I need to hide the SVE regs from
KVM_GET_REG_LIST?

If not, a guest that doesn't have SVE created on a host that supports
SVE would not be migratable to a host kernel that doesn't support SVE
but otherwise could run the guest:  as I understand it, the attempt to
restore the SVE regs on the target node would fail because the host
kernel doesn't recognise those regs.

Or do we not claim to support backwards compatibility for migration?

> 
> > If we do treat that as an error,
> > then we can unconditionally enable SVE for the guest when the host
> > supports it -- which cuts down on unnecessary implementation complexity.
> 
> I think it makes sense to be able to disable SVE, especially if there's
> high overhead related to context switching the state.  Since you say the
> implementation complexity is unnecessary, I may be missing some larger
> point?

I don't think it's all that bad, but there doesn't seem to be any
example of an optional architecture feature for a guest today --
so I wanted to check before setting a precedent here.

Would "SVE enabled" be better as an attribute, rather than a
feature, or does it not really matter?

> > 
> > Alternatively, we would need logic to disable SVE if userspace is too
> > old, i.e., doesn't set this flag.  Then we might need to enforce that
> > the flag is set the same on every vcpu -- but from the kernel's PoV
> > it probably doesn't matter.
> 
> Not sure I understand why it doesn't matter from the kernel's PoV.
> 
> I think SVE should be disabled by default (as it is now) and then we
> should expose a capability (potentially simply via the vcpu attributes
> being present), and let userspace enable SVE and set a vector length.

Yes, but aren't all the attributes/features per-vcpu?

> It makes sense that userspace needs to know about SVE for VMs to use it,
> doesn't it?

Yes and no.  Except for debug purposes I don't see why userspace needs
to know anything except how to handle large registers through the ioctl
interface.

> I assume SVE systems will have SVE on all CPUs in an SMP system, or am I
> being way too optimistic about the ARM ecosystem here?  Just like we
> don't model big.LITTLE, I think we should enforce in the kernel, that

The kernel follows the same policy: if SVE is not present on all
physical CPUs we disable it completely and hide it from guests and
userspace.

For vector length I'm a little more permissive: the max vector length
would be clamped to the minimum commonly supported vector length.

> userspace configures all VCPUs with the same SVE properties.

OK, so long as you think it's not superfluous to do it, then I'm happy
to do it.

> > 
> > /*
> >  * For the SVE regs, we add some new reg IDs.
> >  * Zn are presented in 2048-bit slices; Pn, FFR are presented in 256-bit
> >  * slices.  This is sufficient for only a single slice to be required
> >  * per register for SVE, but ensures expansion room in case future arch
> >  * versions grow the maximum size.
> >  */
> 
> I don't understand the last part of this comment?

This may be explained better in my response below.

> > #define KVM_REG_SIZE_U2048 (ULL(8) << KVM_REG_SIZE_MASK)
> 
> Shift left by KVM_REG_SIZE_MASK?  I'm confused.
> 
> > #define KVM_REG_ARM64_SVE_Z(n, i) /* Zn[2048 * (i + 1) - 1 : 2048 * i] */ \
> > 	((0x0014 << KVM_REG_ARM_COPROC_SHIFT) | KVM_REG_SIZE_U2048 |	\
> > 		((n) << 5) | (i))
> > #define KVM_REG_ARM64_SVE_P(n, i) /* Pn[256 * (i + 1) - 1 : 256 * i] */	\
> > 	((0x0014 << KVM_REG_ARM_COPROC_SHIFT) | KVM_REG_SIZE_U256 |	\
> > 		(((n) + 32) << 5) | (i))
> > #define KVM_REG_ARM64_SVE_FFR(i) /* FFR[256 * (i + 1) - 1 : 256 * i] */	\
> > 	 KVM_REG_ARM64_SVE_P(16, i)
> > 
> > For j in [0,3], KVM_REG_ARM64_SVE_Z(n, 0) bits [32(j + 1) - 1 : 32 * j]
> > 	alias KVM_REG_ARM_CORE_REG(fp_regs.vregs[n]) + j
> > 
> 
> This is hard to read and understand the way presented here.  I would
> suggest you formulate this suggestion in the form of an RFC patch to
> Documentation/virtual/kvm/api.txt plus the header definitions.

Sure, I hadn't figured out the best way to present this: I was thinking
aloud.

> (I'm not sure where to look to decode the "<< 5" and the
> " + 32) << 5)" stuff above.)

The idea here is that we have 49 registers: Z0-Z31, P0-P15 and FFR.
They are numbered serially.

However, the SVE architecture leaves the possibility of future
expansion open, up to 32 times the current maximum.

The KVM reg API doesn't support such ridiculously huge registers,
so my proposal is to slice them up, indexed by the value in the
bottom 5 bits of the reg ID.  This requires the "register ID"
field to be shifted up by 5 bits.

If the regs are not oversized (never, for the current architecture),
then we simply don't list those extra slices via KVM_GET_REG_LIST.
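
Decoding goes the other way, something like this (a sketch; the 0x3f mask
is just wide enough for the 49 serial numbers and is illustrative, not
part of the proposal):

#include <stdint.h>

/*
 * Sketch of the proposed layout within the low 16 bits of the ID:
 * bits 4:0 select the slice, the bits above give the serial register
 * number (Z0-Z31 -> 0..31, P0-P15 -> 32..47, FFR -> 48).
 */
static void sve_reg_decode(uint64_t id, unsigned int *serial, unsigned int *slice)
{
	*slice  = id & 0x1f;
	*serial = (id >> 5) & 0x3f;
}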

> > Bits above the max vector length could be
> 
> Which bits are these and where are they, and why do we have them?

The KVM register size via ioctl is fixed at 2048 bits here.  Since
the system might not support vectors that large, then bits 2047:1024
in the ioctl payload wouldn't map to any register bits in the hardware.
Should KVM still store them somewhere?  Should they logically exist
for the purpose of the ioctl() API?

Making the size dynamic to avoid surplus bits doesn't work, because KVM
only supports power-of-two reg sizes, whereas SVE can support non-power-
of-two sizes.
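
To spell the constraint out (this is existing UAPI, not something new):
the size field of a reg ID encodes log2 of the payload size in bytes, so
a perfectly legal 384-bit vector (48 bytes) has no exact encoding:

#include <linux/kvm.h>

/*
 * Existing UAPI: the payload size is always a power of two bytes, derived
 * from the size field of the reg ID.  Hence the fixed 2048-bit slices,
 * with surplus bits above the real maximum vector length.
 */
static inline unsigned int reg_payload_bytes(__u64 id)
{
	return 1U << ((id & KVM_REG_SIZE_MASK) >> KVM_REG_SIZE_SHIFT);
}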

> Is the max vector length the max architecturally (current version)
> defined length, or what is chosen for this VM?

For now, that's an open question.

> >  * don't care (or not copied at all) on read; ignored on write
> >  * zero on read; ignored on write
> >  * zero on read; must be zero on write
> > 
> > Bits between the current and max vector length are trickier to specify:
> > the "current" vector length for ioctl access is ill-defined, because
> > we would need to specify ordering dependencies between Zn/Pn/FFR access
> > and access to ZCR_EL1.
> 
> I think you want userspace to be able to read/write these values in any
> order compared to configuring SVE for the VM, and then fix up whatever
> needs masking etc. in the kernel later, if possible.  Ordering
> requirements to userspace accesses have shown to be hard to enforce and
> get right in the past.

So I've heard from other people.

> What happens on hardware if you give a certain vector length to EL0, EL0
> writes a value of the full length, and then EL1 restricts the length to
> something smaller, and subsequently expands it again?  Is the original
> full value visible or are some bits potentially lost?  IOW, can't we
> rely on what the architecture says here?

The architecture says that bits that are hidden and then revealed again
are either preserved whilst hidden, or zeroed.

Opinion differs on whether that's a good thing to expose in ABI: Will
considered it unacceptable to expose this kind of behaviour around
syscalls from userspace for example, so I currently always zero the
bits in that case even though it's slightly more expensive.

The concern here was that userspace might unintentionally rely on
the affected register bits being preserved around syscalls when this
is not guaranteed by the implementation.

This does not mean that similar design considerations apply to the KVM
ioctl interface though.

> > 
> > So, it may be simpler to expose the full maximum supported vector size
> > unconditionally through ioctl, and pack/unpack as necessary.
> 
> yes, I think this was what I tried to say.
> 
> > 
> > Currently, data is packed in the vcpu struct in a vector length
> > dependent format, since this seems optimal for low-level save/restore,
> > so there will be potential data loss / zero padding when converting.
> > 
> > This may cause some unexpected effects.  For example:
> > 
> > KVM_SET_ONE_REG(ZCR_EL1, 0) 
> > /* Guest's current vector length will be 128 bits when started */
> > 
> > KVM_SET_ONE_REG(Z0, (uint256_t)1 << 128)
> > KVM_GET_ONE_REG(Z0) /* yields (uint256_t)1 << 128 */
> > 
> > KVM_RUN /* reg data packed down to 128-bit in vcpu struct */
> > 
> > KVM_GET_ONE_REG(Z0) /* yields 0 even if guest doesn't use SVE */
> > 
> 
> I really don't know how to parse this or what the point here is?  Sorry.

It means that for the ioctl interface, "obvious" guarantees like "if you
read a register you get back the last value written" don't work quite as
expected.  Some bits may have disappeared, or not, depending on the
precise scenario.

> > 
> > Since the guest should be treated mostly as a black box, I'm not sure
> > how big a deal this is.  The guest _might_ have explicitly set those
> > bits to 0... who are we to say?
> 
> How big a deal what is?  I'm lost.
> 
> > 
> > Can anyone think of a scenario where effects like this would matter?
> > 
> 
> I think we need more information.

See if my comments above throw any light on this.

I may be trying to make an issue out of nothing here.

Cheers
---Dave

* Re: [RFC] KVM API extensions for SVE
  2017-11-23 18:40     ` Dave Martin
@ 2017-11-24 14:45       ` Christoffer Dall
From: Christoffer Dall @ 2017-11-24 14:45 UTC (permalink / raw)
  To: Dave Martin; +Cc: Marc Zyngier, kvmarm, linux-arm-kernel

On Thu, Nov 23, 2017 at 06:40:50PM +0000, Dave Martin wrote:
> On Wed, Nov 22, 2017 at 08:52:30PM +0100, Christoffer Dall wrote:
> > Hi Dave,
> > 
> > On Tue, Nov 21, 2017 at 01:49:16PM +0000, Dave Martin wrote:
> > > Hi all,
> > > 
> > > SVE adds some new registers, and their size depends on the hardware and
> > > on runtime sysreg settings.
> > > 
> > > Before coding something, I'd like to get people's views on my current
> > > approach here.
> > > 
> > > --8<--
> > > 
> > > New vcpu feature flag:
> > > /*
> > >  * userspace can support regs up to at least 2048 bits in size via ioctl,
> > >  * and will honour the size field in the reg ID
> > >  */
> > > #define KVM_ARM_VCPU_LARGE_REGS	4
> > > 
> > > Should we just error out if userspace fails to set this on a system that
> > > supports SVE, or is that too brutal?  
> > 
> > That would break older userspace on newer kernels on certain hardware,
> > which would certainly not be acceptable.  Or did I misunderstand?
> 
> Yes, which is probably bad.
> 
> I'm still trying to gauge the policy regarding backwards compatibility.
> 

Think of QEMU as any other userspace application.  We should really
never regress userspace.

> 
> So, I guess what you're saying is that we should disable SVE for the guest
> but still run it in this case.
> 

That's slightly more debatable, but doing it any other way would break
any form of migration that relies on the guest configuration of SVE and
userspace would have no way to know.  I think this sounds like a bad
idea.

> 
> This creates another issue: if SVE is supported by the host kernel
> but not enabled for the guest, do I need to hide the SVE regs from
> KVM_GET_REG_LIST?

I don't think so.  We should check with QEMU and kvmtool, but I think
the way it works is that userspace matches the KVM list with its own
internal list, and matches up anything it knows about, and discards the
rest.  Certainly in the past we haven't been afraid of adding registers
to KVM_GET_REG_LIST.

> 
> If not, a guest that doesn't have SVE created on a host that supports
> SVE would not be migratable to a host kernel that doesn't support SVE
> but otherwise could run the guest:  as I understand it, the attempt to
> restore the SVE regs on the target node would fail because the host
> kernel doesn't recognise those regs.
> 
> Or do we not claim to support backwards compatibility for migration?
> 

I think QEMU and higher level tools like libvirt would have the
knowledge to figure this out and implement what they want, so from the
kernel's point of view, I think we can simply export the registers when
SVE is supported.

> > 
> > > If we do treat that as an error,
> > > then we can unconditionally enable SVE for the guest when the host
> > > supports it -- which cuts down on unnecessary implementation complexity.
> > 
> > I think it makes sense to be able to disable SVE, especially if there's
> > high overhead related to context switching the state.  Since you say the
> > implementation complexity is unnecessary, I may be missing some larger
> > point?
> 
> I don't think it's all that bad, but there doesn't seem to be any
> example of an optional architecture feature for a guest today --
> so I wanted to check before setting a precedent here.

We don't enable the GIC unless userspace asks for it, same with the
PMU...

> 
> Would "SVE enabled" be better as an attribute, rather than a
> feature, or does it not really matter?
> 

Doesn't matter.  It's a question of what you need in terms of the ABI.

> > > 
> > > Alternatively, we would need logic to disable SVE if userspace is too
> > > old, i.e., doesn't set this flag.  Then we might need to enforce that
> > > the flag is set the same on every vcpu -- but from the kernel's PoV
> > > it probably doesn't matter.
> > 
> > Not sure I understand why it doesn't matter from the kernel's PoV.
> > 
> > I think SVE should be disabled by default (as it is now) and then we
> > should expose a capability (potentially simply via the vcpu attributes
> > being present), and let userspace enable SVE and set a vector length.
> 
> Yes, but aren't all the attributes/features per-vcpu?
> 

Yes, so the kernel should check that everything is configured
consistently across all VCPUs.

> > It makes sense that userspace needs to know about SVE for VMs to use it,
> > doesn't it?
> 
> Yes and no.  Except for debug purposes I don't see why userspace needs
> to know anything except how to handle large registers through the ioctl
> interface.
> 

Migration is another reason.

> > I assume SVE systems will have SVE on all CPUs in an SMP system, or am I
> > being way too optimistic about the ARM ecosystem here?  Just like we
> > don't model big.LITTLE, I think we should enforce in the kernel, that
> 
> The kernel follows the same policy: if SVE is not present on all
> physical CPUs we disable it completely and hide it from guests and
> userspace.
> 
> For vector length I'm a little more permissive: the max vector length
> would be clamped to the minimum commonly supported vector length.
> 

Ok, so KVM could implement the same.  Or we could just be reasonable and
require userspace to configure all VCPUs the same.

> > userspace configures all VCPUs with the same SVE properties.
> 
> OK, so long as you think it's not superfluous to do it, then I'm happy
> to do it.
> 
> > > 
> > > /*
> > >  * For the SVE regs, we add some new reg IDs.
> > >  * Zn are presented in 2048-bit slices; Pn, FFR are presented in 256-bit
> > >  * slices.  This is sufficient for only a single slice to be required
> > >  * per register for SVE, but ensures expansion room in case future arch
> > >  * versions grow the maximum size.
> > >  */
> > 
> > I don't understand the last part of this comment?
> 
> This may be explained better in my response below.
> 
> > > #define KVM_REG_SIZE_U2048 (ULL(8) << KVM_REG_SIZE_MASK)
> > 
> > Shift left by KVM_REG_SIZE_MASK?  I'm confused.
> > 
> > > #define KVM_REG_ARM64_SVE_Z(n, i) /* Zn[2048 * (i + 1) - 1 : 2048 * i] */ \
> > > 	((0x0014 << KVM_REG_ARM_COPROC_SHIFT) | KVM_REG_SIZE_U2048 |	\
> > > 		((n) << 5) | (i))
> > > #define KVM_REG_ARM64_SVE_P(n, i) /* Pn[256 * (i + 1) - 1 : 256 * i] */	\
> > > 	((0x0014 << KVM_REG_ARM_COPROC_SHIFT) | KVM_REG_SIZE_U256 |	\
> > > 		(((n) + 32) << 5) | (i))
> > > #define KVM_REG_ARM64_SVE_FFR(i) /* FFR[256 * (i + 1) - 1 : 256 * i] */	\
> > > 	 KVM_REG_ARM64_SVE_P(16, i)
> > > 
> > > For j in [0,3], KVM_REG_ARM64_SVE_Z(n, 0) bits [32(j + 1) - 1 : 32 * j]
> > > 	alias KVM_REG_ARM_CORE_REG(fp_regs.vregs[n]) + j
> > > 
> > 
> > This is hard to read and understand the way presented here.  I would
> > suggest you formulate this suggestion in the form of an RFC patch to
> > Documentation/virtual/kvm/api.txt plus the header definitions.
> 
> Sure, I hadn't figured out the best way to present this: I was thinking
> aloud.
> 
> > (I'm not sure where to look to decode the "<< 5" and the
> > " + 32) << 5)" stuff above.)
> 
> The idea here is that we have 49 registers: Z0-Z31, P0-P15 and FFR.
> They are numbered serially.
> 
> However, the SVE architecture leaves the possibility of future
> expansion open, up to 32 times the current maximum.
> 
> The KVM reg API doesn't support such ridiculously huge registers,
> so my proposal is to slice them up, indexed by the value in the
> bottom 5 bits of the reg ID.  This requires the "register ID"
> field to be shifted up by 5 bits.
> 
> If the regs are not oversized (never, for the current architecture),
> then we simply don't list those extra slices via KVM_GET_REG_LIST.
> 
> > > Bits above the max vector length could be
> > 
> > Which bits are these and where are they, and why do we have them?
> 
> The KVM register size via ioctl is fixed at 2048 bits here.  Since
> the system might not support vectors that large, then bits 2047:1024
> in the ioctl payload wouldn't map to any register bits in the hardware.
> Should KVM still store them somewhere?  Should they logically exist
> for the purpose of the ioctl() API?
> 
> Making the size dynamic to avoid surplus bits doesn't work, because KVM
> only supports power-of-two reg sizes, whereas SVE can support non-power-
> of-two sizes.
> 
> > Is the max vector length the max architecturally (current version)
> > defined length, or what is chosen for this VM?
> 
> For now, that's an open question.
> 
> > >  * don't care (or not copied at all) on read; ignored on write
> > >  * zero on read; ignored on write
> > >  * zero on read; must be zero on write
> > > 
> > > Bits between the current and max vector length are trickier to specify:
> > > the "current" vector length for ioctl access is ill-defined, because
> > > we would need to specify ordering dependencies between Zn/Pn/FFR access
> > > and access to ZCR_EL1.
> > 
> > I think you want userspace to be able to read/write these values in any
> > order compared to configuring SVE for the VM, and then fix up whatever
> > needs masking etc. in the kernel later, if possible.  Ordering
> > requirements to userspace accesses have shown to be hard to enforce and
> > get right in the past.
> 
> So I've heard from other people.
> 
> > What happens on hardware if you give a certain vector length to EL0, EL0
> > writes a value of the full length, and then EL1 restricts the length to
> > something smaller, and subsequently expands it again?  Is the original
> > full value visible or are some bits potentially lost?  IOW, can't we
> > rely on what the architecture says here?
> 
> The architecture says that bits that are hidden and then revealed again
> are either preserved whilst hidden, or zeroed.
> 
> Opinion differs on whether that's a good thing to expose in ABI: Will
> considered it unacceptable to expose this kind of behaviour around
> syscalls from userspace for example, so I currently always zero the
> bits in that case even though it's slightly more expensive.
> 
> The concern here was that userspace might unintentionally rely on
> the affected register bits being preserved around syscalls when this
> is not guaranteed by the implementation.
> 
> This does not mean that similar design considerations apply to the KVM
> ioctl interface though.
> 

It sounds to me that the most simple thing is that the register
interface to userspace exposes the full possible register width in both
directions, and we apply a mask whenever we need to.
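
As a minimal sketch of what I mean (vl is the VM's configured vector
length in bytes; the names are made up):

#include <stdint.h>
#include <string.h>

/*
 * Present the full 2048-bit view to userspace and mask on the way out,
 * so bits beyond the configured vector length always read as zero.
 */
static void sve_z_to_user_view(uint8_t view[256], const uint8_t *live,
			       unsigned int vl)
{
	memcpy(view, live, vl);
	memset(view + vl, 0, 256 - vl);
}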

> > > 
> > > So, it may be simpler to expose the full maximum supported vector size
> > > unconditionally through ioctl, and pack/unpack as necessary.
> > 
> > yes, I think this was what I tried to say.
> > 
> > > 
> > > Currently, data is packed in the vcpu struct in a vector length
> > > dependent format, since this seems optimal for low-level save/restore,
> > > so there will be potential data loss / zero padding when converting.
> > > 
> > > This may cause some unexpected effects.  For example:
> > > 
> > > KVM_SET_ONE_REG(ZCR_EL1, 0) 
> > > /* Guest's current vector length will be 128 bits when started */
> > > 
> > > KVM_SET_ONE_REG(Z0, (uint256_t)1 << 128)
> > > KVM_GET_ONE_REG(Z0) /* yields (uint256_t)1 << 128 */
> > > 
> > > KVM_RUN /* reg data packed down to 128-bit in vcpu struct */
> > > 
> > > KVM_GET_ONE_REG(Z0) /* yields 0 even if guest doesn't use SVE */
> > > 
> > 
> > I really don't know how to parse this or what the point here is?  Sorry.
> 
> It means that for the ioctl interface, "obvious" guarantees like "if you
> read a register you get back the last value written" don't work quite as
> expected.  Some bits may have disappeared, or not, depending on the
> precise scenario.
> 
> > > 
> > > Since the guest should be treated mostly as a black box, I'm not sure
> > > how big a deal this is.  The guest _might_ have explicitly set those
> > > bits to 0... who are we to say?
> > 
> > How big a deal what is?  I'm lost.
> > 
> > > 
> > > Can anyone think of a scenario where effects like this would matter?
> > > 
> > 
> > I think we need more information.
> 
> See if my comments above throw any light on this.
> 

So you're saying even if we try the "expose full width and read back
hidden values" approach, those hidden values may be changed when
executing the guest, due to the KVM implementation or the way hardware
works, is that the point?

I think the KVM interface should be designed similarly to being able to
probe a hardware CPU's register state at various stages of execution.

So, for example, if you write content to hidden bits in the SVE
registers from EL2 on real hardware and limit the length using ZCR_EL2,
and then run a bunch of code in EL1/0, and then come back to EL2 and
examine the registers again, then we should model that behavior in
software.

In other words, I think we have to model this more closely to what
guarantees ZCR_EL2 gives us, and not ZCR_EL1, and choose something
architecturally compliant which is reasonable to implement.

Hope this helps,
-Christoffer

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [RFC] KVM API extensions for SVE
@ 2017-11-24 14:45       ` Christoffer Dall
  0 siblings, 0 replies; 18+ messages in thread
From: Christoffer Dall @ 2017-11-24 14:45 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, Nov 23, 2017 at 06:40:50PM +0000, Dave Martin wrote:
> On Wed, Nov 22, 2017 at 08:52:30PM +0100, Christoffer Dall wrote:
> > Hi Dave,
> > 
> > On Tue, Nov 21, 2017 at 01:49:16PM +0000, Dave Martin wrote:
> > > Hi all,
> > > 
> > > SVE adds some new registers, and their size depends on the hardware ando
> > > on runtime sysreg settings.
> > > 
> > > Before coding something, I'd like to get people's views on my current
> > > approach here.
> > > 
> > > --8<--
> > > 
> > > New vcpu feature flag:
> > > /*
> > >  * userspace can support regs up to at least 2048 bits in size via ioctl,
> > >  * and will honour the size field in the reg iD
> > >  */
> > > #define KVM_ARM_VCPU_LARGE_REGS	4
> > > 
> > > Should we just error out of userspace fails to set this on a system that
> > > supports SVE, or is that too brutal?  
> > 
> > That would break older userspace on newer kernels on certain hardware,
> > which would certainly not be acceptable.  Or did I misunderstand?
> 
> Yes, which is probably bad.
> 
> I'm still trying to gauge the policy regarding backwards compatibility.
> 

Think if QEMU as any other userspace application.  We should really
never regress userspace.

> 
> So, I guess you're saying is that we should disable SVE for the guest
> but still run it in this case.
> 

That's slightly more debatable, but doing it any other way would break
any form of migration that relies on the guest configuration of SVE and
userspace would have no way to know.  I think this sounds like a bad
idea.

> 
> This creates another issue: if SVE is supported by the host kernel
> but not enabled for the guest, do I need to hist the SVE regs from
> KVM_GET_REG_LIST?

I don't think so.  We should check with QEMU and kvmtool, but I think
the way it works is that userspace matches the KVM list with its own
internal list, and matches up anything it knows about, and discards the
rest.  Certainly in the past we haven't been afraid of adding registers
to KVM_GET_REG_LIST.

> 
> If not, a guest that doesn't have SVE created on a host that supports
> SVE would not be migratable to a host kernel that doesn't support SVE
> but otherwise could run the guest:  as I understand it, the attempt to
> restore the SVE regs on the target node would fail because the host
> kernel doesn't recognise those regs.
> 
> Or do we not claim to support backwards compatibility for migration?
> 

I think QEMU and higher level tools like libvirt would have the
knowledge to figure this out and implement what they want, so from the
kernel's point of view, I think we can simply export the registers when
SVE is supported.

> > 
> > > If we do treat that as an error,
> > > then we can unconditionally enable SVE for the guest when the host
> > > supports it -- which cuts down on unnecessary implementation complexity.
> > 
> > I think it makes sense to be able to disable SVE, especially if there's
> > high overhead related to context switching the state.  Since you say the
> > implementation complexity is unnecessary, I may be missing some larger
> > point?
> 
> I don't think it's all that bad, but there doesn't seem to be any
> example of an optional architecture feature for a guest today --
> so I wanted to check before setting a precedent here.

We don't enable the GIC unless userspace asks for it, same with the
PMU...

> 
> Would "SVE enabled" be better as an attribute, rather than a
> feature, or does it not really matter?
> 

Doesn't matter.  It's a question of what you need in terms of the ABI.

> > > 
> > > Alternatively, we would need logic to disable SVE if userspace is too
> > > old, i.e., doesn't set this flag.  Then we might need to enforce that
> > > the flag is set the same on every vcpu -- but from the kernel's PoV
> > > it probably doesn't matter.
> > 
> > Not sure I understand why it doesn't matter from the kernel's PoV.
> > 
> > I think SVE should be disabled by default (as it is now) and then we
> > should expose a capability (potentially simply via the vcpu attributes
> > being present), and let userspace enable SVE and set a vector length.
> 
> Yes, but aren't all the attributes/features per-vcpu?
> 

Yes, so the kernel should check that everything is configured
consistently across all VCPUs.

> > It makes sense that userspace needs to know about SVE for VMs to use it,
> > doesn't it?
> 
> Yes and no.  Except for debug purposes I don't see why userspace needs
> to know anything except how to handle large registers through the ioctl
> interface.
> 

Migration is another reason.

> > I assume SVE systems will have SVE on all CPUs in an SMP system, or am I
> > being way too optimistic about the ARM ecosystem here?  Just like we
> > don't model big.LITTLE, I think we should enforce in the kernel, that
> 
> The kernel follows the same policy: if SVE is not present on all
> physical CPUs we disable it completely and hide it from guests and
> userspace.
> 
> For vector length I'm a little more permissive: the max vector length
> would be clamped to the minimum commonly supported vector length.
> 

Ok, so KVM could implement the same.  Or we could just be reasonable and
require userspace to configure all VCPUs the same.

> > userspace configures all VCPUs with the same SVE properties.
> 
> OK, so long as you think it's not superfluous to do it, then I'm happy
> to do it.
> 
> > > 
> > > /*
> > >  * For the SVE regs, we add some new reg IDs.
> > >  * Zn are presented in 2048-bit slices; Pn, FFR are presented in 256-bit
> > >  * slices.  This is sufficient for only a single slice to be required
> > >  * per register for SVE, but ensures expansion room in case future arch
> > >  * versions grow the maximum size.
> > >  */
> > 
> > I don't understand the last part of this comment?
> 
> This may be explained better in my response below.
> 
> > > #define KVM_REG_SIZE_U2048 (ULL(8) << KVM_REG_SIZE_MASK)
> > 
> > Shift left by KVM_REG_SIZE_MASK?  I'm confused.
> > 
> > > #define KVM_REG_ARM64_SVE_Z(n, i) /* Zn[2048 * (i + 1) - 1 : 2048 * i] */ \
> > > 	((0x0014 << KVM_REG_ARM_COPROC_SHIFT) | KVM_REG_SIZE_U2048 |	\
> > > 		((n) << 5) | (i))
> > > #define KVM_REG_ARM64_SVE_P(n, i) /* Pn[256 * (i + 1) - 1 : 256 * i] */	\
> > > 	((0x0014 << KVM_REG_ARM_COPROC_SHIFT) | KVM_REG_SIZE_U256 |	\
> > > 		(((n) + 32) << 5) | (i))
> > > #define KVM_REG_ARM64_SVE_FFR(i) /* FFR[256 * (i + 1) - 1 : 256 * i] */	\
> > > 	 KVM_REG_ARM64_SVE_P(16, i)
> > > 
> > > For j in [0,3], KVM_REG_ARM64_SVE_Z(n, 0) bits [32(j + 1) - 1 : 32 * j]
> > > 	alias KVM_REG_ARM_CORE_REG(fp_regs.vregs[n]) + j
> > > 
> > 
> > This is hard to read and understand the way presented here.  I would
> > suggest you formulate this suggestion in the form of an RFC patch to
> > Documentation/virtual/kvm/api.txt plus the header definitions.
> 
> Sure, I hadn't figured out the best way to present this: I was thinking
> aloud.
> 
> > (I'm not sure where to look to decode the "<< 5" and the
> > " + 32) << 5)" stuff above.)
> 
> The idea here is that we have 49 registers: Z0-Z31, P0-P15 and FFR.
> They are numbered serially.
> 
> However, the SVE architecture leaves the possibility of future
> expansion open, up to 32 times the current maximum.
> 
> The KVM reg API doesn't support such ridiculously huge registers,
> so my proposal is to slice them up, indexed by the value in the
> bottom 5 bits of the reg ID.  This requires the "register ID"
> field to be shifted up by 5 bits.
> 
> If the regs are not oversized (never, for the current architecture),
> then we simply don't list those extra slices via KVM_GET_REG_LIST.
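
(So if I read the proposed macros right, decoding one of these IDs would go
roughly like this -- purely illustrative, field names made up:)

	/* low 16 bits of the reg ID, below KVM_REG_ARM_COPROC_SHIFT */
	unsigned int idx   = id & 0xffff;
	unsigned int num   = idx >> 5;	/* 0-31: Zn, 32-47: Pn, 48: FFR */
	unsigned int slice = idx & 0x1f;	/* always 0 for the current arch */
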
> 
> > > Bits above the max vector length could be
> > 
> > Which bits are these and where are they, and why do we have them?
> 
> The KVM register size via ioctl is fixed at 2048 bits here.  Since
> the system might not support vectors that large, then bits 2047:1024
> in the ioctl payload wouldn't map to any register bits in the hardware.
> Should KVM still store them somewhere?  Should they logically exist
> for the purpose of the ioctl() API?
> 
> Making the size dynamic to avoid surplus bits doesn't work, because KVM
> only supports power-of-two reg sizes, whereas SVE can support non-power-
> of-two sizes.
> 
> > Is the max vector length the max architecturally (current version)
> > defined length, or what is chosen for this VM?
> 
> For now, that's an open question.
> 
> > >  * don't care (or not copied at all) on read; ignored on write
> > >  * zero on read; ignored on write
> > >  * zero on read; must be zero on write
> > > 
> > > Bits between the current and max vector length are trickier to specify:
> > > the "current" vector length for ioctl access is ill-defined, because
> > > we would need to specify ordering dependencies between Zn/Pn/FFR access
> > > and access to ZCR_EL1.
> > 
> > I think you want userspace to be able to read/write these values in any
> > order compared to configuring SVE for the VM, and then fix up whatever
> > needs masking etc. in the kernel later, if possible.  Ordering
> > requirements to userspace accesses have shown to be hard to enforce and
> > get right in the past.
> 
> So I've heard from other people.
> 
> > What happens on hardware if you give a certain vector length to EL0, EL0
> > writes a value of the full length, and then EL1 restricts the length to
> > something smaller, and subsequently expands it again?  Is the original
> > full value visible or are some bits potentially lost?  IOW, can't we
> > rely on what the architecture says here?
> 
> The architecture says that bits that are hidden and then revealed again
> are either preserved whilst hidden, or zeroed.
> 
> Opinion differs on whether that's a good thing to expose in ABI: Will
> considered it unacceptable to expose this kind of behaviour around
> syscalls from userspace for example, so I currently always zero the
> bits in that case even though it's slightly more expensive.
> 
> The concern here was that userspace might unintentionally rely on
> the affected register bits being preserved around syscalls when this
> is not guaranteed by the implementation.
> 
> This does not mean that similar design considerations apply to the KVM
> ioctl interface though.
> 

It sounds to me that the simplest thing is that the register
interface to userspace exposes the full possible register width in both
directions, and we apply a mask whenever we need to.
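
(Something along these lines on the kernel side, i.e. the ioctl always deals
in the full 2048-bit/256-byte slice and the conversion helpers do the
masking -- helper names here are made up for illustration:)

	/* vl is the guest's vector length in bytes */
	static void sve_zreg_get_user_view(void *dst256, const void *packed,
					   unsigned int vl)
	{
		memcpy(dst256, packed, vl);
		memset(dst256 + vl, 0, 256 - vl);	/* zero above VL on read */
	}

	static void sve_zreg_set_from_user(void *packed, const void *src256,
					   unsigned int vl)
	{
		memcpy(packed, src256, vl);	/* bits above VL are ignored */
	}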

> > > 
> > > So, it may be simpler to expose the full maximum supported vector size
> > > unconditionally through ioctl, and pack/unpack as necessary.
> > 
> > yes, I think this was what I tried to say.
> > 
> > > 
> > > Currently, data is packed in the vcpu struct in a vector length
> > > dependent format, since this seems optimal for low-level save/restore,
> > > so there will be potential data loss / zero padding when converting.
> > > 
> > > This may cause some unexpected effects.  For example:
> > > 
> > > KVM_SET_ONE_REG(ZCR_EL1, 0) 
> > > /* Guest's current vector length will be 128 bits when started */
> > > 
> > > KVM_SET_ONE_REG(Z0, (uint256_t)1 << 128)
> > > KVM_GET_ONE_REG(Z0) /* yields (uint256_t)1 << 128 */
> > > 
> > > KVM_RUN /* reg data packed down to 128-bit in vcpu struct */
> > > 
> > > KVM_GET_ONE_REG(Z0) /* yields 0 even if guest doesn't use SVE */
> > > 
> > 
> > I really don't know how to parse this or what the point here is?  Sorry.
> 
> It means that for the ioctl interface, "obvious" guarantees like "if you
> read a register you get back the last value written" don't work quite as
> expected.  Some bits may have disappeared, or not, depending on the
> precise scenario.
> 
> > > 
> > > Since the guest should be treated mostly as a black box, I'm not sure
> > > how big a deal this is.  The guest _might_ have explicitly set those
> > > bits to 0... who are we to say?
> > 
> > How big a deal what is?  I'm lost.
> > 
> > > 
> > > Can anyone think of a scenario where effects like this would matter?
> > > 
> > 
> > I think we need more information.
> 
> See if my comments above throw any light on this.
> 

So you're saying even if we try the "expose full width and read back
hidden values" approach, those hidden values may be changed when
executing the guest, due to the KVM implementation or the way hardware
works, is that the point?

I think the KVM interface should be designed similarly to being able to
probe a hardware CPU's register state at various stages of execution.

So, for example, if you write content to hidden bits in the SVE
registers from EL2 on real hardware and limit the length using ZCR_EL2,
and then run a bunch of code in EL1/0, and then come back to EL2 and
examine the registers again, then we should model that behavior in
software.

In other words, I think we have to model this more closely to what
guarantees ZCR_EL2 gives us, and not ZCR_EL1, and choose something
architecturally compliant which is reasonable to implement.

Hope this helps,
-Christoffer

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC] KVM API extensions for SVE
  2017-11-24 14:45       ` Christoffer Dall
@ 2017-11-24 15:03         ` Peter Maydell
  -1 siblings, 0 replies; 18+ messages in thread
From: Peter Maydell @ 2017-11-24 15:03 UTC (permalink / raw)
  To: Christoffer Dall; +Cc: Marc Zyngier, Dave Martin, arm-mail-list, kvmarm

On 24 November 2017 at 14:45, Christoffer Dall <cdall@linaro.org> wrote:
> On Thu, Nov 23, 2017 at 06:40:50PM +0000, Dave Martin wrote:
>> This creates another issue: if SVE is supported by the host kernel
>> but not enabled for the guest, do I need to hide the SVE regs from
>> KVM_GET_REG_LIST?
>
> I don't think so.  We should check with QEMU and kvmtool, but I think
> the way it works is that userspace matches the KVM list with its own
> internal list, and matches up anything it knows about, and discards the
> rest.  Certainly in the past we haven't been afraid of adding registers
> to KVM_GET_REG_LIST.

QEMU userspace doesn't discard things it doesn't know about, it
just blindly passes them along in the migration stream to the
destination.

The awkward bit for SVE is that its registers for KVM_GET_REG_LIST
will be bigger than 64 bits. If you do that for existing QEMU
binaries they'll assert because of the check in kvm_arm_init_cpreg_list()
that the register size is something it can handle. So we mustn't
list the SVE regs in GET_REG_LIST unless userspace has told us it
can handle them (by setting a feature bit in the arguments to
KVM_ARM_VCPU_INIT, presumably).
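
(i.e. on the userspace side something like the following, using the
KVM_ARM_VCPU_LARGE_REGS bit proposed earlier in the thread -- exact name
and value still up for discussion:)

	struct kvm_vcpu_init init = {
		.target = KVM_ARM_TARGET_GENERIC_V8,
	};

	/* proposed feature bit, not yet in the uapi headers */
	init.features[KVM_ARM_VCPU_LARGE_REGS / 32] |=
		1u << (KVM_ARM_VCPU_LARGE_REGS % 32);

	ioctl(vcpu_fd, KVM_ARM_VCPU_INIT, &init);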

thanks
-- PMM

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC] KVM API extensions for SVE
  2017-11-24 14:45       ` Christoffer Dall
@ 2017-11-25 17:40         ` Andrew Jones
  -1 siblings, 0 replies; 18+ messages in thread
From: Andrew Jones @ 2017-11-25 17:40 UTC (permalink / raw)
  To: Christoffer Dall; +Cc: Marc Zyngier, Dave Martin, linux-arm-kernel, kvmarm

On Fri, Nov 24, 2017 at 03:45:38PM +0100, Christoffer Dall wrote:
> On Thu, Nov 23, 2017 at 06:40:50PM +0000, Dave Martin wrote:
> > 
> > I don't think it's all that bad, but there doesn't seem to be any
> > example of an optional architecture feature for a guest today --
> > so I wanted to check before setting a precedent here.
> 
> We don't enable the GIC unless userspace asks for it, same with the
> PMU...
>

And this reminds me that we need to apply the new ID register
trapping support Dave added to hide the PMU when userspace
doesn't enable it. Patch coming in a couple seconds...

drew

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC] KVM API extensions for SVE
  2017-11-24 14:45       ` Christoffer Dall
@ 2017-12-11 14:51         ` Dave Martin
  -1 siblings, 0 replies; 18+ messages in thread
From: Dave Martin @ 2017-12-11 14:51 UTC (permalink / raw)
  To: Christoffer Dall; +Cc: Marc Zyngier, kvmarm, linux-arm-kernel

On Fri, Nov 24, 2017 at 03:45:38PM +0100, Christoffer Dall wrote:
> On Thu, Nov 23, 2017 at 06:40:50PM +0000, Dave Martin wrote:
> > On Wed, Nov 22, 2017 at 08:52:30PM +0100, Christoffer Dall wrote:
> > > Hi Dave,
> > > 
> > > On Tue, Nov 21, 2017 at 01:49:16PM +0000, Dave Martin wrote:
> > > > Hi all,
> > > > 
> > > > SVE adds some new registers, and their size depends on the hardware and
> > > > on runtime sysreg settings.
> > > > 
> > > > Before coding something, I'd like to get people's views on my current
> > > > approach here.
> > > > 
> > > > --8<--
> > > > 
> > > > New vcpu feature flag:
> > > > /*
> > > >  * userspace can support regs up to at least 2048 bits in size via ioctl,
> > > >  * and will honour the size field in the reg ID
> > > >  */
> > > > #define KVM_ARM_VCPU_LARGE_REGS	4
> > > > 
> > > > Should we just error out if userspace fails to set this on a system that
> > > > supports SVE, or is that too brutal?  
> > > 
> > > That would break older userspace on newer kernels on certain hardware,
> > > which would certainly not be acceptable.  Or did I misunderstand?
> > 
> > Yes, which is probably bad.
> > 
> > I'm still trying to gauge the policy regarding backwards compatibility.
> > 
> 
> Think of QEMU as any other userspace application.  We should really
> never regress userspace.
> 
> > 
> > So, I guess you're saying that we should disable SVE for the guest
> > but still run it in this case.
> > 
> 
> That's slightly more debatable, but doing it any other way would break
> any form of migration that relies on the guest configuration of SVE and
> userspace would have no way to know.  I think this sounds like a bad
> idea.
> 
> > 
> > This creates another issue: if SVE is supported by the host kernel
> > but not enabled for the guest, do I need to hide the SVE regs from
> > KVM_GET_REG_LIST?
> 
> I don't think so.  We should check with QEMU and kvmtool, but I think
> the way it works is that userspace matches the KVM list with its own
> internal list, and matches up anything it knows about, and discards the
> rest.  Certainly in the past we haven't been afraid of adding registers
> to KVM_GET_REG_LIST.
> 
> > 
> > If not, a guest that doesn't have SVE created on a host that supports
> > SVE would not be migratable to a host kernel that doesn't support SVE
> > but otherwise could run the guest:  as I understand it, the attempt to
> > restore the SVE regs on the target node would fail because the host
> > kernel doesn't recognise those regs.
> > 
> > Or do we not claim to support backwards compatibility for migration?
> > 
> 
> I think QEMU and higher level tools like libvirt would have the
> knowledge to figure this out and implement what they want, so from the
> kernel's point of view, I think we can simply export the registers when
> SVE is supported.
> 
> > > 
> > > > If we do treat that as an error,
> > > > then we can unconditionally enable SVE for the guest when the host
> > > > supports it -- which cuts down on unnecessary implementation complexity.
> > > 
> > > I think it makes sense to be able to disable SVE, especially if there's
> > > high overhead related to context switching the state.  Since you say the
> > > implementation complexity is unnecessary, I may be missing some larger
> > > point?
> > 
> > I don't think it's all that bad, but there doesn't seem to be any
> > example of an optional architecture feature for a guest today --
> > so I wanted to check before setting a precedent here.
> 
> We don't enable the GIC unless userspace asks for it, same with the
> PMU...
> 
> > 
> > Would "SVE enabled" be better as an attribute, rather than a
> > feature, or does it not really matter?
> > 
> 
> Doesn't matter.  It's a question of what you need in terms of the ABI.
> 
> > > > 
> > > > Alternatively, we would need logic to disable SVE if userspace is too
> > > > old, i.e., doesn't set this flag.  Then we might need to enforce that
> > > > the flag is set the same on every vcpu -- but from the kernel's PoV
> > > > it probably doesn't matter.
> > > 
> > > Not sure I understand why it doesn't matter from the kernel's PoV.
> > > 
> > > I think SVE should be disabled by default (as it is now) and then we
> > > should expose a capability (potentially simply via the vcpu attributes
> > > being present), and let userspace enable SVE and set a vector length.
> > 
> > Yes, but aren't all the attributes/features per-vcpu?
> > 
> 
> Yes, so the kernel should check that everything is configured
> consistently across all VCPUs.
> 
> > > It makes sense that userspace needs to know about SVE for VMs to use it,
> > > doesn't it?
> > 
> > Yes and no.  Except for debug purposes I don't see why userspace needs
> > to know anything except how to handle large registers through the ioctl
> > interface.
> > 
> 
> Migration is another reason.
> 
> > > I assume SVE systems will have SVE on all CPUs in an SMP system, or am I
> > > being way too optimistic about the ARM ecosystem here?  Just like we
> > > don't model big.LITTLE, I think we should enforce in the kernel, that
> > 
> > The kernel follows the same policy: if SVE is not present on all
> > physical CPUs we disable it completely and hide it from guests and
> > userspace.
> > 
> > For vector length I'm a little more permissive: the max vector length
> > would be clamped to the minimum commonly supported vector length.
> > 
> 
> Ok, so KVM could implement the same.  Or we could just be reasonable and
> require userspace to configure all VCPUs the same.
> 
> > > userspace configures all VCPUs with the same SVE properties.
> > 
> > OK, so long as you think it's not superfluous to do it, then I'm happy
> > to do it.
> > 
> > > > 
> > > > /*
> > > >  * For the SVE regs, we add some new reg IDs.
> > > >  * Zn are presented in 2048-bit slices; Pn, FFR are presented in 256-bit
> > > >  * slices.  This is sufficient for only a single slice to be required
> > > >  * per register for SVE, but ensures expansion room in case future arch
> > > >  * versions grow the maximum size.
> > > >  */
> > > 
> > > I don't understand the last part of this comment?
> > 
> > This may be explained better in my response below.
> > 
> > > > #define KVM_REG_SIZE_U2048 (ULL(8) << KVM_REG_SIZE_MASK)
> > > 
> > > Shift left by KVM_REG_SIZE_MASK?  I'm confused.
> > > 
> > > > #define KVM_REG_ARM64_SVE_Z(n, i) /* Zn[2048 * (i + 1) - 1 : 2048 * i] */ \
> > > > 	((0x0014 << KVM_REG_ARM_COPROC_SHIFT) | KVM_REG_SIZE_U2048 |	\
> > > > 		((n) << 5) | (i))
> > > > #define KVM_REG_ARM64_SVE_P(n, i) /* Pn[256 * (i + 1) - 1 : 256 * i] */	\
> > > > 	((0x0014 << KVM_REG_ARM_COPROC_SHIFT) | KVM_REG_SIZE_U256 |	\
> > > > 		(((n) + 32) << 5) | (i))
> > > > #define KVM_REG_ARM64_SVE_FFR(i) /* FFR[256 * (i + 1) - 1 : 256 * i] */	\
> > > > 	 KVM_REG_ARM64_SVE_P(16, i)
> > > > 
> > > > For j in [0,3], KVM_REG_ARM64_SVE_Z(n, 0) bits [32(j + 1) - 1 : 32 * j]
> > > > 	alias KVM_REG_ARM_CORE_REG(fp_regs.vregs[n]) + j
> > > > 
> > > 
> > > This is hard to read and understand the way presented here.  I would
> > > suggest you formulate this suggestion in the form of an RFC patch to
> > > Documentation/virtual/kvm/api.txt plus the header definitions.
> > 
> > Sure, I hadn't figured out the best way to present this: I was thinking
> > aloud.
> > 
> > > (I'm not sure where to look to decode the "<< 5" and the
> > > " + 32) << 5)" stuff above.)
> > 
> > The idea here is that we have 49 registers: Z0-Z31, P0-P15 and FFR.
> > They are numbered serially.
> > 
> > However, the SVE architecture leaves the possibility of future
> > expansion open, up to 32 times the current maximum.
> > 
> > The KVM reg API doesn't support such ridiculously huge registers,
> > so my proposal is to slice them up, indexed by the value in the
> > bottom 5 bits of the reg ID.  This requires the "register ID"
> > field to be shifted up by 5 bits.
> > 
> > If the regs are not oversized (never, for the current architecture),
> > then we simply don't list those extra slices via KVM_GET_REG_LIST.
> > 
> > > > Bits above the max vector length could be
> > > 
> > > Which bits are these and where are they, and why do we have them?
> > 
> > The KVM register size via ioctl is fixed at 2048 bits here.  Since
> > the system might not support vectors that large, then bits 2047:1024
> > in the ioctl payload wouldn't map to any register bits in the hardware.
> > Should KVM still store them somewhere?  Should they logically exist
> > for the purpose of the ioctl() API?
> > 
> > Making the size dynamic to avoid surplus bits doesn't work, because KVM
> > only supports power-of-two reg sizes, whereas SVE can support non-power-
> > of-two sizes.
> > 
> > > Is the max vector length the max architecturally (current version)
> > > defined length, or what is chosen for this VM?
> > 
> > For now, that's an open question.
> > 
> > > >  * don't care (or not copied at all) on read; ignored on write
> > > >  * zero on read; ignored on write
> > > >  * zero on read; must be zero on write
> > > > 
> > > > Bits between the current and max vector length are trickier to specify:
> > > > the "current" vector length for ioctl access is ill-defined, because
> > > > we would need to specify ordering dependencies between Zn/Pn/FFR access
> > > > and access to ZCR_EL1.
> > > 
> > > I think you want userspace to be able to read/write these values in any
> > > order compared to configuring SVE for the VM, and then fix up whatever
> > > needs masking etc. in the kernel later, if possible.  Ordering
> > > requirements to userspace accesses have shown to be hard to enforce and
> > > get right in the past.
> > 
> > So I've heard from other people.
> > 
> > > What happens on hardware if you give a certain vector length to EL0, EL0
> > > writes a value of the full length, and then EL1 restricts the length to
> > > something smaller, and subsequently expands it again?  Is the original
> > > full value visible or are some bits potentially lost?  IOW, can't we
> > > rely on what the architecture says here?
> > 
> > The architecture says that bits that are hidden and then revealed again
> > are either preserved whilst hidden, or zeroed.
> > 
> > Opinion differs on whether that's a good thing to expose in ABI: Will
> > considered it unacceptable to expose this kind of behaviour around
> > syscalls from userspace for example, so I currently always zero the
> > bits in that case even though it's slightly more expensive.
> > 
> > The concern here was that userspace might unintentionally rely on
> > the affected register bits being preserved around syscalls when this
> > is not guaranteed by the implementation.
> > 
> > This does not mean that similar design considerations apply to the KVM
> > ioctl interface though.
> > 
> 
> It sounds to me that the simplest thing is that the register
> interface to userspace exposes the full possible register width in both
> directions, and we apply a mask whenever we need to.
> 
> > > > 
> > > > So, it may be simpler to expose the full maximum supported vector size
> > > > unconditionally through ioctl, and pack/unpack as necessary.
> > > 
> > > yes, I think this was what I tried to say.
> > > 
> > > > 
> > > > Currently, data is packed in the vcpu struct in a vector length
> > > > dependent format, since this seems optimal for low-level save/restore,
> > > > so there will be potential data loss / zero padding when converting.
> > > > 
> > > > This may cause some unexpected effects.  For example:
> > > > 
> > > > KVM_SET_ONE_REG(ZCR_EL1, 0) 
> > > > /* Guest's current vector length will be 128 bits when started */
> > > > 
> > > > KVM_SET_ONE_REG(Z0, (uint256_t)1 << 128)
> > > > KVM_GET_ONE_REG(Z0) /* yields (uint256_t)1 << 128 */
> > > > 
> > > > KVM_RUN /* reg data packed down to 128-bit in vcpu struct */
> > > > 
> > > > KVM_GET_ONE_REG(Z0) /* yields 0 even if guest doesn't use SVE */
> > > > 
> > > 
> > > I really don't know how to parse this or what the point here is?  Sorry.
> > 
> > It means that for the ioctl interface, "obvious" guarantees like "if you
> > read a register you get back the last value written" don't work quite as
> > expected.  Some bits may have disappeared, or not, depending on the
> > precise scenario.
> > 
> > > > 
> > > > Since the guest should be treated mostly as a black box, I'm not sure
> > > > how big a deal this is.  The guest _might_ have explicitly set those
> > > > bits to 0... who are we to say?
> > > 
> > > How big a deal what is?  I'm lost.
> > > 
> > > > 
> > > > Can anyone think of a scenario where effects like this would matter?
> > > > 
> > > 
> > > I think we need more information.
> > 
> > See if my comments above throw any light on this.
> > 
> 
> So you're saying even if we try the "expose full width and read back
> hidden values" approach, those hidden values may be changed when
> executing the guest, due to the KVM implementation or the way hardware
> works, is that the point?

Basically yes.

> I think the KVM interface should be designed similarly to being able to
> probe a hardware CPU's register state at various stages of execution.
> 
> So, for example, if you write content to hidden bits in the SVE
> registers from EL2 on real hardware and limit the length using ZCR_EL2,
> and then run a bunch of code in EL1/0, and then come back to EL2 and
> examine the registers again, then we should model that behavior in
> software.
> 
> In other words, I think we have to model this more closely to what
> guarantees ZCR_EL2 gives us, and not ZCR_EL1, and choose something
> architecturally compliant which is reasonable to implement.

So, we imagine that provided the vcpu is not run in the meantime,
all accesses to SVE regs via the KVM reg API act like they are executed
at EL2?

That doesn't seem unreasonable, and it removes any ordering requirement
between ZCR_EL1 and the SVE regs providing that the vcpu isn't set
running in the meantime.  There is no userspace access to ZCR_EL2 at
all, if we go with the model of configuring that via attributes that
must be configured before vcpu startup -- in which case there is no
ordering requirement there.

The extra bits beyond ZCR_EL1.LEN may disappear as soon as the vcpu
is run, but that is architecturally consistent behaviour at least.
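
(So from userspace's point of view, something like the following would be
valid in either order before the vcpu is first run -- the reg ID macro is
the one proposed earlier in the thread, and the buffer handling here is
illustrative only:)

	uint8_t z0[256];	/* one full 2048-bit Z-register slice */

	struct kvm_one_reg zreg = {
		.id   = KVM_REG_ARM64_SVE_Z(0, 0),	/* proposed ID */
		.addr = (__u64)(unsigned long)z0,
	};

	ioctl(vcpu_fd, KVM_SET_ONE_REG, &zreg);

	/* A KVM_SET_ONE_REG of ZCR_EL1 can go before or after this with
	 * the same end result, provided KVM_RUN hasn't happened yet. */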

Cheers
---Dave

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC] KVM API extensions for SVE
  2017-12-11 14:51         ` Dave Martin
@ 2017-12-11 19:24           ` Christoffer Dall
  -1 siblings, 0 replies; 18+ messages in thread
From: Christoffer Dall @ 2017-12-11 19:24 UTC (permalink / raw)
  To: Dave Martin; +Cc: Marc Zyngier, Christoffer Dall, kvmarm, linux-arm-kernel

On Mon, Dec 11, 2017 at 02:51:36PM +0000, Dave Martin wrote:
> On Fri, Nov 24, 2017 at 03:45:38PM +0100, Christoffer Dall wrote:
> > On Thu, Nov 23, 2017 at 06:40:50PM +0000, Dave Martin wrote:
> > > On Wed, Nov 22, 2017 at 08:52:30PM +0100, Christoffer Dall wrote:
> > > > Hi Dave,
> > > > 
> > > > On Tue, Nov 21, 2017 at 01:49:16PM +0000, Dave Martin wrote:
> > > > > Hi all,
> > > > > 
> > > > > SVE adds some new registers, and their size depends on the hardware and
> > > > > on runtime sysreg settings.
> > > > > 
> > > > > Before coding something, I'd like to get people's views on my current
> > > > > approach here.
> > > > > 
> > > > > --8<--
> > > > > 
> > > > > New vcpu feature flag:
> > > > > /*
> > > > >  * userspace can support regs up to at least 2048 bits in size via ioctl,
> > > > >  * and will honour the size field in the reg ID
> > > > >  */
> > > > > #define KVM_ARM_VCPU_LARGE_REGS	4
> > > > > 
> > > > > Should we just error out if userspace fails to set this on a system that
> > > > > supports SVE, or is that too brutal?  
> > > > 
> > > > That would break older userspace on newer kernels on certain hardware,
> > > > which would certainly not be acceptable.  Or did I misunderstand?
> > > 
> > > Yes, which is probably bad.
> > > 
> > > I'm still trying to gauge the policy regarding backwards compatibility.
> > > 
> > 
> > Think of QEMU as any other userspace application.  We should really
> > never regress userspace.
> > 
> > > 
> > > So, I guess you're saying that we should disable SVE for the guest
> > > but still run it in this case.
> > > 
> > 
> > That's slightly more debatable, but doing it any other way would break
> > any form of migration that relies on the guest configuration of SVE and
> > userspace would have no way to know.  I think this sounds like a bad
> > idea.
> > 
> > > 
> > > This creates another issue: if SVE is supported by the host kernel
> > > but not enabled for the guest, do I need to hide the SVE regs from
> > > KVM_GET_REG_LIST?
> > 
> > I don't think so.  We should check with QEMU and kvmtool, but I think
> > the way it works is that userspace matches the KVM list with its own
> > internal list, and matches up anything it knows about, and discards the
> > rest.  Certainly in the past we haven't been afraid of adding registers
> > to KVM_GET_REG_LIST.
> > 
> > > 
> > > If not, a guest that doesn't have SVE created on a host that supports
> > > SVE would not be migratable to a host kernel that doesn't support SVE
> > > but otherwise could run the guest:  as I understand it, the attempt to
> > > restore the SVE regs on the target node would fail because the host
> > > kernel doesn't recognise those regs.
> > > 
> > > Or do we not claim to support backwards compatibility for migration?
> > > 
> > 
> > I think QEMU and higher level tools like libvirt would have the
> > knowledge to figure this out and implement what they want, so from the
> > kernel's point of view, I think we can simply export the registers when
> > SVE is supported.
> > 
> > > > 
> > > > > If we do treat that as an error,
> > > > > then we can unconditionally enable SVE for the guest when the host
> > > > > supports it -- which cuts down on unnecessary implementation complexity.
> > > > 
> > > > I think it makes sense to be able to disable SVE, especially if there's
> > > > high overhead related to context switching the state.  Since you say the
> > > > implementation complexity is unnecessary, I may be missing some larger
> > > > point?
> > > 
> > > I don't think it's all that bad, but there doesn't seem to be any
> > > example of an optional architecture feature for a guest today --
> > > so I wanted to check before setting a precedent here.
> > 
> > We don't enable the GIC unless userspace asks for it, same with the
> > PMU...
> > 
> > > 
> > > Would "SVE enabled" be better as an attribute, rather than a
> > > feature, or does it not really matter?
> > > 
> > 
> > Doesn't matter.  It's a question of what you need in terms of the ABI.
> > 
> > > > > 
> > > > > Alternatively, we would need logic to disable SVE if userspace is too
> > > > > old, i.e., doesn't set this flag.  Then we might need to enforce that
> > > > > the flag is set the same on every vcpu -- but from the kernel's PoV
> > > > > it probably doesn't matter.
> > > > 
> > > > Not sure I understand why it doesn't matter from the kernel's PoV.
> > > > 
> > > > I think SVE should be disabled by default (as it is now) and then we
> > > > should expose a capability (potentially simply via the vcpu attributes
> > > > being present), and let userspace enable SVE and set a vector length.
> > > 
> > > Yes, but aren't all the attributes/features per-vcpu?
> > > 
> > 
> > Yes, so the kernel should check that everything is configured
> > consistently across all VCPUs.
> > 
> > > > It makes sense that userspace needs to know about SVE for VMs to use it,
> > > > doesn't it?
> > > 
> > > Yes and no.  Except for debug purposes I don't see why userspace needs
> > > to know anything except how to handle large registers through the ioctl
> > > interface.
> > > 
> > 
> > Migration is another reason.
> > 
> > > > I assume SVE systems will have SVE on all CPUs in an SMP system, or am I
> > > > being way too optimistic about the ARM ecosystem here?  Just like we
> > > > don't model big.LITTLE, I think we should enforce in the kernel, that
> > > 
> > > The kernel follows the same policy: if SVE is not present on all
> > > physical CPUs we disable it completely and hide it from guests and
> > > userspace.
> > > 
> > > For vector length I'm a little more permissive: the max vector length
> > > would be clamped to the minimum commonly supported vector length.
> > > 
> > 
> > Ok, so KVM could implement the same.  Or we could just be reasonable and
> > require userspace to configure all VCPUs the same.
> > 
> > > > userspace configures all VCPUs with the same SVE properties.
> > > 
> > > OK, so long as you think it's not superfluous to do it, then I'm happy
> > > to do it.
> > > 
> > > > > 
> > > > > /*
> > > > >  * For the SVE regs, we add some new reg IDs.
> > > > >  * Zn are presented in 2048-bit slices; Pn, FFR are presented in 256-bit
> > > > >  * slices.  This is sufficient for only a single slice to be required
> > > > >  * per register for SVE, but ensures expansion room in case future arch
> > > > >  * versions grow the maximum size.
> > > > >  */
> > > > 
> > > > I don't understand the last part of this comment?
> > > 
> > > This may be explained better in my response below.
> > > 
> > > > > #define KVM_REG_SIZE_U2048 (ULL(8) << KVM_REG_SIZE_MASK)
> > > > 
> > > > Shift left by KVM_REG_SIZE_MASK?  I'm confused.
> > > > 
> > > > > #define KVM_REG_ARM64_SVE_Z(n, i) /* Zn[2048 * (i + 1) - 1 : 2048 * i] */ \
> > > > > 	((0x0014 << KVM_REG_ARM_COPROC_SHIFT) | KVM_REG_SIZE_U2048 |	\
> > > > > 		((n) << 5) | (i))
> > > > > #define KVM_REG_ARM64_SVE_P(n, i) /* Pn[256 * (i + 1) - 1 : 256 * i] */	\
> > > > > 	((0x0014 << KVM_REG_ARM_COPROC_SHIFT) | KVM_REG_SIZE_U256 |	\
> > > > > 		(((n) + 32) << 5) | (i))
> > > > > #define KVM_REG_ARM64_SVE_FFR(i) /* FFR[256 * (i + 1) - 1 : 256 * i] */	\
> > > > > 	 KVM_REG_ARM64_SVE_P(16, i)
> > > > > 
> > > > > For j in [0,3], KVM_REG_ARM64_SVE_Z(n, 0) bits [32(j + 1) - 1 : 32 * j]
> > > > > 	alias KVM_REG_ARM_CORE_REG(fp_regs.vregs[n]) + j
> > > > > 
> > > > 
> > > > This is hard to read and understand the way presented here.  I would
> > > > suggest you formulate this suggestion in the form of an RFC patch to
> > > > Documentation/virtual/kvm/api.txt plus the header definitions.
> > > 
> > > Sure, I hadn't figured out the best way to present this: I was thinking
> > > aloud.
> > > 
> > > > (I'm not sure where to look to decode the "<< 5" and the
> > > > " + 32) << 5)" stuff above.)
> > > 
> > > The idea here is that we have 49 registers: Z0-Z31, P0-P15 and FFR.
> > > They are numbered serially.
> > > 
> > > However, the SVE architecture leaves the possibility of future
> > > expansion open, up to 32 times the current maximum.
> > > 
> > > The KVM reg API doesn't support such ridiculously huge registers,
> > > so my proposal is to slice them up, indexed by the value in the
> > > bottom 5 bits of the reg ID.  This requires the "register ID"
> > > field to be shifted up by 5 bits.
> > > 
> > > If the regs are not oversized (never, for the current architecture),
> > > then we simply don't list those extra slices via KVM_GET_REG_LIST.
> > > 
> > > > > Bits above the max vector length could be
> > > > 
> > > > Which bits are these and where are they, and why do we have them?
> > > 
> > > The KVM register size via ioctl is fixed at 2048 bits here.  Since
> > > the system might not support vectors that large, then bits 2047:1024
> > > in the ioctl payload wouldn't map to any register bits in the hardware.
> > > Should KVM still store them somewhere?  Should they logically exist
> > > for the purpose of the ioctl() API?
> > > 
> > > Making the size dynamic to avoid surplus bits doesn't work, because KVM
> > > only supports power-of-two reg sizes, whereas SVE can support non-power-
> > > of-two sizes.
> > > 
> > > > Is the max vector length the max architecturally (current version)
> > > > defined length, or what is chosen for this VM?
> > > 
> > > For now, that's an open question.
> > > 
> > > > >  * don't care (or not copied at all) on read; ignored on write
> > > > >  * zero on read; ignored on write
> > > > >  * zero on read; must be zero on write
> > > > > 
> > > > > Bits between the current and max vector length are trickier to specify:
> > > > > the "current" vector length for ioctl access is ill-defined, because
> > > > > we would need to specify ordering dependencies between Zn/Pn/FFR access
> > > > > and access to ZCR_EL1.
> > > > 
> > > > I think you want userspace to be able to read/write these values in any
> > > > order compared to configuring SVE for the VM, and then fix up whatever
> > > > needs masking etc. in the kernel later, if possible.  Ordering
> > > > requirements to userspace accesses have shown to be hard to enforce and
> > > > get right in the past.
> > > 
> > > So I've heard from other people.
> > > 
> > > > What happens on hardware if you give a certain vector length to EL0, EL0
> > > > writes a value of the full length, and then EL1 restricts the length to
> > > > something smaller, and subsequently expands it again?  Is the original
> > > > full value visible or are some bits potentially lost?  IOW, can't we
> > > > rely on what the architecture says here?
> > > 
> > > The architecture says that bits that are hidden and then revealed again
> > > are either preserved whilst hidden, or zeroed.
> > > 
> > > Opinion differs on whether that's a good thing to expose in ABI: Will
> > > considered it unacceptable to expose this kind of behaviour around
> > > syscalls from userspace for example, so I currently always zero the
> > > bits in that case even though it's slightly more expensive.
> > > 
> > > The concern here was that userspace might unintentionally rely on
> > > the affected register bits being preserved around syscalls when this
> > > is not guaranteed by the implementation.
> > > 
> > > This does not mean that similar design considerations apply to the KVM
> > > ioctl interface though.
> > > 
> > 
> > It sounds to me that the simplest thing is that the register
> > interface to userspace exposes the full possible register width in both
> > directions, and we apply a mask whenever we need to.
> > 
> > > > > 
> > > > > So, it may be simpler to expose the full maximum supported vector size
> > > > > unconditionally through ioctl, and pack/unpack as necessary.
> > > > 
> > > > yes, I think this was what I tried to say.
> > > > 
> > > > > 
> > > > > Currently, data is packed in the vcpu struct in a vector length
> > > > > dependent format, since this seems optimal for low-level save/restore,
> > > > > so there will be potential data loss / zero padding when converting.
> > > > > 
> > > > > This may cause some unexpected effects.  For example:
> > > > > 
> > > > > KVM_SET_ONE_REG(ZCR_EL1, 0) 
> > > > > /* Guest's current vector length will be 128 bits when started */
> > > > > 
> > > > > KVM_SET_ONE_REG(Z0, (uint256_t)1 << 128)
> > > > > KVM_GET_ONE_REG(Z0) /* yields (uint256_t)1 << 128 */
> > > > > 
> > > > > KVM_RUN /* reg data packed down to 128-bit in vcpu struct */
> > > > > 
> > > > > KVM_GET_ONE_REG(Z0) /* yields 0 even if guest doesn't use SVE */
> > > > > 
> > > > 
> > > > I really don't know how to parse this or what the point here is?  Sorry.
> > > 
> > > It means that for the ioctl interface, "obvious" guarantees like "if you
> > > read a register you get back the last value written" don't work quite as
> > > expected.  Some bits may have disappeared, or not, depending on the
> > > precise scenario.
> > > 
> > > > > 
> > > > > Since the guest should be treated mostly as a black box, I'm not sure
> > > > > how big a deal this is.  The guest _might_ have explicitly set those
> > > > > bits to 0... who are we to say?
> > > > 
> > > > How big a deal what is?  I'm lost.
> > > > 
> > > > > 
> > > > > Can anyone think of a scenario where effects like this would matter?
> > > > > 
> > > > 
> > > > I think we need more information.
> > > 
> > > See if my comments above throw any light on this.
> > > 
> > 
> > So you're saying even if we try the "expose full width and read back
> > hidden values" approach, those hidden values may be changed when
> > executing the guest, due to the KVM implementation or the way hardware
> > works, is that the point?
> 
> Basically yes.
> 
> > I think the KVM interface should be designed to behave like probing a
> > hardware CPU's register state at various stages of execution.
> > 
> > So, for example, if you write content to hidden bits in the SVE
> > registers from EL2 on real hardware and limit the length using ZCR_EL2,
> > and then run a bunch of code in EL1/0, and then come back to EL2 and
> > examine the registers again, then we should model that behavior in
> > software.
> > 
> > In other words, I think we have to model this more closely to what
> > guarantees ZCR_EL2 gives us, and not ZCR_EL1, and choose something
> > architecturally compliant which is reasonable to implement.
> 
> So, we imagine that provided the vcpu is not run in the meantime,
> all accesses to SVE regs via the KVM reg API act like they are executed
> at EL2?

Yes, userspace probing virtual EL1 state should be like EL2 probing EL1
state on hardware.

> 
> That doesn't seem unreasonable, and it removes any ordering requirement
> between ZCR_EL1 and the SVE regs providing that the vcpu isn't set
> running in the meantime.  There is no userspace access to ZCR_EL2 at
> all, if we go with the model of configuring that via attributes that
> must be configured before vcpu startup -- in which case there is no
> ordering requirement there.
> 
> The extra bits beyond ZCR_EL1.LEN may disappear as soon as the vcpu
> is run, but that is architecturally consistent behaviour at least.
> 

Yes, I think we agree here.  It will all be interesting with nested
virtualization where we have to start exposing ZCR_EL2, but that's not
for today.

Thanks,
-Christoffer

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC] KVM API extensions for SVE
  2017-12-11 19:24           ` Christoffer Dall
@ 2017-12-12 11:12             ` Dave Martin
  -1 siblings, 0 replies; 18+ messages in thread
From: Dave Martin @ 2017-12-12 11:12 UTC (permalink / raw)
  To: Christoffer Dall; +Cc: Marc Zyngier, Christoffer Dall, kvmarm, linux-arm-kernel

On Mon, Dec 11, 2017 at 08:24:32PM +0100, Christoffer Dall wrote:
> On Mon, Dec 11, 2017 at 02:51:36PM +0000, Dave Martin wrote:
> > On Fri, Nov 24, 2017 at 03:45:38PM +0100, Christoffer Dall wrote:

[...]

> > > So you're saying even if we try the "expose full width and read back
> > > hidden values" approach, those hidden values may be changed when
> > > executing the guest, due to the KVM implementation or the way hardware
> > > works, is that the point?
> > 
> > Basically yes.
> > 
> > > I think the KVM interface should be designed to behave like probing a
> > > hardware CPU's register state at various stages of execution.
> > > 
> > > So, for example, if you write content to hidden bits in the SVE
> > > registers from EL2 on real hardware and limit the length using ZCR_EL2,
> > > and then run a bunch of code in EL1/0, and then come back to EL2 and
> > > examine the registers again, then we should model that behavior in
> > > software.
> > > 
> > > In other words, I think we have to model this more closely to what
> > > guarantees ZCR_EL2 gives us, and not ZCR_EL1, and choose something
> > > architecturally compliant which is reasonable to implement.
> > 
> > So, we imagine that provided the vcpu is not run in the meantime,
> > all accesses to SVE regs via the KVM reg API act like they are executed
> > at EL2?
> 
> Yes, userspace probing virtual EL1 state should be like EL2 probing EL1
> state on hardware.
> 
> > 
> > That doesn't seem unreasonable, and it removes any ordering requirement
> > between ZCR_EL1 and the SVE regs providing that the vcpu isn't set
> > running in the meantime.  There is no userspace access to ZCR_EL2 at
> > all, if we go with the model of configuring that via attributes that
> > must be configured before vcpu startup -- in which case there is no
> > ordering requirement there.
> > 
> > The extra bits beyond ZCR_EL1.LEN may disappear as soon as the vcpu
> > is run, but that is architecturally consistent behaviour at least.
> > 
> 
> Yes, I think we agree here.  It will all be interesting with nested
> virtualization where we have to start exposing ZCR_EL2, but that's not
> for today.

OK, that sounds reasonable.  There are a couple of options open for the
nested virt case, but we don't need to worry about it now in any case.
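
For reference, the reason those bits can vanish across KVM_RUN is
simply the vector-length-dependent packing in the vcpu struct.  The
shape of the conversion is roughly as below; this is a sketch with
invented names, not the actual patch:

#include <string.h>

#define SVE_Z_USER_BYTES	(2048 / 8)	/* fixed-size ioctl view */

/* ioctl view -> packed vcpu storage: bits beyond the vector length are dropped */
static void sve_z_pack(void *vcpu_storage, const void *user_view,
		       unsigned int vl_bytes)
{
	memcpy(vcpu_storage, user_view, vl_bytes);
}

/* packed vcpu storage -> ioctl view: the surplus bits are filled with zeroes */
static void sve_z_unpack(void *user_view, const void *vcpu_storage,
			 unsigned int vl_bytes)
{
	memcpy(user_view, vcpu_storage, vl_bytes);
	memset((char *)user_view + vl_bytes, 0, SVE_Z_USER_BYTES - vl_bytes);
}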

Cheers
---Dave

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2017-12-12 11:12 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-11-21 13:49 [RFC] KVM API extensions for SVE Dave Martin
2017-11-22 19:52 ` Christoffer Dall
2017-11-23 18:40   ` Dave Martin
2017-11-24 14:45     ` Christoffer Dall
2017-11-24 15:03       ` Peter Maydell
2017-11-25 17:40       ` Andrew Jones
2017-12-11 14:51       ` Dave Martin
2017-12-11 19:24         ` Christoffer Dall
2017-12-12 11:12           ` Dave Martin
