* [PATCH v3 00/22] xvmalloc() / x86 xstate area / x86 CPUID / AMX+XFD
@ 2021-04-22 14:38 Jan Beulich
  2021-04-22 14:43 ` [PATCH v3 01/22] mm: introduce xvmalloc() et al and use for grant table allocations Jan Beulich
                   ` (21 more replies)
  0 siblings, 22 replies; 40+ messages in thread
From: Jan Beulich @ 2021-04-22 14:38 UTC (permalink / raw)
  To: xen-devel
  Cc: Andrew Cooper, George Dunlap, Ian Jackson, Julien Grall,
	Stefano Stabellini, Wei Liu, Roger Pau Monné

While the sub-groups may seem somewhat unrelated, there are inter-
dependencies (logical, functional, or contextual). The majority of
the patches are unchanged in v3, but there are a few new ones. See
individual patches for details.

The first patch here depends on "gnttab: defer allocation of status
frame tracking array", which sadly is still pending (with the
discussion, as so often, stalled).

Similarly, the discussion on whether to introduce xvmalloc() in its
presented shape, or what alternative would be acceptable, has stalled.
The Working Group, so far, hasn't really helped make progress here.
One thing I was wondering, which came to mind after seeing something
similarly named elsewhere: would it help to name the new functions
xnew(), xdelete(), and the like? To me this would make it more natural
to suggest that new code use them in preference to xmalloc() or
vmalloc(), and that we'd (slowly) transition existing code to use them.
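
Purely to illustrate that (hypothetical) naming - nothing below is part
of the series - such wrappers could be thin aliases of the interfaces
patch 01 introduces:

    /* Hypothetical sketch only; these names are not part of this series. */
    #define xnew(type)            xvmalloc(type)
    #define xnew_array(type, nr)  xvmalloc_array(type, nr)
    #define xdelete(p)            xvfree(p)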

01: mm: introduce xvmalloc() et al and use for grant table allocations
02: x86/xstate: use xvzalloc() for save area allocation
03: x86/xstate: re-size save area when CPUID policy changes
04: x86/xstate: re-use valid_xcr0() for boot-time checks
05: x86/xstate: drop xstate_offsets[] and xstate_sizes[]
06: x86/xstate: replace xsave_cntxt_size and drop XCNTXT_MASK
07: x86/xstate: avoid accounting for unsupported components
08: x86: use xvmalloc() for extended context buffer allocations
09: x86/xstate: enable AMX components
10: x86/CPUID: adjust extended leaves out of range clearing
11: x86/CPUID: move bounding of max_{,sub}leaf fields to library code
12: x86/CPUID: enable AMX leaves
13: x86: XFD enabling
14: x86emul: introduce X86EMUL_FPU_{tilecfg,tile}
15: x86emul: support TILERELEASE
16: x86: introduce struct for TILECFG register
17: x86emul: support {LD,ST}TILECFG
18: x86emul: support TILEZERO
19: x86emul: support TILELOADD{,T1} and TILESTORED
20: x86emul: support tile multiplication insns
21: x86emul: test AMX insns
22: x86: permit guests to use AMX and XFD

Jan



* [PATCH v3 01/22] mm: introduce xvmalloc() et al and use for grant table allocations
  2021-04-22 14:38 [PATCH v3 00/22] xvmalloc() / x86 xstate area / x86 CPUID / AMX+XFD Jan Beulich
@ 2021-04-22 14:43 ` Jan Beulich
  2021-05-03 11:31   ` Roger Pau Monné
  2021-04-22 14:44 ` [PATCH v3 02/22] x86/xstate: use xvzalloc() for save area allocation Jan Beulich
                   ` (20 subsequent siblings)
  21 siblings, 1 reply; 40+ messages in thread
From: Jan Beulich @ 2021-04-22 14:43 UTC (permalink / raw)
  To: xen-devel
  Cc: Andrew Cooper, George Dunlap, Ian Jackson, Julien Grall,
	Stefano Stabellini, Wei Liu, Roger Pau Monné

All of the array allocations in grant_table_init() can exceed a page's
worth of memory, which xmalloc()-based interfaces aren't really suitable
for after boot. We also don't need any of these allocations to be
physically contiguous. Introduce interfaces which dynamically switch
between xmalloc() et al and vmalloc() et al, based on the requested
size, and use them instead.

All the wrappers in the new header get cloned mostly verbatim from
xmalloc.h, with the sole adjustment of switching unsigned long to size_t
for sizes and to unsigned int for alignments.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v2: Actually edit a copy-and-pasted comment in xvmalloc.h which was
    meant to be edited from the beginning.
---
I'm unconvinced by the mention of "physically contiguous" in the
comment at the top of the new header: I don't think xmalloc() provides
such a guarantee. Any use assuming so would look (latently) broken to
me.
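
As a usage illustration (my sketch, not part of the patch; the struct
and field names are made up), a caller that may need more than a page's
worth of memory simply switches wrapper and free function:

    /*
     * xvzalloc_array() uses xmalloc() underneath for small sizes and
     * vmalloc() beyond a page's worth.
     */
    static int alloc_items(struct foo *f, unsigned int nr)
    {
        f->items = xvzalloc_array(struct item, nr);
        return f->items ? 0 : -ENOMEM;
    }

    static void free_items(struct foo *f)
    {
        XVFREE(f->items);   /* frees either way and NULLs the pointer */
    }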

--- a/xen/common/grant_table.c
+++ b/xen/common/grant_table.c
@@ -37,7 +37,7 @@
 #include <xen/iommu.h>
 #include <xen/paging.h>
 #include <xen/keyhandler.h>
-#include <xen/vmap.h>
+#include <xen/xvmalloc.h>
 #include <xen/nospec.h>
 #include <xsm/xsm.h>
 #include <asm/flushtlb.h>
@@ -1749,8 +1749,8 @@ gnttab_populate_status_frames(struct dom
 
     if ( gt->status == ZERO_BLOCK_PTR )
     {
-        gt->status = xzalloc_array(grant_status_t *,
-                                   grant_to_status_frames(gt->max_grant_frames));
+        gt->status = xvzalloc_array(grant_status_t *,
+                                    grant_to_status_frames(gt->max_grant_frames));
         if ( !gt->status )
         {
             gt->status = ZERO_BLOCK_PTR;
@@ -1780,7 +1780,7 @@ status_alloc_failed:
     }
     if ( !nr_status_frames(gt) )
     {
-        xfree(gt->status);
+        xvfree(gt->status);
         gt->status = ZERO_BLOCK_PTR;
     }
     return -ENOMEM;
@@ -1852,7 +1852,7 @@ gnttab_unpopulate_status_frames(struct d
     gt->nr_status_frames = 0;
     for ( i = 0; i < n; i++ )
         free_xenheap_page(gt->status[i]);
-    xfree(gt->status);
+    xvfree(gt->status);
     gt->status = ZERO_BLOCK_PTR;
 
     return 0;
@@ -1966,21 +1966,22 @@ int grant_table_init(struct domain *d, i
     d->grant_table = gt;
 
     /* Active grant table. */
-    gt->active = xzalloc_array(struct active_grant_entry *,
-                               max_nr_active_grant_frames(gt));
+    gt->active = xvzalloc_array(struct active_grant_entry *,
+                                max_nr_active_grant_frames(gt));
     if ( gt->active == NULL )
         goto out;
 
     /* Tracking of mapped foreign frames table */
     if ( gt->max_maptrack_frames )
     {
-        gt->maptrack = vzalloc(gt->max_maptrack_frames * sizeof(*gt->maptrack));
+        gt->maptrack = xvzalloc_array(struct grant_mapping *,
+                                      gt->max_maptrack_frames);
         if ( gt->maptrack == NULL )
             goto out;
     }
 
     /* Shared grant table. */
-    gt->shared_raw = xzalloc_array(void *, gt->max_grant_frames);
+    gt->shared_raw = xvzalloc_array(void *, gt->max_grant_frames);
     if ( gt->shared_raw == NULL )
         goto out;
 
@@ -3868,19 +3869,19 @@ grant_table_destroy(
 
     for ( i = 0; i < nr_grant_frames(t); i++ )
         free_xenheap_page(t->shared_raw[i]);
-    xfree(t->shared_raw);
+    xvfree(t->shared_raw);
 
     for ( i = 0; i < nr_maptrack_frames(t); i++ )
         free_xenheap_page(t->maptrack[i]);
-    vfree(t->maptrack);
+    xvfree(t->maptrack);
 
     for ( i = 0; i < nr_active_grant_frames(t); i++ )
         free_xenheap_page(t->active[i]);
-    xfree(t->active);
+    xvfree(t->active);
 
     for ( i = 0; i < nr_status_frames(t); i++ )
         free_xenheap_page(t->status[i]);
-    xfree(t->status);
+    xvfree(t->status);
 
     xfree(t);
     d->grant_table = NULL;
--- a/xen/common/vmap.c
+++ b/xen/common/vmap.c
@@ -7,6 +7,7 @@
 #include <xen/spinlock.h>
 #include <xen/types.h>
 #include <xen/vmap.h>
+#include <xen/xvmalloc.h>
 #include <asm/page.h>
 
 static DEFINE_SPINLOCK(vm_lock);
@@ -301,11 +302,29 @@ void *vzalloc(size_t size)
     return p;
 }
 
-void vfree(void *va)
+static void _vfree(const void *va, unsigned int pages, enum vmap_region type)
 {
-    unsigned int i, pages;
+    unsigned int i;
     struct page_info *pg;
     PAGE_LIST_HEAD(pg_list);
+
+    ASSERT(pages);
+
+    for ( i = 0; i < pages; i++ )
+    {
+        pg = vmap_to_page(va + i * PAGE_SIZE);
+        ASSERT(pg);
+        page_list_add(pg, &pg_list);
+    }
+    vunmap(va);
+
+    while ( (pg = page_list_remove_head(&pg_list)) != NULL )
+        free_domheap_page(pg);
+}
+
+void vfree(void *va)
+{
+    unsigned int pages;
     enum vmap_region type = VMAP_DEFAULT;
 
     if ( !va )
@@ -317,18 +336,71 @@ void vfree(void *va)
         type = VMAP_XEN;
         pages = vm_size(va, type);
     }
-    ASSERT(pages);
 
-    for ( i = 0; i < pages; i++ )
+    _vfree(va, pages, type);
+}
+
+void xvfree(void *va)
+{
+    unsigned int pages = vm_size(va, VMAP_DEFAULT);
+
+    if ( pages )
+        _vfree(va, pages, VMAP_DEFAULT);
+    else
+        xfree(va);
+}
+
+void *_xvmalloc(size_t size, unsigned int align)
+{
+    ASSERT(align <= PAGE_SIZE);
+    return size <= PAGE_SIZE ? _xmalloc(size, align) : vmalloc(size);
+}
+
+void *_xvzalloc(size_t size, unsigned int align)
+{
+    ASSERT(align <= PAGE_SIZE);
+    return size <= PAGE_SIZE ? _xzalloc(size, align) : vzalloc(size);
+}
+
+void *_xvrealloc(void *va, size_t size, unsigned int align)
+{
+    size_t pages = vm_size(va, VMAP_DEFAULT);
+    void *ptr;
+
+    ASSERT(align <= PAGE_SIZE);
+
+    if ( !pages )
     {
-        struct page_info *page = vmap_to_page(va + i * PAGE_SIZE);
+        if ( size <= PAGE_SIZE )
+            return _xrealloc(va, size, align);
 
-        ASSERT(page);
-        page_list_add(page, &pg_list);
+        ptr = vmalloc(size);
+        if ( ptr && va && va != ZERO_BLOCK_PTR )
+        {
+            /*
+             * xmalloc-based allocations up to PAGE_SIZE don't cross page
+             * boundaries. Therefore, without needing to know the exact
+             * prior allocation size, simply copy the entire tail of the
+             * page containing the earlier allocation.
+             */
+            memcpy(ptr, va, PAGE_SIZE - PAGE_OFFSET(va));
+            xfree(va);
+        }
+    }
+    else if ( pages == PFN_UP(size) )
+        ptr = va;
+    else
+    {
+        ptr = _xvmalloc(size, align);
+        if ( ptr )
+        {
+            memcpy(ptr, va, min(size, pages << PAGE_SHIFT));
+            vfree(va);
+        }
+        else if ( pages > PFN_UP(size) )
+            ptr = va;
     }
-    vunmap(va);
 
-    while ( (pg = page_list_remove_head(&pg_list)) != NULL )
-        free_domheap_page(pg);
+    return ptr;
 }
 #endif
--- /dev/null
+++ b/xen/include/xen/xvmalloc.h
@@ -0,0 +1,73 @@
+
+#ifndef __XVMALLOC_H__
+#define __XVMALLOC_H__
+
+#include <xen/cache.h>
+#include <xen/types.h>
+
+/*
+ * Xen malloc/free-style interface for allocations possibly exceeding a page's
+ * worth of memory, as long as there's no need to have physically contiguous
+ * memory allocated.  These should be used in preference to xmalloc() et al
+ * whenever the size is not known to be constrained to at most a single page.
+ */
+
+/* Allocate space for typed object. */
+#define xvmalloc(_type) ((_type *)_xvmalloc(sizeof(_type), __alignof__(_type)))
+#define xvzalloc(_type) ((_type *)_xvzalloc(sizeof(_type), __alignof__(_type)))
+
+/* Allocate space for array of typed objects. */
+#define xvmalloc_array(_type, _num) \
+    ((_type *)_xvmalloc_array(sizeof(_type), __alignof__(_type), _num))
+#define xvzalloc_array(_type, _num) \
+    ((_type *)_xvzalloc_array(sizeof(_type), __alignof__(_type), _num))
+
+/* Allocate space for a structure with a flexible array of typed objects. */
+#define xvzalloc_flex_struct(type, field, nr) \
+    ((type *)_xvzalloc(offsetof(type, field[nr]), __alignof__(type)))
+
+#define xvmalloc_flex_struct(type, field, nr) \
+    ((type *)_xvmalloc(offsetof(type, field[nr]), __alignof__(type)))
+
+/* Re-allocate space for a structure with a flexible array of typed objects. */
+#define xvrealloc_flex_struct(ptr, field, nr)                          \
+    ((typeof(ptr))_xvrealloc(ptr, offsetof(typeof(*(ptr)), field[nr]), \
+                             __alignof__(typeof(*(ptr)))))
+
+/* Allocate untyped storage. */
+#define xvmalloc_bytes(_bytes) _xvmalloc(_bytes, SMP_CACHE_BYTES)
+#define xvzalloc_bytes(_bytes) _xvzalloc(_bytes, SMP_CACHE_BYTES)
+
+/* Free any of the above. */
+extern void xvfree(void *);
+
+/* Free an allocation, and zero the pointer to it. */
+#define XVFREE(p) do { \
+    xvfree(p);         \
+    (p) = NULL;        \
+} while ( false )
+
+/* Underlying functions */
+extern void *_xvmalloc(size_t size, unsigned int align);
+extern void *_xvzalloc(size_t size, unsigned int align);
+extern void *_xvrealloc(void *ptr, size_t size, unsigned int align);
+
+static inline void *_xvmalloc_array(
+    size_t size, unsigned int align, unsigned long num)
+{
+    /* Check for overflow. */
+    if ( size && num > UINT_MAX / size )
+        return NULL;
+    return _xvmalloc(size * num, align);
+}
+
+static inline void *_xvzalloc_array(
+    size_t size, unsigned int align, unsigned long num)
+{
+    /* Check for overflow. */
+    if ( size && num > UINT_MAX / size )
+        return NULL;
+    return _xvzalloc(size * num, align);
+}
+
+#endif /* __XVMALLOC_H__ */




* [PATCH v3 02/22] x86/xstate: use xvzalloc() for save area allocation
  2021-04-22 14:38 [PATCH v3 00/22] xvmalloc() / x86 xstate area / x86 CPUID / AMX+XFD Jan Beulich
  2021-04-22 14:43 ` [PATCH v3 01/22] mm: introduce xvmalloc() et al and use for grant table allocations Jan Beulich
@ 2021-04-22 14:44 ` Jan Beulich
  2021-05-05 13:29   ` Roger Pau Monné
  2021-04-22 14:44 ` [PATCH v3 03/22] x86/xstate: re-size save area when CPUID policy changes Jan Beulich
                   ` (19 subsequent siblings)
  21 siblings, 1 reply; 40+ messages in thread
From: Jan Beulich @ 2021-04-22 14:44 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, George Dunlap, Wei Liu, Roger Pau Monné

This is in preparation for the area size exceeding a page's worth of
space, as will happen with AMX as well as Architectural LBR.

Signed-off-by: Jan Beulich <jbeulich@suse.com>

--- a/xen/arch/x86/xstate.c
+++ b/xen/arch/x86/xstate.c
@@ -8,6 +8,7 @@
 #include <xen/param.h>
 #include <xen/percpu.h>
 #include <xen/sched.h>
+#include <xen/xvmalloc.h>
 #include <asm/current.h>
 #include <asm/processor.h>
 #include <asm/hvm/support.h>
@@ -522,7 +523,7 @@ int xstate_alloc_save_area(struct vcpu *
 
     /* XSAVE/XRSTOR requires the save area be 64-byte-boundary aligned. */
     BUILD_BUG_ON(__alignof(*save_area) < 64);
-    save_area = _xzalloc(size, __alignof(*save_area));
+    save_area = _xvzalloc(size, __alignof(*save_area));
     if ( save_area == NULL )
         return -ENOMEM;
 
@@ -543,8 +544,7 @@ int xstate_alloc_save_area(struct vcpu *
 
 void xstate_free_save_area(struct vcpu *v)
 {
-    xfree(v->arch.xsave_area);
-    v->arch.xsave_area = NULL;
+    XVFREE(v->arch.xsave_area);
 }
 
 static unsigned int _xstate_ctxt_size(u64 xcr0)




* [PATCH v3 03/22] x86/xstate: re-size save area when CPUID policy changes
  2021-04-22 14:38 [PATCH v3 00/22] xvmalloc() / x86 xstate area / x86 CPUID / AMX+XFD Jan Beulich
  2021-04-22 14:43 ` [PATCH v3 01/22] mm: introduce xvmalloc() et al and use for grant table allocations Jan Beulich
  2021-04-22 14:44 ` [PATCH v3 02/22] x86/xstate: use xvzalloc() for save area allocation Jan Beulich
@ 2021-04-22 14:44 ` Jan Beulich
  2021-05-03 13:57   ` Andrew Cooper
  2021-04-22 14:45 ` [PATCH v3 04/22] x86/xstate: re-use valid_xcr0() for boot-time checks Jan Beulich
                   ` (18 subsequent siblings)
  21 siblings, 1 reply; 40+ messages in thread
From: Jan Beulich @ 2021-04-22 14:44 UTC (permalink / raw)
  To: xen-devel
  Cc: Andrew Cooper, George Dunlap, Ian Jackson, Julien Grall,
	Stefano Stabellini, Wei Liu, Roger Pau Monné

vCPU-s get maximum-size areas allocated initially. Hidden (and in
particular default-off) features may allow a smaller area to suffice.

Suggested-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v2: Use 1ul instead of 1ull. Re-base.
---
This could be further shrunk if we used XSAVEC / if we really used
XSAVES, as then we don't need to also cover the holes. But since we
currently use neither of the two in reality, this would require more
work than just adding the alternative size calculation here.
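
Roughly, such an alternative (compacted format) calculation might look
like the sketch below - illustration only, re-using xstate_features,
xstate_align, and the xstate_size() accessor as they end up later in
this series:

    static unsigned int compacted_size(uint64_t xstates)
    {
        unsigned int i, size = XSTATE_AREA_MIN_SIZE;

        for ( i = 2; i < xstate_features; ++i )
        {
            if ( !(xstates & (1ul << i)) )
                continue;
            /* Components are packed; only 64-byte alignment gaps remain. */
            if ( test_bit(i, &xstate_align) )
                size = ROUNDUP(size, 64);
            size += xstate_size(i);
        }

        return size;
    }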

Seeing that both vcpu_init_fpu() and cpuid_policy_updated() get called
from arch_vcpu_create(), I'm not sure we really need this two-stage
approach - the slightly longer period of time during which
v->arch.xsave_area would remain NULL doesn't look all that problematic.
But since xstate_alloc_save_area() gets called for idle vCPU-s, it has
to stay anyway in some form, so the extra code churn may not be worth
it.

Instead of cpuid_policy_xcr0_max(), cpuid_policy_xstates() may be the
interface to use here. But it remains to be determined whether the
xcr0_accum field is meant to be inclusive of XSS (in which case it would
better be renamed) or exclusive. Right now there's no difference as we
don't support any XSS-controlled features.

--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -281,7 +281,21 @@ void update_guest_memory_policy(struct v
     }
 }
 
-void domain_cpu_policy_changed(struct domain *d)
+/*
+ * Called during vcpu construction, and each time the toolstack changes the
+ * CPUID configuration for the domain.
+ */
+static int __must_check cpuid_policy_updated(struct vcpu *v)
+{
+    int rc = xstate_update_save_area(v);
+
+    if ( !rc && is_hvm_vcpu(v) )
+        hvm_cpuid_policy_changed(v);
+
+    return rc;
+}
+
+int domain_cpu_policy_changed(struct domain *d)
 {
     const struct cpuid_policy *p = d->arch.cpuid;
     struct vcpu *v;
@@ -439,13 +453,18 @@ void domain_cpu_policy_changed(struct do
 
     for_each_vcpu ( d, v )
     {
-        cpuid_policy_updated(v);
+        int rc = cpuid_policy_updated(v);
+
+        if ( rc )
+            return rc;
 
         /* If PMU version is zero then the guest doesn't have VPMU */
         if ( boot_cpu_data.x86_vendor == X86_VENDOR_INTEL &&
              p->basic.pmu_version == 0 )
             vpmu_destroy(v);
     }
+
+    return 0;
 }
 
 #ifndef CONFIG_BIGMEM
@@ -584,7 +603,7 @@ int arch_vcpu_create(struct vcpu *v)
     {
         vpmu_initialise(v);
 
-        cpuid_policy_updated(v);
+        rc = cpuid_policy_updated(v);
     }
 
     return rc;
@@ -859,11 +878,11 @@ int arch_domain_create(struct domain *d,
      */
     d->arch.x87_fip_width = cpu_has_fpu_sel ? 0 : 8;
 
-    domain_cpu_policy_changed(d);
-
     d->arch.msr_relaxed = config->arch.misc_flags & XEN_X86_MSR_RELAXED;
 
-    return 0;
+    rc = domain_cpu_policy_changed(d);
+    if ( !rc )
+        return 0;
 
  fail:
     d->is_dying = DOMDYING_dead;
@@ -2471,16 +2490,6 @@ int domain_relinquish_resources(struct d
     return 0;
 }
 
-/*
- * Called during vcpu construction, and each time the toolstack changes the
- * CPUID configuration for the domain.
- */
-void cpuid_policy_updated(struct vcpu *v)
-{
-    if ( is_hvm_vcpu(v) )
-        hvm_cpuid_policy_changed(v);
-}
-
 void arch_dump_domain_info(struct domain *d)
 {
     paging_dump_domain_info(d);
--- a/xen/arch/x86/domctl.c
+++ b/xen/arch/x86/domctl.c
@@ -89,7 +89,7 @@ static int update_domain_cpu_policy(stru
     recalculate_cpuid_policy(d);
 
     /* Recalculate relevant dom/vcpu state now the policy has changed. */
-    domain_cpu_policy_changed(d);
+    ret = domain_cpu_policy_changed(d);
 
  out:
     /* Free whichever cpuid/msr structs are not installed in struct domain. */
--- a/xen/arch/x86/xstate.c
+++ b/xen/arch/x86/xstate.c
@@ -542,6 +542,41 @@ int xstate_alloc_save_area(struct vcpu *
     return 0;
 }
 
+int xstate_update_save_area(struct vcpu *v)
+{
+    unsigned int i, size, old;
+    struct xsave_struct *save_area;
+    uint64_t xcr0_max = cpuid_policy_xcr0_max(v->domain->arch.cpuid);
+
+    ASSERT(!is_idle_vcpu(v));
+
+    if ( !cpu_has_xsave )
+        return 0;
+
+    if ( v->arch.xcr0_accum & ~xcr0_max )
+        return -EBUSY;
+
+    for ( size = old = XSTATE_AREA_MIN_SIZE, i = 2; i < xstate_features; ++i )
+    {
+        if ( xcr0_max & (1ul << i) )
+            size = max(size, xstate_offsets[i] + xstate_sizes[i]);
+        if ( v->arch.xcr0_accum & (1ul << i) )
+            old = max(old, xstate_offsets[i] + xstate_sizes[i]);
+    }
+
+    save_area = _xvrealloc(v->arch.xsave_area, size, __alignof(*save_area));
+    if ( !save_area )
+        return -ENOMEM;
+
+    ASSERT(old <= size);
+    memset((void *)save_area + old, 0, size - old);
+
+    v->arch.xsave_area = save_area;
+    v->arch.fpu_ctxt = &v->arch.xsave_area->fpu_sse;
+
+    return 0;
+}
+
 void xstate_free_save_area(struct vcpu *v)
 {
     XVFREE(v->arch.xsave_area);
--- a/xen/include/asm-x86/domain.h
+++ b/xen/include/asm-x86/domain.h
@@ -78,8 +78,6 @@ void toggle_guest_mode(struct vcpu *);
 /* x86/64: toggle guest page tables between kernel and user modes. */
 void toggle_guest_pt(struct vcpu *);
 
-void cpuid_policy_updated(struct vcpu *v);
-
 /*
  * Initialise a hypercall-transfer page. The given pointer must be mapped
  * in Xen virtual address space (accesses are not validated or checked).
@@ -670,7 +668,7 @@ struct guest_memory_policy
 void update_guest_memory_policy(struct vcpu *v,
                                 struct guest_memory_policy *policy);
 
-void domain_cpu_policy_changed(struct domain *d);
+int __must_check domain_cpu_policy_changed(struct domain *d);
 
 bool update_runstate_area(struct vcpu *);
 bool update_secondary_system_time(struct vcpu *,
--- a/xen/include/asm-x86/xstate.h
+++ b/xen/include/asm-x86/xstate.h
@@ -106,6 +106,7 @@ void compress_xsave_states(struct vcpu *
 /* extended state init and cleanup functions */
 void xstate_free_save_area(struct vcpu *v);
 int xstate_alloc_save_area(struct vcpu *v);
+int xstate_update_save_area(struct vcpu *v);
 void xstate_init(struct cpuinfo_x86 *c);
 unsigned int xstate_ctxt_size(u64 xcr0);
 




* [PATCH v3 04/22] x86/xstate: re-use valid_xcr0() for boot-time checks
  2021-04-22 14:38 [PATCH v3 00/22] xvmalloc() / x86 xstate area / x86 CPUID / AMX+XFD Jan Beulich
                   ` (2 preceding siblings ...)
  2021-04-22 14:44 ` [PATCH v3 03/22] x86/xstate: re-size save area when CPUID policy changes Jan Beulich
@ 2021-04-22 14:45 ` Jan Beulich
  2021-05-03 11:53   ` Andrew Cooper
  2021-04-22 14:45 ` [PATCH v3 05/22] x86/xstate: drop xstate_offsets[] and xstate_sizes[] Jan Beulich
                   ` (17 subsequent siblings)
  21 siblings, 1 reply; 40+ messages in thread
From: Jan Beulich @ 2021-04-22 14:45 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, George Dunlap, Wei Liu, Roger Pau Monné

Instead of (just partially) open-coding it, re-use the function after
suitably moving it up.

Signed-off-by: Jan Beulich <jbeulich@suse.com>

--- a/xen/arch/x86/xstate.c
+++ b/xen/arch/x86/xstate.c
@@ -609,6 +609,34 @@ unsigned int xstate_ctxt_size(u64 xcr0)
     return _xstate_ctxt_size(xcr0);
 }
 
+static bool valid_xcr0(uint64_t xcr0)
+{
+    /* FP must be unconditionally set. */
+    if ( !(xcr0 & X86_XCR0_FP) )
+        return false;
+
+    /* YMM depends on SSE. */
+    if ( (xcr0 & X86_XCR0_YMM) && !(xcr0 & X86_XCR0_SSE) )
+        return false;
+
+    if ( xcr0 & (X86_XCR0_OPMASK | X86_XCR0_ZMM | X86_XCR0_HI_ZMM) )
+    {
+        /* OPMASK, ZMM, and HI_ZMM require YMM. */
+        if ( !(xcr0 & X86_XCR0_YMM) )
+            return false;
+
+        /* OPMASK, ZMM, and HI_ZMM must be the same. */
+        if ( ~xcr0 & (X86_XCR0_OPMASK | X86_XCR0_ZMM | X86_XCR0_HI_ZMM) )
+            return false;
+    }
+
+    /* BNDREGS and BNDCSR must be the same. */
+    if ( !(xcr0 & X86_XCR0_BNDREGS) != !(xcr0 & X86_XCR0_BNDCSR) )
+        return false;
+
+    return true;
+}
+
 /* Collect the information of processor's extended state */
 void xstate_init(struct cpuinfo_x86 *c)
 {
@@ -644,10 +672,9 @@ void xstate_init(struct cpuinfo_x86 *c)
     }
 
     cpuid_count(XSTATE_CPUID, 0, &eax, &ebx, &ecx, &edx);
-
-    BUG_ON((eax & XSTATE_FP_SSE) != XSTATE_FP_SSE);
-    BUG_ON((eax & X86_XCR0_YMM) && !(eax & X86_XCR0_SSE));
     feature_mask = (((u64)edx << 32) | eax) & XCNTXT_MASK;
+    BUG_ON(!valid_xcr0(feature_mask));
+    BUG_ON(!(feature_mask & X86_XCR0_SSE));
 
     /*
      * Set CR4_OSXSAVE and run "cpuid" to get xsave_cntxt_size.
@@ -677,31 +704,6 @@ void xstate_init(struct cpuinfo_x86 *c)
         BUG();
 }
 
-static bool valid_xcr0(u64 xcr0)
-{
-    /* FP must be unconditionally set. */
-    if ( !(xcr0 & X86_XCR0_FP) )
-        return false;
-
-    /* YMM depends on SSE. */
-    if ( (xcr0 & X86_XCR0_YMM) && !(xcr0 & X86_XCR0_SSE) )
-        return false;
-
-    if ( xcr0 & (X86_XCR0_OPMASK | X86_XCR0_ZMM | X86_XCR0_HI_ZMM) )
-    {
-        /* OPMASK, ZMM, and HI_ZMM require YMM. */
-        if ( !(xcr0 & X86_XCR0_YMM) )
-            return false;
-
-        /* OPMASK, ZMM, and HI_ZMM must be the same. */
-        if ( ~xcr0 & (X86_XCR0_OPMASK | X86_XCR0_ZMM | X86_XCR0_HI_ZMM) )
-            return false;
-    }
-
-    /* BNDREGS and BNDCSR must be the same. */
-    return !(xcr0 & X86_XCR0_BNDREGS) == !(xcr0 & X86_XCR0_BNDCSR);
-}
-
 int validate_xstate(const struct domain *d, uint64_t xcr0, uint64_t xcr0_accum,
                     const struct xsave_hdr *hdr)
 {




* [PATCH v3 05/22] x86/xstate: drop xstate_offsets[] and xstate_sizes[]
  2021-04-22 14:38 [PATCH v3 00/22] xvmalloc() / x86 xstate area / x86 CPUID / AMX+XFD Jan Beulich
                   ` (3 preceding siblings ...)
  2021-04-22 14:45 ` [PATCH v3 04/22] x86/xstate: re-use valid_xcr0() for boot-time checks Jan Beulich
@ 2021-04-22 14:45 ` Jan Beulich
  2021-05-03 16:10   ` Andrew Cooper
  2021-04-22 14:46 ` [PATCH v3 06/22] x86/xstate: replace xsave_cntxt_size and drop XCNTXT_MASK Jan Beulich
                   ` (16 subsequent siblings)
  21 siblings, 1 reply; 40+ messages in thread
From: Jan Beulich @ 2021-04-22 14:45 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, George Dunlap, Wei Liu, Roger Pau Monné

They're redundant with the respective fields of the raw CPUID policy;
there's no need to keep two copies of the same data. This also breaks
recalculate_xstate()'s dependency on xstate_init(), allowing host CPUID
policy calculation to be moved together with that of the raw one (which
a subsequent change will require anyway).

Signed-off-by: Jan Beulich <jbeulich@suse.com>

--- a/xen/arch/x86/cpu/common.c
+++ b/xen/arch/x86/cpu/common.c
@@ -498,6 +498,8 @@ void identify_cpu(struct cpuinfo_x86 *c)
 	}
 
 	/* Now the feature flags better reflect actual CPU features! */
+	if (c == &boot_cpu_data)
+		init_host_cpuid();
 
 	xstate_init(c);
 
--- a/xen/arch/x86/cpuid.c
+++ b/xen/arch/x86/cpuid.c
@@ -170,32 +170,32 @@ static void recalculate_xstate(struct cp
     {
         xstates |= X86_XCR0_YMM;
         xstate_size = max(xstate_size,
-                          xstate_offsets[X86_XCR0_YMM_POS] +
-                          xstate_sizes[X86_XCR0_YMM_POS]);
+                          xstate_offset(X86_XCR0_YMM_POS) +
+                          xstate_size(X86_XCR0_YMM_POS));
     }
 
     if ( p->feat.mpx )
     {
         xstates |= X86_XCR0_BNDREGS | X86_XCR0_BNDCSR;
         xstate_size = max(xstate_size,
-                          xstate_offsets[X86_XCR0_BNDCSR_POS] +
-                          xstate_sizes[X86_XCR0_BNDCSR_POS]);
+                          xstate_offset(X86_XCR0_BNDCSR_POS) +
+                          xstate_size(X86_XCR0_BNDCSR_POS));
     }
 
     if ( p->feat.avx512f )
     {
         xstates |= X86_XCR0_OPMASK | X86_XCR0_ZMM | X86_XCR0_HI_ZMM;
         xstate_size = max(xstate_size,
-                          xstate_offsets[X86_XCR0_HI_ZMM_POS] +
-                          xstate_sizes[X86_XCR0_HI_ZMM_POS]);
+                          xstate_offset(X86_XCR0_HI_ZMM_POS) +
+                          xstate_size(X86_XCR0_HI_ZMM_POS));
     }
 
     if ( p->feat.pku )
     {
         xstates |= X86_XCR0_PKRU;
         xstate_size = max(xstate_size,
-                          xstate_offsets[X86_XCR0_PKRU_POS] +
-                          xstate_sizes[X86_XCR0_PKRU_POS]);
+                          xstate_offset(X86_XCR0_PKRU_POS) +
+                          xstate_size(X86_XCR0_PKRU_POS));
     }
 
     p->xstate.max_size  =  xstate_size;
@@ -218,8 +218,8 @@ static void recalculate_xstate(struct cp
         if ( !(xstates & curr_xstate) )
             continue;
 
-        p->xstate.comp[i].size   = xstate_sizes[i];
-        p->xstate.comp[i].offset = xstate_offsets[i];
+        p->xstate.comp[i].size   = xstate_size(i);
+        p->xstate.comp[i].offset = xstate_offset(i);
         p->xstate.comp[i].xss    = curr_xstate & XSTATE_XSAVES_ONLY;
         p->xstate.comp[i].align  = curr_xstate & xstate_align;
     }
@@ -531,10 +531,16 @@ static void __init calculate_hvm_def_pol
     x86_cpuid_policy_shrink_max_leaves(p);
 }
 
-void __init init_guest_cpuid(void)
+void __init init_host_cpuid(void)
 {
     calculate_raw_policy();
     calculate_host_policy();
+}
+
+void __init init_guest_cpuid(void)
+{
+    /* Do this a 2nd time to account for setup_{clear,force}_cpu_cap() uses. */
+    calculate_host_policy();
 
     if ( IS_ENABLED(CONFIG_PV) )
     {
--- a/xen/arch/x86/xstate.c
+++ b/xen/arch/x86/xstate.c
@@ -9,6 +9,7 @@
 #include <xen/percpu.h>
 #include <xen/sched.h>
 #include <xen/xvmalloc.h>
+#include <asm/cpuid.h>
 #include <asm/current.h>
 #include <asm/processor.h>
 #include <asm/hvm/support.h>
@@ -26,8 +27,6 @@ static u32 __read_mostly xsave_cntxt_siz
 /* A 64-bit bitmask of the XSAVE/XRSTOR features supported by processor. */
 u64 __read_mostly xfeature_mask;
 
-unsigned int *__read_mostly xstate_offsets;
-unsigned int *__read_mostly xstate_sizes;
 u64 __read_mostly xstate_align;
 static unsigned int __read_mostly xstate_features;
 
@@ -93,34 +92,19 @@ static int setup_xstate_features(bool bs
     unsigned int leaf, eax, ebx, ecx, edx;
 
     if ( bsp )
-    {
         xstate_features = flsl(xfeature_mask);
-        xstate_offsets = xzalloc_array(unsigned int, xstate_features);
-        if ( !xstate_offsets )
-            return -ENOMEM;
-
-        xstate_sizes = xzalloc_array(unsigned int, xstate_features);
-        if ( !xstate_sizes )
-            return -ENOMEM;
-    }
 
     for ( leaf = 2; leaf < xstate_features; leaf++ )
     {
-        if ( bsp )
-        {
-            cpuid_count(XSTATE_CPUID, leaf, &xstate_sizes[leaf],
-                        &xstate_offsets[leaf], &ecx, &edx);
-            if ( ecx & XSTATE_ALIGN64 )
-                __set_bit(leaf, &xstate_align);
-        }
+        cpuid_count(XSTATE_CPUID, leaf, &eax,
+                    &ebx, &ecx, &edx);
+        BUG_ON(eax != xstate_size(leaf));
+        BUG_ON(ebx != xstate_offset(leaf));
+
+        if ( bsp && (ecx & XSTATE_ALIGN64) )
+            __set_bit(leaf, &xstate_align);
         else
-        {
-            cpuid_count(XSTATE_CPUID, leaf, &eax,
-                        &ebx, &ecx, &edx);
-            BUG_ON(eax != xstate_sizes[leaf]);
-            BUG_ON(ebx != xstate_offsets[leaf]);
             BUG_ON(!(ecx & XSTATE_ALIGN64) != !test_bit(leaf, &xstate_align));
-        }
     }
 
     return 0;
@@ -150,7 +134,7 @@ static void setup_xstate_comp(uint16_t *
             if ( test_bit(i, &xstate_align) )
                 offset = ROUNDUP(offset, 64);
             comp_offsets[i] = offset;
-            offset += xstate_sizes[i];
+            offset += xstate_size(i);
         }
     }
     ASSERT(offset <= xsave_cntxt_size);
@@ -213,10 +197,10 @@ void expand_xsave_states(struct vcpu *v,
          * comp_offsets[] information, something is very broken.
          */
         BUG_ON(!comp_offsets[index]);
-        BUG_ON((xstate_offsets[index] + xstate_sizes[index]) > size);
+        BUG_ON((xstate_offset(index) + xstate_size(index)) > size);
 
-        memcpy(dest + xstate_offsets[index], src + comp_offsets[index],
-               xstate_sizes[index]);
+        memcpy(dest + xstate_offset(index), src + comp_offsets[index],
+               xstate_size(index));
 
         valid &= ~feature;
     }
@@ -279,10 +263,10 @@ void compress_xsave_states(struct vcpu *
          * comp_offset[] information, something is very broken.
          */
         BUG_ON(!comp_offsets[index]);
-        BUG_ON((xstate_offsets[index] + xstate_sizes[index]) > size);
+        BUG_ON((xstate_offset(index) + xstate_size(index)) > size);
 
-        memcpy(dest + comp_offsets[index], src + xstate_offsets[index],
-               xstate_sizes[index]);
+        memcpy(dest + comp_offsets[index], src + xstate_offset(index),
+               xstate_size(index));
 
         valid &= ~feature;
     }
@@ -516,8 +500,8 @@ int xstate_alloc_save_area(struct vcpu *
         unsigned int i;
 
         for ( size = 0, i = 2; i < xstate_features; ++i )
-            if ( size < xstate_sizes[i] )
-                size = xstate_sizes[i];
+            if ( size < xstate_size(i) )
+                size = xstate_size(i);
         size += XSTATE_AREA_MIN_SIZE;
     }
 
@@ -559,9 +543,9 @@ int xstate_update_save_area(struct vcpu
     for ( size = old = XSTATE_AREA_MIN_SIZE, i = 2; i < xstate_features; ++i )
     {
         if ( xcr0_max & (1ul << i) )
-            size = max(size, xstate_offsets[i] + xstate_sizes[i]);
+            size = max(size, xstate_offset(i) + xstate_size(i));
         if ( v->arch.xcr0_accum & (1ul << i) )
-            old = max(old, xstate_offsets[i] + xstate_sizes[i]);
+            old = max(old, xstate_offset(i) + xstate_size(i));
     }
 
     save_area = _xvrealloc(v->arch.xsave_area, size, __alignof(*save_area));
@@ -819,7 +803,7 @@ uint64_t read_bndcfgu(void)
               : "=m" (*xstate)
               : "a" (X86_XCR0_BNDCSR), "d" (0), "D" (xstate) );
 
-        bndcsr = (void *)xstate + xstate_offsets[X86_XCR0_BNDCSR_POS];
+        bndcsr = (void *)xstate + xstate_offset(X86_XCR0_BNDCSR_POS);
     }
 
     if ( cr0 & X86_CR0_TS )
--- a/xen/include/asm-x86/cpuid.h
+++ b/xen/include/asm-x86/cpuid.h
@@ -16,6 +16,7 @@
 extern const uint32_t known_features[FSCAPINTS];
 extern const uint32_t special_features[FSCAPINTS];
 
+void init_host_cpuid(void);
 void init_guest_cpuid(void);
 
 /*
--- a/xen/include/asm-x86/xstate.h
+++ b/xen/include/asm-x86/xstate.h
@@ -44,8 +44,9 @@ extern uint32_t mxcsr_mask;
 
 extern u64 xfeature_mask;
 extern u64 xstate_align;
-extern unsigned int *xstate_offsets;
-extern unsigned int *xstate_sizes;
+
+#define xstate_offset(n) (raw_cpuid_policy.xstate.comp[n].offset)
+#define xstate_size(n)   (raw_cpuid_policy.xstate.comp[n].size)
 
 /* extended state save area */
 struct __attribute__((aligned (64))) xsave_struct




* [PATCH v3 06/22] x86/xstate: replace xsave_cntxt_size and drop XCNTXT_MASK
  2021-04-22 14:38 [PATCH v3 00/22] xvmalloc() / x86 xstate area / x86 CPUID / AMX+XFD Jan Beulich
                   ` (4 preceding siblings ...)
  2021-04-22 14:45 ` [PATCH v3 05/22] x86/xstate: drop xstate_offsets[] and xstate_sizes[] Jan Beulich
@ 2021-04-22 14:46 ` Jan Beulich
  2021-04-22 14:47 ` [PATCH v3 07/22] x86/xstate: avoid accounting for unsupported components Jan Beulich
                   ` (15 subsequent siblings)
  21 siblings, 0 replies; 40+ messages in thread
From: Jan Beulich @ 2021-04-22 14:46 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, George Dunlap, Wei Liu, Roger Pau Monné

XCNTXT_MASK is effectively embedded in recalculate_xstate(), and
xsave_cntxt_size is redundant with the host CPUID policy's
xstate.max_size field.

Use the host CPUID policy as input (requiring it to be calculated
earlier), thus allowing e.g. "cpuid=no-avx512f" to also avoid
allocating space for ZMM and mask register state.

Also drop a stale part of an adjacent comment.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
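
As a side note (illustration from me, not part of the patch): the
trailing "| 0" in the new #define is what keeps the identifier from
being usable as an lvalue, e.g.:

    #define size_ro (cfg.size | 0)   /* hypothetical example */

    len = size_ro;       /* reads compile as before */
    /* size_ro = 16; */  /* an assignment would now fail to compile */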

--- a/xen/arch/x86/xstate.c
+++ b/xen/arch/x86/xstate.c
@@ -20,9 +20,10 @@
 /*
  * Maximum size (in byte) of the XSAVE/XRSTOR save area required by all
  * the supported and enabled features on the processor, including the
- * XSAVE.HEADER. We only enable XCNTXT_MASK that we have known.
+ * XSAVE.HEADER. We only enable cpuid_policy_xcr0_max(&host_cpuid_policy).
+ * Note that this identifier should not be usable as an lvalue.
  */
-static u32 __read_mostly xsave_cntxt_size;
+#define xsave_cntxt_size (host_cpuid_policy.xstate.max_size | 0)
 
 /* A 64-bit bitmask of the XSAVE/XRSTOR features supported by processor. */
 u64 __read_mostly xfeature_mask;
@@ -575,8 +576,23 @@ static unsigned int _xstate_ctxt_size(u6
     ASSERT(ok);
     cpuid_count(XSTATE_CPUID, 0, &eax, &ebx, &ecx, &edx);
     ASSERT(ebx <= ecx);
-    ok = set_xcr0(act_xcr0);
-    ASSERT(ok);
+
+    /*
+     * When called the very first time from xstate_init(), act_xcr0 (as read
+     * from per-CPU data) is still zero. xstate_init() wants this function to
+     * leave xfeature_mask in place, so avoid restoration in this case (which
+     * would fail anyway).
+     */
+    if ( act_xcr0 )
+    {
+        ok = set_xcr0(act_xcr0);
+        ASSERT(ok);
+    }
+    else
+    {
+        BUG_ON(!ok);
+        ASSERT(xcr0 == xfeature_mask);
+    }
 
     return ebx;
 }
@@ -648,42 +664,35 @@ void xstate_init(struct cpuinfo_x86 *c)
         return;
 
     if ( (bsp && !use_xsave) ||
-         boot_cpu_data.cpuid_level < XSTATE_CPUID )
+         c->cpuid_level < XSTATE_CPUID )
     {
         BUG_ON(!bsp);
         setup_clear_cpu_cap(X86_FEATURE_XSAVE);
         return;
     }
 
-    cpuid_count(XSTATE_CPUID, 0, &eax, &ebx, &ecx, &edx);
-    feature_mask = (((u64)edx << 32) | eax) & XCNTXT_MASK;
-    BUG_ON(!valid_xcr0(feature_mask));
-    BUG_ON(!(feature_mask & X86_XCR0_SSE));
-
-    /*
-     * Set CR4_OSXSAVE and run "cpuid" to get xsave_cntxt_size.
-     */
-    set_in_cr4(X86_CR4_OSXSAVE);
-    if ( !set_xcr0(feature_mask) )
-        BUG();
-
     if ( bsp )
     {
+        feature_mask = cpuid_policy_xcr0_max(&host_cpuid_policy);
+        BUG_ON(!valid_xcr0(feature_mask));
+        BUG_ON(!(feature_mask & X86_XCR0_SSE));
+
         xfeature_mask = feature_mask;
-        /*
-         * xsave_cntxt_size is the max size required by enabled features.
-         * We know FP/SSE and YMM about eax, and nothing about edx at present.
-         */
-        xsave_cntxt_size = _xstate_ctxt_size(feature_mask);
+        /* xsave_cntxt_size is the max size required by enabled features. */
         printk("xstate: size: %#x and states: %#"PRIx64"\n",
-               xsave_cntxt_size, xfeature_mask);
-    }
-    else
-    {
-        BUG_ON(xfeature_mask != feature_mask);
-        BUG_ON(xsave_cntxt_size != _xstate_ctxt_size(feature_mask));
+               xsave_cntxt_size, feature_mask);
+
+        set_in_cr4(X86_CR4_OSXSAVE);
     }
 
+    cpuid_count(XSTATE_CPUID, 0, &eax, &ebx, &ecx, &edx);
+    feature_mask = (((uint64_t)edx << 32) | eax) & xfeature_mask;
+    BUG_ON(xfeature_mask != feature_mask);
+
+    /* This has the side effect of set_xcr0(feature_mask). */
+    if ( xsave_cntxt_size != _xstate_ctxt_size(feature_mask) )
+        BUG();
+
     if ( setup_xstate_features(bsp) && bsp )
         BUG();
 }
--- a/xen/include/asm-x86/xstate.h
+++ b/xen/include/asm-x86/xstate.h
@@ -30,9 +30,6 @@ extern uint32_t mxcsr_mask;
 #define XSTATE_AREA_MIN_SIZE      (FXSAVE_SIZE + XSAVE_HDR_SIZE)
 
 #define XSTATE_FP_SSE  (X86_XCR0_FP | X86_XCR0_SSE)
-#define XCNTXT_MASK    (X86_XCR0_FP | X86_XCR0_SSE | X86_XCR0_YMM | \
-                        X86_XCR0_OPMASK | X86_XCR0_ZMM | X86_XCR0_HI_ZMM | \
-                        XSTATE_NONLAZY)
 
 #define XSTATE_ALL     (~(1ULL << 63))
 #define XSTATE_NONLAZY (X86_XCR0_BNDREGS | X86_XCR0_BNDCSR | X86_XCR0_PKRU)




* [PATCH v3 07/22] x86/xstate: avoid accounting for unsupported components
  2021-04-22 14:38 [PATCH v3 00/22] xvmalloc() / x86 xstate area / x86 CPUID / AMX+XFD Jan Beulich
                   ` (5 preceding siblings ...)
  2021-04-22 14:46 ` [PATCH v3 06/22] x86/xstate: replace xsave_cntxt_size and drop XCNTXT_MASK Jan Beulich
@ 2021-04-22 14:47 ` Jan Beulich
  2021-04-22 14:47 ` [PATCH v3 08/22] x86: use xvmalloc() for extended context buffer allocations Jan Beulich
                   ` (14 subsequent siblings)
  21 siblings, 0 replies; 40+ messages in thread
From: Jan Beulich @ 2021-04-22 14:47 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, George Dunlap, Wei Liu, Roger Pau Monné

There's no point in including unsupported components in the size
calculations of xstate_{alloc,update}_save_area().

Signed-off-by: Jan Beulich <jbeulich@suse.com>

--- a/xen/arch/x86/xstate.c
+++ b/xen/arch/x86/xstate.c
@@ -501,8 +501,12 @@ int xstate_alloc_save_area(struct vcpu *
         unsigned int i;
 
         for ( size = 0, i = 2; i < xstate_features; ++i )
+        {
+            if ( !(xfeature_mask & (1ul << i)) )
+                continue;
             if ( size < xstate_size(i) )
                 size = xstate_size(i);
+        }
         size += XSTATE_AREA_MIN_SIZE;
     }
 
@@ -543,6 +547,8 @@ int xstate_update_save_area(struct vcpu
 
     for ( size = old = XSTATE_AREA_MIN_SIZE, i = 2; i < xstate_features; ++i )
     {
+        if ( !(xfeature_mask & (1ul << i)) )
+            continue;
         if ( xcr0_max & (1ul << i) )
             size = max(size, xstate_offset(i) + xstate_size(i));
         if ( v->arch.xcr0_accum & (1ul << i) )




* [PATCH v3 08/22] x86: use xvmalloc() for extended context buffer allocations
  2021-04-22 14:38 [PATCH v3 00/22] xvmalloc() / x86 xstate area / x86 CPUID / AMX+XFD Jan Beulich
                   ` (6 preceding siblings ...)
  2021-04-22 14:47 ` [PATCH v3 07/22] x86/xstate: avoid accounting for unsupported components Jan Beulich
@ 2021-04-22 14:47 ` Jan Beulich
  2021-04-22 14:48 ` [PATCH v3 09/22] x86/xstate: enable AMX components Jan Beulich
                   ` (13 subsequent siblings)
  21 siblings, 0 replies; 40+ messages in thread
From: Jan Beulich @ 2021-04-22 14:47 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, George Dunlap, Wei Liu, Roger Pau Monné

This is in preparation for the buffer sizes exceeding a page's worth of
space, as will happen with AMX as well as Architectural LBR.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v2: New.

--- a/xen/arch/x86/domctl.c
+++ b/xen/arch/x86/domctl.c
@@ -30,6 +30,7 @@
 #include <xsm/xsm.h>
 #include <xen/iommu.h>
 #include <xen/vm_event.h>
+#include <xen/xvmalloc.h>
 #include <public/vm_event.h>
 #include <asm/mem_sharing.h>
 #include <asm/xstate.h>
@@ -380,7 +381,7 @@ long arch_do_domctl(
             goto sethvmcontext_out;
 
         ret = -ENOMEM;
-        if ( (c.data = xmalloc_bytes(c.size)) == NULL )
+        if ( (c.data = xvmalloc_bytes(c.size)) == NULL )
             goto sethvmcontext_out;
 
         ret = -EFAULT;
@@ -392,7 +393,7 @@ long arch_do_domctl(
         domain_unpause(d);
 
     sethvmcontext_out:
-        xfree(c.data);
+        xvfree(c.data);
         break;
     }
 
@@ -422,7 +423,7 @@ long arch_do_domctl(
 
         /* Allocate our own marshalling buffer */
         ret = -ENOMEM;
-        if ( (c.data = xmalloc_bytes(c.size)) == NULL )
+        if ( (c.data = xvmalloc_bytes(c.size)) == NULL )
             goto gethvmcontext_out;
 
         domain_pause(d);
@@ -435,7 +436,7 @@ long arch_do_domctl(
 
     gethvmcontext_out:
         copyback = true;
-        xfree(c.data);
+        xvfree(c.data);
         break;
     }
 
@@ -953,7 +954,7 @@ long arch_do_domctl(
             if ( !ret && size > PV_XSAVE_HDR_SIZE )
             {
                 unsigned int xsave_size = size - PV_XSAVE_HDR_SIZE;
-                void *xsave_area = xmalloc_bytes(xsave_size);
+                void *xsave_area = xvmalloc_bytes(xsave_size);
 
                 if ( !xsave_area )
                 {
@@ -967,7 +968,7 @@ long arch_do_domctl(
                 if ( copy_to_guest_offset(evc->buffer, offset, xsave_area,
                                           xsave_size) )
                      ret = -EFAULT;
-                xfree(xsave_area);
+                xvfree(xsave_area);
            }
 
             vcpu_unpause(v);
@@ -987,7 +988,7 @@ long arch_do_domctl(
                  evc->size > PV_XSAVE_SIZE(xfeature_mask) )
                 goto vcpuextstate_out;
 
-            receive_buf = xmalloc_bytes(evc->size);
+            receive_buf = xvmalloc_bytes(evc->size);
             if ( !receive_buf )
             {
                 ret = -ENOMEM;
@@ -997,7 +998,7 @@ long arch_do_domctl(
                                         offset, evc->size) )
             {
                 ret = -EFAULT;
-                xfree(receive_buf);
+                xvfree(receive_buf);
                 goto vcpuextstate_out;
             }
 
@@ -1015,7 +1016,7 @@ long arch_do_domctl(
                 ret = 0;
             if ( ret )
             {
-                xfree(receive_buf);
+                xvfree(receive_buf);
                 goto vcpuextstate_out;
             }
 
@@ -1043,7 +1044,7 @@ long arch_do_domctl(
                 vcpu_unpause(v);
             }
 
-            xfree(receive_buf);
+            xvfree(receive_buf);
         }
 
 #undef PV_XSAVE_HDR_SIZE
--- a/xen/arch/x86/hvm/save.c
+++ b/xen/arch/x86/hvm/save.c
@@ -23,6 +23,7 @@
 #include <xen/guest_access.h>
 #include <xen/softirq.h>
 #include <xen/version.h>
+#include <xen/xvmalloc.h>
 
 #include <asm/hvm/support.h>
 
@@ -154,7 +155,7 @@ int hvm_save_one(struct domain *d, unsig
     else
         v = d->vcpu[instance];
     ctxt.size = hvm_sr_handlers[typecode].size;
-    ctxt.data = xmalloc_bytes(ctxt.size);
+    ctxt.data = xvmalloc_bytes(ctxt.size);
     if ( !ctxt.data )
         return -ENOMEM;
 
@@ -200,7 +201,7 @@ int hvm_save_one(struct domain *d, unsig
     else
         domain_unpause(d);
 
-    xfree(ctxt.data);
+    xvfree(ctxt.data);
     return rv;
 }
 




* [PATCH v3 09/22] x86/xstate: enable AMX components
  2021-04-22 14:38 [PATCH v3 00/22] xvmalloc() / x86 xstate area / x86 CPUID / AMX+XFD Jan Beulich
                   ` (7 preceding siblings ...)
  2021-04-22 14:47 ` [PATCH v3 08/22] x86: use xvmalloc() for extended context buffer allocations Jan Beulich
@ 2021-04-22 14:48 ` Jan Beulich
  2021-04-22 14:50 ` [PATCH v3 10/22] x86/CPUID: adjust extended leaves out of range clearing Jan Beulich
                   ` (12 subsequent siblings)
  21 siblings, 0 replies; 40+ messages in thread
From: Jan Beulich @ 2021-04-22 14:48 UTC (permalink / raw)
  To: xen-devel
  Cc: Andrew Cooper, George Dunlap, Ian Jackson, Wei Liu,
	Roger Pau Monné,
	Anthony Perard

These being controlled by XCR0, enabling support is relatively
straightforward. Note however that there won't be any use of them until
their dependent ISA extension CPUID flags get exposed, not least because
recalculate_xstate() handles the dependencies in a somewhat reverse
manner.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v3: Add new states to XSTATE_NONLAZY.
v2: New.

--- a/tools/libs/light/libxl_cpuid.c
+++ b/tools/libs/light/libxl_cpuid.c
@@ -221,6 +221,9 @@ int libxl_cpuid_parse_config(libxl_cpuid
         {"md-clear",     0x00000007,  0, CPUID_REG_EDX, 10,  1},
         {"serialize",    0x00000007,  0, CPUID_REG_EDX, 14,  1},
         {"cet-ibt",      0x00000007,  0, CPUID_REG_EDX, 20,  1},
+        {"amx-bf16",     0x00000007,  0, CPUID_REG_EDX, 22,  1},
+        {"amx-tile",     0x00000007,  0, CPUID_REG_EDX, 24,  1},
+        {"amx-int8",     0x00000007,  0, CPUID_REG_EDX, 25,  1},
         {"ibrsb",        0x00000007,  0, CPUID_REG_EDX, 26,  1},
         {"stibp",        0x00000007,  0, CPUID_REG_EDX, 27,  1},
         {"l1d-flush",    0x00000007,  0, CPUID_REG_EDX, 28,  1},
--- a/tools/misc/xen-cpuid.c
+++ b/tools/misc/xen-cpuid.c
@@ -168,7 +168,8 @@ static const char *const str_7d0[32] =
 
     [18] = "pconfig",
     [20] = "cet-ibt",
-
+    [22] = "amx-bf16",
+    [24] = "amx-tile",      [25] = "amx-int8",
     [26] = "ibrsb",         [27] = "stibp",
     [28] = "l1d-flush",     [29] = "arch-caps",
     [30] = "core-caps",     [31] = "ssbd",
--- a/xen/arch/x86/cpuid.c
+++ b/xen/arch/x86/cpuid.c
@@ -198,6 +198,14 @@ static void recalculate_xstate(struct cp
                           xstate_size(X86_XCR0_PKRU_POS));
     }
 
+    if ( p->feat.amx_tile )
+    {
+        xstates |= X86_XCR0_TILECFG | X86_XCR0_TILEDATA;
+        xstate_size = max(xstate_size,
+                          xstate_offset(X86_XCR0_TILEDATA_POS) +
+                          xstate_size(X86_XCR0_TILEDATA_POS));
+    }
+
     p->xstate.max_size  =  xstate_size;
     p->xstate.xcr0_low  =  xstates & ~XSTATE_XSAVES_ONLY;
     p->xstate.xcr0_high = (xstates & ~XSTATE_XSAVES_ONLY) >> 32;
--- a/xen/arch/x86/xstate.c
+++ b/xen/arch/x86/xstate.c
@@ -640,6 +640,10 @@ static bool valid_xcr0(uint64_t xcr0)
     if ( !(xcr0 & X86_XCR0_BNDREGS) != !(xcr0 & X86_XCR0_BNDCSR) )
         return false;
 
+    /* TILECFG and TILEDATA must be the same. */
+    if ( !(xcr0 & X86_XCR0_TILECFG) != !(xcr0 & X86_XCR0_TILEDATA) )
+        return false;
+
     return true;
 }
 
--- a/xen/include/asm-x86/x86-defns.h
+++ b/xen/include/asm-x86/x86-defns.h
@@ -96,6 +96,10 @@
 #define X86_XCR0_HI_ZMM           (1ULL << X86_XCR0_HI_ZMM_POS)
 #define X86_XCR0_PKRU_POS         9
 #define X86_XCR0_PKRU             (1ULL << X86_XCR0_PKRU_POS)
+#define X86_XCR0_TILECFG_POS      17
+#define X86_XCR0_TILECFG          (1ULL << X86_XCR0_TILECFG_POS)
+#define X86_XCR0_TILEDATA_POS     18
+#define X86_XCR0_TILEDATA         (1ULL << X86_XCR0_TILEDATA_POS)
 #define X86_XCR0_LWP_POS          62
 #define X86_XCR0_LWP              (1ULL << X86_XCR0_LWP_POS)
 
--- a/xen/include/asm-x86/xstate.h
+++ b/xen/include/asm-x86/xstate.h
@@ -32,7 +32,8 @@ extern uint32_t mxcsr_mask;
 #define XSTATE_FP_SSE  (X86_XCR0_FP | X86_XCR0_SSE)
 
 #define XSTATE_ALL     (~(1ULL << 63))
-#define XSTATE_NONLAZY (X86_XCR0_BNDREGS | X86_XCR0_BNDCSR | X86_XCR0_PKRU)
+#define XSTATE_NONLAZY (X86_XCR0_BNDREGS | X86_XCR0_BNDCSR | X86_XCR0_PKRU | \
+                        X86_XCR0_TILECFG | X86_XCR0_TILEDATA)
 #define XSTATE_LAZY    (XSTATE_ALL & ~XSTATE_NONLAZY)
 #define XSTATE_XSAVES_ONLY         0
 #define XSTATE_COMPACTION_ENABLED  (1ULL << 63)
--- a/xen/include/public/arch-x86/cpufeatureset.h
+++ b/xen/include/public/arch-x86/cpufeatureset.h
@@ -268,6 +268,9 @@ XEN_CPUFEATURE(MD_CLEAR,      9*32+10) /
 XEN_CPUFEATURE(TSX_FORCE_ABORT, 9*32+13) /* MSR_TSX_FORCE_ABORT.RTM_ABORT */
 XEN_CPUFEATURE(SERIALIZE,     9*32+14) /*a  SERIALIZE insn */
 XEN_CPUFEATURE(CET_IBT,       9*32+20) /*   CET - Indirect Branch Tracking */
+XEN_CPUFEATURE(AMX_BF16,      9*32+22) /*   AMX BFloat16 instructions */
+XEN_CPUFEATURE(AMX_TILE,      9*32+24) /*   AMX tile architecture */
+XEN_CPUFEATURE(AMX_INT8,      9*32+25) /*   AMX 8-bit integer instructions */
 XEN_CPUFEATURE(IBRSB,         9*32+26) /*A  IBRS and IBPB support (used by Intel) */
 XEN_CPUFEATURE(STIBP,         9*32+27) /*A  STIBP */
 XEN_CPUFEATURE(L1D_FLUSH,     9*32+28) /*S  MSR_FLUSH_CMD and L1D flush. */
--- a/xen/tools/gen-cpuid.py
+++ b/xen/tools/gen-cpuid.py
@@ -222,7 +222,7 @@ def crunch_numbers(state):
         # instruction groups which are specified to require XSAVE for state
         # management.
         XSAVE: [XSAVEOPT, XSAVEC, XGETBV1, XSAVES,
-                AVX, MPX, PKU, LWP],
+                AVX, MPX, PKU, AMX_TILE, LWP],
 
         # AVX is taken to mean hardware support for 256bit registers (which in
         # practice depends on the VEX prefix to encode), and the instructions
@@ -290,6 +290,11 @@ def crunch_numbers(state):
 
         # In principle the TSXLDTRK insns could also be considered independent.
         RTM: [TSXLDTRK],
+
+        # AMX-TILE means hardware support for tile registers and general non-
+        # computational instructions.  All further AMX features are built on top
+        # of AMX-TILE.
+        AMX_TILE: [AMX_BF16, AMX_INT8],
     }
 
     deep_features = tuple(sorted(deps.keys()))




* [PATCH v3 10/22] x86/CPUID: adjust extended leaves out of range clearing
  2021-04-22 14:38 [PATCH v3 00/22] xvmalloc() / x86 xstate area / x86 CPUID / AMX+XFD Jan Beulich
                   ` (8 preceding siblings ...)
  2021-04-22 14:48 ` [PATCH v3 09/22] x86/xstate: enable AMX components Jan Beulich
@ 2021-04-22 14:50 ` Jan Beulich
  2021-04-22 14:50 ` [PATCH v3 11/22] x86/CPUID: move bounding of max_{,sub}leaf fields to library code Jan Beulich
                   ` (11 subsequent siblings)
  21 siblings, 0 replies; 40+ messages in thread
From: Jan Beulich @ 2021-04-22 14:50 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, George Dunlap, Wei Liu, Roger Pau Monné

A maximum extended leaf input value with the high half different from
0x8000 should not be considered valid - all extended leaves should be
cleared in this case. (E.g. a reported value of 0x00000002 would
previously have been treated like 0x80000002, retaining extended leaves
0-2; it now results in all of them getting cleared.)

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
---
TBD: Andrew suggested to drop this patch, but that sub-thread still has
     a loose end. Hence, until I'm convinced otherwise, I've retained
     this patch here. I don't think it conflicts with any of the
     subsequent ones.
---
v2: Integrate into series.

--- a/tools/tests/cpu-policy/test-cpu-policy.c
+++ b/tools/tests/cpu-policy/test-cpu-policy.c
@@ -519,11 +519,22 @@ static void test_cpuid_out_of_range_clea
             },
         },
         {
+            .name = "no extd",
+            .nr_markers = 0,
+            .p = {
+                /* Clears all markers. */
+                .extd.max_leaf = 0,
+
+                .extd.vendor_ebx = 0xc2,
+                .extd.raw_fms = 0xc2,
+            },
+        },
+        {
             .name = "extd",
             .nr_markers = 1,
             .p = {
                 /* Retains marker in leaf 0.  Clears others. */
-                .extd.max_leaf = 0,
+                .extd.max_leaf = 0x80000000,
                 .extd.vendor_ebx = 0xc2,
 
                 .extd.raw_fms = 0xc2,
--- a/xen/lib/x86/cpuid.c
+++ b/xen/lib/x86/cpuid.c
@@ -232,7 +232,9 @@ void x86_cpuid_policy_clear_out_of_range
                     ARRAY_SIZE(p->xstate.raw) - 1);
     }
 
-    zero_leaves(p->extd.raw, (p->extd.max_leaf & 0xffff) + 1,
+    zero_leaves(p->extd.raw,
+                ((p->extd.max_leaf >> 16) == 0x8000
+                 ? (p->extd.max_leaf & 0xffff) + 1 : 0),
                 ARRAY_SIZE(p->extd.raw) - 1);
 }
 




* [PATCH v3 11/22] x86/CPUID: move bounding of max_{,sub}leaf fields to library code
  2021-04-22 14:38 [PATCH v3 00/22] xvmalloc() / x86 xstate area / x86 CPUID / AMX+XFD Jan Beulich
                   ` (9 preceding siblings ...)
  2021-04-22 14:50 ` [PATCH v3 10/22] x86/CPUID: adjust extended leaves out of range clearing Jan Beulich
@ 2021-04-22 14:50 ` Jan Beulich
  2021-04-22 14:51 ` [PATCH v3 12/22] x86/CPUID: enable AMX leaves Jan Beulich
                   ` (10 subsequent siblings)
  21 siblings, 0 replies; 40+ messages in thread
From: Jan Beulich @ 2021-04-22 14:50 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, George Dunlap, Wei Liu, Roger Pau Monné

Break out this logic from calculate_host_policy() to also use it in the
x86 emulator harness, where subsequently we'll want to avoid open-coding
AMX maximum palette bounding.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v2: New.
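
For context (my reading of the change, not spelled out in the patch):
the harness brackets its feature adjustments with the two helpers -
bounding first so that subsequent leaf indexing stays within the
arrays, shrinking last so the synthetic policy doesn't advertise
trailing blank leaves:

    x86_cpuid_policy_fill_native(&cp);
    x86_cpuid_policy_bound_max_leaves(&cp);   /* clamp to array capacity */

    /* ... adjust individual feature bits ... */

    x86_cpuid_policy_shrink_max_leaves(&cp);  /* drop trailing blank leaves */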

--- a/tools/tests/x86_emulator/x86-emulate.c
+++ b/tools/tests/x86_emulator/x86-emulate.c
@@ -79,6 +79,7 @@ bool emul_test_init(void)
     unsigned long sp;
 
     x86_cpuid_policy_fill_native(&cp);
+    x86_cpuid_policy_bound_max_leaves(&cp);
 
     /*
      * The emulator doesn't use these instructions, so can always emulate
@@ -91,6 +92,8 @@ bool emul_test_init(void)
     cp.feat.rdpid = true;
     cp.extd.clzero = true;
 
+    x86_cpuid_policy_shrink_max_leaves(&cp);
+
     if ( cpu_has_xsave )
     {
         unsigned int tmp, ebx;
--- a/xen/arch/x86/cpuid.c
+++ b/xen/arch/x86/cpuid.c
@@ -322,12 +322,7 @@ static void __init calculate_host_policy
 
     *p = raw_cpuid_policy;
 
-    p->basic.max_leaf =
-        min_t(uint32_t, p->basic.max_leaf,   ARRAY_SIZE(p->basic.raw) - 1);
-    p->feat.max_subleaf =
-        min_t(uint32_t, p->feat.max_subleaf, ARRAY_SIZE(p->feat.raw) - 1);
-    p->extd.max_leaf = 0x80000000 | min_t(uint32_t, p->extd.max_leaf & 0xffff,
-                                          ARRAY_SIZE(p->extd.raw) - 1);
+    x86_cpuid_policy_bound_max_leaves(p);
 
     cpuid_featureset_to_policy(boot_cpu_data.x86_capability, p);
     recalculate_xstate(p);
--- a/xen/include/xen/lib/x86/cpuid.h
+++ b/xen/include/xen/lib/x86/cpuid.h
@@ -352,6 +352,12 @@ void x86_cpuid_policy_fill_native(struct
 void x86_cpuid_policy_clear_out_of_range_leaves(struct cpuid_policy *p);
 
 /**
+ * Bound max leaf/subleaf values according to the capacity of the respective
+ * arrays in struct cpuid_policy.
+ */
+void x86_cpuid_policy_bound_max_leaves(struct cpuid_policy *p);
+
+/**
  * Shrink max leaf/subleaf values such that the last respective valid entry
  * isn't all blank.  While permitted by the spec, such extraneous leaves may
  * provide undue "hints" to guests.
--- a/xen/lib/x86/cpuid.c
+++ b/xen/lib/x86/cpuid.c
@@ -238,6 +238,16 @@ void x86_cpuid_policy_clear_out_of_range
                 ARRAY_SIZE(p->extd.raw) - 1);
 }
 
+void x86_cpuid_policy_bound_max_leaves(struct cpuid_policy *p)
+{
+    p->basic.max_leaf =
+        min_t(uint32_t, p->basic.max_leaf, ARRAY_SIZE(p->basic.raw) - 1);
+    p->feat.max_subleaf =
+        min_t(uint32_t, p->feat.max_subleaf, ARRAY_SIZE(p->feat.raw) - 1);
+    p->extd.max_leaf = 0x80000000 | min_t(uint32_t, p->extd.max_leaf & 0xffff,
+                                          ARRAY_SIZE(p->extd.raw) - 1);
+}
+
 void x86_cpuid_policy_shrink_max_leaves(struct cpuid_policy *p)
 {
     unsigned int i;



^ permalink raw reply	[flat|nested] 40+ messages in thread

* [PATCH v3 12/22] x86/CPUID: enable AMX leaves
  2021-04-22 14:38 [PATCH v3 00/22] xvmalloc() / x86 xstate area / x86 CPUID / AMX+XFD Jan Beulich
                   ` (10 preceding siblings ...)
  2021-04-22 14:50 ` [PATCH v3 11/22] x86/CPUID: move bounding of max_{,sub}leaf fields to library code Jan Beulich
@ 2021-04-22 14:51 ` Jan Beulich
  2021-04-22 14:52 ` [PATCH v3 13/22] x86: XFD enabling Jan Beulich
                   ` (9 subsequent siblings)
  21 siblings, 0 replies; 40+ messages in thread
From: Jan Beulich @ 2021-04-22 14:51 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, George Dunlap, Wei Liu, Roger Pau Monné

This requires bumping the number of basic leaves we support. Apart from
that, the logic is modelled as closely as possible on the existing
leaf 7 handling.

The checks in x86_cpu_policies_are_compatible() may be stricter than
they ultimately need to be, but I'd rather err on the safe side to
begin with.
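
For reference, the enumeration pattern being modelled (and also used by
x86_cpuid_policy_fill_native() further down) looks roughly like this
(an illustrative sketch only):

    unsigned int i, max_palette;
    struct cpuid_leaf l;

    cpuid_count_leaf(0x1d, 0, &l);      /* l.a = max_palette */
    max_palette = l.a;
    for ( i = 1; i <= max_palette; ++i )
    {
        cpuid_count_leaf(0x1d, i, &l);
        /* l.a: tot_bytes (15:0), bytes_per_tile (31:16) */
        /* l.b: bytes_per_row (15:0), num_regs (31:16)   */
        /* l.c: max_rows (15:0)                          */
    }
    cpuid_count_leaf(0x1e, 0, &l);      /* l.b: maxk (7:0), maxn (23:8) */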

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v2: New.
---
It's not clear to me to what extent libxl_cpuid.c would want extending:
It doesn't appear to offer a way to override the maximum subleaf of
leaf 7. In fact I can't spot a mechanism to override the maximum
extended leaf either.

--- a/tools/tests/cpu-policy/test-cpu-policy.c
+++ b/tools/tests/cpu-policy/test-cpu-policy.c
@@ -190,6 +190,40 @@ static void test_cpuid_serialise_success
             },
             .nr_leaves = 4 + 0xd + 1 + 1,
         },
+
+        /* Leaf 0x1d serialisation stops at max_palette. */
+        {
+            .name = "empty leaf 0x1d",
+            .p = {
+                .basic.max_leaf = 0x1d,
+            },
+            .nr_leaves = 4 + 0x1d + 1,
+        },
+        {
+            .name = "partial leaf 0x1d",
+            .p = {
+                .basic.max_leaf = 0x1d,
+                .tile.max_palette = 1,
+            },
+            .nr_leaves = 4 + 0x1d + 1 + 1,
+        },
+
+        /* Leaf 0x1e serialisation stops at 0. */
+        {
+            .name = "empty leaf 0x1e",
+            .p = {
+                .basic.max_leaf = 0x1e,
+            },
+            .nr_leaves = 4 + 0x1e + 1,
+        },
+        {
+            .name = "partial leaf 0x1e",
+            .p = {
+                .basic.max_leaf = 0x1e,
+                .tmul.maxk = 16,
+            },
+            .nr_leaves = 4 + 0x1e + 1,
+        },
     };
 
     printf("Testing CPUID serialise success:\n");
@@ -321,6 +355,14 @@ static void test_cpuid_deserialise_failu
             .leaf = { .leaf = 0xd, .subleaf = CPUID_GUEST_NR_XSTATE },
         },
         {
+            .name = "OoB tile leaf",
+            .leaf = { .leaf = 0x1d, .subleaf = CPUID_GUEST_NR_PALETTE },
+        },
+        {
+            .name = "OoB tmul leaf",
+            .leaf = { .leaf = 0x1e, .subleaf = CPUID_GUEST_NR_TMUL },
+        },
+        {
             .name = "OoB extd leaf",
             .leaf = { .leaf = 0x80000000 | CPUID_GUEST_NR_EXTD },
         },
@@ -432,6 +474,8 @@ static void test_cpuid_out_of_range_clea
                 .topo.raw[0].a = 0xc2,
                 .xstate.raw[0].a = 0xc2,
                 .xstate.raw[1].a = 0xc2,
+                .tile.raw[0].a = 0xc2,
+                .tmul.raw[0].a = 0xc2,
             },
         },
         {
@@ -447,6 +491,8 @@ static void test_cpuid_out_of_range_clea
                 .topo.raw[0].a = 0xc2,
                 .xstate.raw[0].a = 0xc2,
                 .xstate.raw[1].a = 0xc2,
+                .tile.raw[0].a = 0xc2,
+                .tmul.raw[0].a = 0xc2,
             },
         },
         {
@@ -461,6 +507,8 @@ static void test_cpuid_out_of_range_clea
                 .topo.raw[0].a = 0xc2,
                 .xstate.raw[0].a = 0xc2,
                 .xstate.raw[1].a = 0xc2,
+                .tile.raw[0].a = 0xc2,
+                .tmul.raw[0].a = 0xc2,
             },
         },
         {
@@ -474,6 +522,8 @@ static void test_cpuid_out_of_range_clea
                 .topo.raw[1].b = 0xc2,
                 .xstate.raw[0].a = 0xc2,
                 .xstate.raw[1].a = 0xc2,
+                .tile.raw[0].a = 0xc2,
+                .tmul.raw[0].a = 0xc2,
             },
         },
         {
@@ -488,6 +538,8 @@ static void test_cpuid_out_of_range_clea
 
                 .xstate.raw[2].b = 0xc2,
                 .xstate.raw[3].b = 0xc2,
+                .tile.raw[0].a = 0xc2,
+                .tmul.raw[0].a = 0xc2,
             },
         },
         {
@@ -530,6 +582,34 @@ static void test_cpuid_out_of_range_clea
             },
         },
         {
+            .name = "tile no palette",
+            .nr_markers = 0,
+            .p = {
+                /* First two subleaves invalid as a pair.  Others cleared. */
+                .basic.max_leaf = 0x1d,
+                .xstate.xcr0_low = XSTATE_FP_SSE,
+
+                .tile.raw[0].a = 0xc2,
+                .tile.raw[1].b = 0xc2,
+                .tmul.raw[0].a = 0xc2,
+            },
+        },
+        {
+            .name = "tile palette 1",
+            .nr_markers = 1,
+            .p = {
+                /* First two subleaves valid as a pair.  Others cleared. */
+                .basic.max_leaf = 0x1d,
+                .feat.amx_tile = 1,
+                .xstate.xcr0_low = XSTATE_FP_SSE | X86_XCR0_TILECFG |
+                                   X86_XCR0_TILEDATA,
+                .tile.raw[0].a = 1,
+                .tile.raw[1].b = 0xc2,
+
+                .tmul.raw[0].a = 0xc2,
+            },
+        },
+        {
             .name = "extd",
             .nr_markers = 1,
             .p = {
@@ -624,6 +704,24 @@ static void test_cpuid_maximum_leaf_shri
             },
         },
         {
+            .name = "tile",
+            .p = {
+                /* Subleaf 1 only with some valid value. */
+                .basic.max_leaf = 0x1d,
+                .tile.raw[0].a = 1,
+                .tile.raw[1].a = 1024,
+            },
+        },
+        {
+            .name = "tmul",
+            .p = {
+                /* Subleaf 0 only with some valid values. */
+                .basic.max_leaf = 0x1e,
+                .tmul.maxk = 16,
+                .tmul.maxn = 16,
+            },
+        },
+        {
             .name = "extd",
             .p = {
                 /* Commonly available information only. */
@@ -643,6 +741,7 @@ static void test_cpuid_maximum_leaf_shri
 
         p->basic.max_leaf = ARRAY_SIZE(p->basic.raw) - 1;
         p->feat.max_subleaf = ARRAY_SIZE(p->feat.raw) - 1;
+        p->tile.max_palette = ARRAY_SIZE(p->tile.raw) - 1;
         p->extd.max_leaf = 0x80000000 | (ARRAY_SIZE(p->extd.raw) - 1);
 
         x86_cpuid_policy_shrink_max_leaves(p);
@@ -660,6 +759,10 @@ static void test_cpuid_maximum_leaf_shri
              fail("  Test %s feat fail - expected %#x, got %#x\n",
                   t->name, t->p.feat.max_subleaf, p->feat.max_subleaf);
 
+        if ( p->tile.max_palette != t->p.tile.max_palette )
+             fail("  Test %s tile fail - expected %#x, got %#x\n",
+                  t->name, t->p.tile.max_palette, p->tile.max_palette);
+
         free(p);
     }
 }
--- a/xen/arch/x86/cpuid.c
+++ b/xen/arch/x86/cpuid.c
@@ -233,6 +233,29 @@ static void recalculate_xstate(struct cp
     }
 }
 
+static void recalculate_tile(struct cpuid_policy *p)
+{
+    unsigned int i;
+
+    if ( !p->feat.amx_tile )
+    {
+        memset(&p->tile, 0, sizeof(p->tile));
+        memset(&p->tmul, 0, sizeof(p->tmul));
+        return;
+    }
+
+    p->tile.raw[0].b = p->tile.raw[0].c = p->tile.raw[0].d = 0;
+
+    for ( i = 1; i <= p->tile.max_palette; ++i )
+    {
+        p->tile.raw[i].c &= 0x0000ffff;
+        p->tile.raw[i].d = 0;
+    }
+
+    p->tmul.raw[0].a = p->tmul.raw[0].c = p->tmul.raw[0].d = 0;
+    p->tmul.raw[0].b &= 0x00ffffff;
+}
+
 /*
  * Misc adjustments to the policy.  Mostly clobbering reserved fields and
  * duplicating shared fields.  Intentionally hidden fields are annotated.
@@ -252,6 +275,8 @@ static void recalculate_misc(struct cpui
 
     p->basic.raw[0xc] = EMPTY_LEAF;
 
+    zero_leaves(p->basic.raw, 0xe, 0x1c);
+
     p->extd.e1d &= ~CPUID_COMMON_1D_FEATURES;
 
     /* Most of Power/RAS hidden from guests. */
@@ -326,6 +351,7 @@ static void __init calculate_host_policy
 
     cpuid_featureset_to_policy(boot_cpu_data.x86_capability, p);
     recalculate_xstate(p);
+    recalculate_tile(p);
     recalculate_misc(p);
 
     /* When vPMU is disabled, drop it from the host policy. */
@@ -413,6 +439,7 @@ static void __init calculate_pv_max_poli
     sanitise_featureset(pv_featureset);
     cpuid_featureset_to_policy(pv_featureset, p);
     recalculate_xstate(p);
+    recalculate_tile(p);
 
     p->extd.raw[0xa] = EMPTY_LEAF; /* No SVM for PV guests. */
 
@@ -437,6 +464,7 @@ static void __init calculate_pv_def_poli
     sanitise_featureset(pv_featureset);
     cpuid_featureset_to_policy(pv_featureset, p);
     recalculate_xstate(p);
+    recalculate_tile(p);
 
     x86_cpuid_policy_shrink_max_leaves(p);
 }
@@ -504,6 +532,7 @@ static void __init calculate_hvm_max_pol
     sanitise_featureset(hvm_featureset);
     cpuid_featureset_to_policy(hvm_featureset, p);
     recalculate_xstate(p);
+    recalculate_tile(p);
 
     x86_cpuid_policy_shrink_max_leaves(p);
 }
@@ -530,6 +559,7 @@ static void __init calculate_hvm_def_pol
     sanitise_featureset(hvm_featureset);
     cpuid_featureset_to_policy(hvm_featureset, p);
     recalculate_xstate(p);
+    recalculate_tile(p);
 
     x86_cpuid_policy_shrink_max_leaves(p);
 }
@@ -600,6 +630,7 @@ void recalculate_cpuid_policy(struct dom
 
     p->basic.max_leaf   = min(p->basic.max_leaf,   max->basic.max_leaf);
     p->feat.max_subleaf = min(p->feat.max_subleaf, max->feat.max_subleaf);
+    p->tile.max_palette = min(p->tile.max_palette, max->tile.max_palette);
     p->extd.max_leaf    = 0x80000000 | min(p->extd.max_leaf & 0xffff,
                                            ((p->x86_vendor & (X86_VENDOR_AMD |
                                                               X86_VENDOR_HYGON))
@@ -690,6 +721,7 @@ void recalculate_cpuid_policy(struct dom
     p->extd.maxlinaddr = p->extd.lm ? 48 : 32;
 
     recalculate_xstate(p);
+    recalculate_tile(p);
     recalculate_misc(p);
 
     for ( i = 0; i < ARRAY_SIZE(p->cache.raw); ++i )
@@ -812,6 +844,22 @@ void guest_cpuid(const struct vcpu *v, u
             *res = array_access_nospec(p->xstate.raw, subleaf);
             break;
 
+        case 0x1d:
+            ASSERT(p->tile.max_palette < ARRAY_SIZE(p->tile.raw));
+            if ( subleaf > min_t(uint32_t, p->tile.max_palette,
+                                 ARRAY_SIZE(p->tile.raw) - 1) )
+                return;
+
+            *res = array_access_nospec(p->tile.raw, subleaf);
+            break;
+
+        case 0x1e:
+            if ( subleaf >= ARRAY_SIZE(p->tmul.raw) )
+                return;
+
+            *res = array_access_nospec(p->tmul.raw, subleaf);
+            break;
+
         default:
             *res = array_access_nospec(p->basic.raw, leaf);
             break;
@@ -1145,6 +1193,8 @@ static void __init __maybe_unused build_
                  sizeof(raw_cpuid_policy.feat.raw));
     BUILD_BUG_ON(sizeof(raw_cpuid_policy.xstate) !=
                  sizeof(raw_cpuid_policy.xstate.raw));
+    BUILD_BUG_ON(sizeof(raw_cpuid_policy.tile) !=
+                 sizeof(raw_cpuid_policy.tile.raw));
     BUILD_BUG_ON(sizeof(raw_cpuid_policy.extd) !=
                  sizeof(raw_cpuid_policy.extd.raw));
 }
--- a/xen/include/xen/lib/x86/cpuid.h
+++ b/xen/include/xen/lib/x86/cpuid.h
@@ -78,11 +78,13 @@ unsigned int x86_cpuid_lookup_vendor(uin
  */
 const char *x86_cpuid_vendor_to_str(unsigned int vendor);
 
-#define CPUID_GUEST_NR_BASIC      (0xdu + 1)
+#define CPUID_GUEST_NR_BASIC      (0x1eu + 1)
 #define CPUID_GUEST_NR_CACHE      (5u + 1)
 #define CPUID_GUEST_NR_FEAT       (1u + 1)
 #define CPUID_GUEST_NR_TOPO       (1u + 1)
 #define CPUID_GUEST_NR_XSTATE     (62u + 1)
+#define CPUID_GUEST_NR_PALETTE    (1u + 1)
+#define CPUID_GUEST_NR_TMUL       (0u + 1)
 #define CPUID_GUEST_NR_EXTD_INTEL (0x8u + 1)
 #define CPUID_GUEST_NR_EXTD_AMD   (0x1cu + 1)
 #define CPUID_GUEST_NR_EXTD       MAX(CPUID_GUEST_NR_EXTD_INTEL, \
@@ -225,6 +227,35 @@ struct cpuid_policy
         } comp[CPUID_GUEST_NR_XSTATE];
     } xstate;
 
+    /* Structured tile information leaf: 0x00000001d[xx] */
+    union {
+        struct cpuid_leaf raw[CPUID_GUEST_NR_PALETTE];
+        struct {
+            /* Subleaf 0. */
+            uint32_t max_palette;
+            uint32_t /* b */:32, /* c */:32, /* d */:32;
+        };
+
+        /* Per-palette common state.  Valid for i >= 1. */
+        struct {
+            uint16_t tot_bytes, bytes_per_tile;
+            uint16_t bytes_per_row, num_regs;
+            uint16_t max_rows, :16;
+            uint32_t /* d */:32;
+        } palette[CPUID_GUEST_NR_PALETTE];
+    } tile;
+
+    /* Structured tmul information leaf: 0x00000001e[xx] */
+    union {
+        struct cpuid_leaf raw[CPUID_GUEST_NR_TMUL];
+        struct {
+            /* Subleaf 0. */
+            uint32_t /* a */:32;
+            uint32_t maxk:8, maxn:16, :8;
+            uint32_t /* c */:32, /* d */:32;
+        };
+    } tmul;
+
     /* Extended leaves: 0x800000xx */
     union {
         struct cpuid_leaf raw[CPUID_GUEST_NR_EXTD];
--- a/xen/lib/x86/cpuid.c
+++ b/xen/lib/x86/cpuid.c
@@ -170,6 +170,18 @@ void x86_cpuid_policy_fill_native(struct
         }
     }
 
+    if ( p->basic.max_leaf >= 0x1d )
+    {
+        cpuid_count_leaf(0x1d, 0, &p->tile.raw[0]);
+
+        for ( i = 1; i <= MIN(p->tile.max_palette,
+                              ARRAY_SIZE(p->tile.raw) - 1); ++i )
+            cpuid_count_leaf(0x1d, i, &p->tile.raw[i]);
+    }
+
+    if ( p->basic.max_leaf >= 0x1e )
+        cpuid_count_leaf(0x1e, 0, &p->tmul.raw[0]);
+
     /* Extended leaves. */
     cpuid_leaf(0x80000000, &p->extd.raw[0]);
     for ( i = 1; i <= MIN(p->extd.max_leaf & 0xffffU,
@@ -232,6 +244,19 @@ void x86_cpuid_policy_clear_out_of_range
                     ARRAY_SIZE(p->xstate.raw) - 1);
     }
 
+    if ( p->basic.max_leaf < 0x1d ||
+         (cpuid_policy_xstates(p) &
+          (X86_XCR0_TILECFG | X86_XCR0_TILEDATA)) !=
+         (X86_XCR0_TILECFG | X86_XCR0_TILEDATA) )
+        memset(p->tile.raw, 0, sizeof(p->tile.raw));
+    else
+        zero_leaves(p->tile.raw, p->tile.max_palette + 1,
+                    ARRAY_SIZE(p->tile.raw) - 1);
+
+    if ( p->basic.max_leaf < 0x1e || !p->tile.max_palette ||
+         (!p->feat.amx_int8 && !p->feat.amx_bf16) )
+        memset(p->tmul.raw, 0, sizeof(p->tmul.raw));
+
     zero_leaves(p->extd.raw,
                 ((p->extd.max_leaf >> 16) == 0x8000
                  ? (p->extd.max_leaf & 0xffff) + 1 : 0),
@@ -244,6 +269,8 @@ void x86_cpuid_policy_bound_max_leaves(s
         min_t(uint32_t, p->basic.max_leaf, ARRAY_SIZE(p->basic.raw) - 1);
     p->feat.max_subleaf =
         min_t(uint32_t, p->feat.max_subleaf, ARRAY_SIZE(p->feat.raw) - 1);
+    p->tile.max_palette =
+        min_t(uint32_t, p->tile.max_palette, ARRAY_SIZE(p->tile.raw) - 1);
     p->extd.max_leaf = 0x80000000 | min_t(uint32_t, p->extd.max_leaf & 0xffff,
                                           ARRAY_SIZE(p->extd.raw) - 1);
 }
@@ -271,6 +298,21 @@ void x86_cpuid_policy_shrink_max_leaves(
      */
     p->basic.raw[0xd] = p->xstate.raw[0];
 
+    for ( i = p->tile.max_palette; i; --i )
+        if ( p->tile.raw[i].a | p->tile.raw[i].b |
+             p->tile.raw[i].c | p->tile.raw[i].d )
+            break;
+    if ( i )
+        p->tile.max_palette = i;
+    else
+    {
+        ASSERT(!p->feat.amx_tile);
+        zero_leaves(p->tile.raw, 0, 0);
+    }
+    p->basic.raw[0x1d] = p->tile.raw[0];
+
+    p->basic.raw[0x1e] = p->tmul.raw[0];
+
     for ( i = p->basic.max_leaf; i; --i )
         if ( p->basic.raw[i].a | p->basic.raw[i].b |
              p->basic.raw[i].c | p->basic.raw[i].d )
@@ -404,6 +446,19 @@ int x86_cpuid_copy_to_buffer(const struc
             break;
         }
 
+        case 0x1d:
+            for ( subleaf = 0;
+                  subleaf <= MIN(p->tile.max_palette,
+                                 ARRAY_SIZE(p->tile.raw) - 1); ++subleaf )
+                COPY_LEAF(leaf, subleaf, &p->tile.raw[subleaf]);
+            break;
+
+        case 0x1e:
+            for ( subleaf = 0;
+                  subleaf <= ARRAY_SIZE(p->tmul.raw) - 1; ++subleaf )
+                COPY_LEAF(leaf, subleaf, &p->tmul.raw[subleaf]);
+            break;
+
         default:
             COPY_LEAF(leaf, XEN_CPUID_NO_SUBLEAF, &p->basic.raw[leaf]);
             break;
@@ -496,6 +551,20 @@ int x86_cpuid_copy_from_buffer(struct cp
                 array_access_nospec(p->xstate.raw, data.subleaf) = l;
                 break;
 
+            case 0x1d:
+                if ( data.subleaf >= ARRAY_SIZE(p->tile.raw) )
+                    goto out_of_range;
+
+                array_access_nospec(p->tile.raw, data.subleaf) = l;
+                break;
+
+            case 0x1e:
+                if ( data.subleaf >= ARRAY_SIZE(p->tmul.raw) )
+                    goto out_of_range;
+
+                array_access_nospec(p->tmul.raw, data.subleaf) = l;
+                break;
+
             default:
                 if ( data.subleaf != XEN_CPUID_NO_SUBLEAF )
                     goto out_of_range;
--- a/xen/lib/x86/policy.c
+++ b/xen/lib/x86/policy.c
@@ -7,6 +7,7 @@ int x86_cpu_policies_are_compatible(cons
                                     struct cpu_policy_errors *err)
 {
     struct cpu_policy_errors e = INIT_CPU_POLICY_ERRORS;
+    unsigned int i;
     int ret = -EINVAL;
 
 #define NA XEN_CPUID_NO_SUBLEAF
@@ -21,6 +22,31 @@ int x86_cpu_policies_are_compatible(cons
     if ( guest->cpuid->feat.max_subleaf > host->cpuid->feat.max_subleaf )
         FAIL_CPUID(7, 0);
 
+    if ( (guest->cpuid->feat.amx_tile && !guest->cpuid->tile.max_palette) ||
+         guest->cpuid->tile.max_palette > host->cpuid->tile.max_palette )
+        FAIL_CPUID(0x1d, 0);
+
+    for ( i = 1; i <= guest->cpuid->tile.max_palette; ++i )
+    {
+        const typeof(guest->cpuid->tile.palette[0]) *gt, *ht;
+
+        gt = &guest->cpuid->tile.palette[i];
+        ht = &host->cpuid->tile.palette[i];
+
+        if ( gt->tot_bytes != ht->tot_bytes ||
+             gt->bytes_per_tile != ht->bytes_per_tile ||
+             gt->bytes_per_row != ht->bytes_per_row ||
+             !gt->num_regs || gt->num_regs > ht->num_regs ||
+             !gt->max_rows || gt->max_rows > ht->max_rows )
+            FAIL_CPUID(0x1d, i);
+    }
+
+    if ( ((guest->cpuid->feat.amx_int8 || guest->cpuid->feat.amx_bf16) &&
+          (!guest->cpuid->tmul.maxk || !guest->cpuid->tmul.maxn)) ||
+         guest->cpuid->tmul.maxk > host->cpuid->tmul.maxk ||
+         guest->cpuid->tmul.maxn > host->cpuid->tmul.maxn )
+        FAIL_CPUID(0x1e, 0);
+
     if ( guest->cpuid->extd.max_leaf > host->cpuid->extd.max_leaf )
         FAIL_CPUID(0x80000000, NA);
 
--- a/xen/lib/x86/private.h
+++ b/xen/lib/x86/private.h
@@ -17,13 +17,17 @@
 
 #else
 
+#include <assert.h>
 #include <errno.h>
 #include <inttypes.h>
 #include <stdbool.h>
 #include <stddef.h>
 #include <string.h>
 
+#define ASSERT assert
+
 #include <xen/asm/msr-index.h>
+#include <xen/asm/x86-defns.h>
 #include <xen/asm/x86-vendors.h>
 
 #include <xen-tools/libs.h>



^ permalink raw reply	[flat|nested] 40+ messages in thread

* [PATCH v3 13/22] x86: XFD enabling
  2021-04-22 14:38 [PATCH v3 00/22] xvmalloc() / x86 xstate area / x86 CPUID / AMX+XFD Jan Beulich
                   ` (11 preceding siblings ...)
  2021-04-22 14:51 ` [PATCH v3 12/22] x86/CPUID: enable AMX leaves Jan Beulich
@ 2021-04-22 14:52 ` Jan Beulich
  2021-04-22 14:53 ` [PATCH v3 14/22] x86emul: introduce X86EMUL_FPU_{tilecfg,tile} Jan Beulich
                   ` (8 subsequent siblings)
  21 siblings, 0 replies; 40+ messages in thread
From: Jan Beulich @ 2021-04-22 14:52 UTC (permalink / raw)
  To: xen-devel
  Cc: Andrew Cooper, George Dunlap, Ian Jackson, Wei Liu,
	Roger Pau Monné,
	Anthony Perard

Just like for XCR0, we generally want to run with the guest's settings
of the involved MSRs while in Xen. Hence saving/restoring of guest
values only needs to happen during context switch.

While adding the feature to libxl's table I noticed that none of the
other XSAVE sub-features have entries there. These get added at the
same time.
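
For context, the usage model these MSRs enable on the guest side is
roughly the following (an OS-side sketch only, not Xen code, using
Xen-style wrmsrl()/rdmsrl() accessors for brevity):

    /* Arm XFD so that a task's first AMX use raises #NM. */
    wrmsrl(MSR_XFD, X86_XCR0_TILEDATA);

    /* In the #NM handler: */
    uint64_t err;

    rdmsrl(MSR_XFD_ERR, err);
    if ( err & X86_XCR0_TILEDATA )
    {
        /* ... allocate the large TILEDATA save area for this task ... */
        wrmsrl(MSR_XFD_ERR, 0);
        wrmsrl(MSR_XFD, 0); /* disarm, then re-execute the faulting insn */
    }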

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v3: New.
---
I wasn't sure whether we want to have AMX depend on XFD. Purely from a
spec point of view the two are independent features (without XFD one
simply wouldn't have a way to disable AMX for the purpose of lazily
allocating state saving space).

--- a/tools/libs/light/libxl_cpuid.c
+++ b/tools/libs/light/libxl_cpuid.c
@@ -237,6 +237,12 @@ int libxl_cpuid_parse_config(libxl_cpuid
         {"fsrs",         0x00000007,  1, CPUID_REG_EAX, 11,  1},
         {"fsrcs",        0x00000007,  1, CPUID_REG_EAX, 12,  1},
 
+        {"xsaveopt",     0x0000000d,  1, CPUID_REG_EAX,  0,  1},
+        {"xsavec",       0x0000000d,  1, CPUID_REG_EAX,  1,  1},
+        {"xgetbv1",      0x0000000d,  1, CPUID_REG_EAX,  2,  1},
+        {"xsaves",       0x0000000d,  1, CPUID_REG_EAX,  3,  1},
+        {"xfd",          0x0000000d,  1, CPUID_REG_EAX,  4,  1},
+
         {"lahfsahf",     0x80000001, NA, CPUID_REG_ECX,  0,  1},
         {"cmplegacy",    0x80000001, NA, CPUID_REG_ECX,  1,  1},
         {"svm",          0x80000001, NA, CPUID_REG_ECX,  2,  1},
--- a/tools/misc/xen-cpuid.c
+++ b/tools/misc/xen-cpuid.c
@@ -116,6 +116,7 @@ static const char *const str_Da1[32] =
 {
     [ 0] = "xsaveopt", [ 1] = "xsavec",
     [ 2] = "xgetbv1",  [ 3] = "xsaves",
+    [ 4] = "xfd",
 };
 
 static const char *const str_7c0[32] =
--- a/xen/arch/x86/cpuid.c
+++ b/xen/arch/x86/cpuid.c
@@ -230,6 +230,7 @@ static void recalculate_xstate(struct cp
         p->xstate.comp[i].offset = xstate_offset(i);
         p->xstate.comp[i].xss    = curr_xstate & XSTATE_XSAVES_ONLY;
         p->xstate.comp[i].align  = curr_xstate & xstate_align;
+        p->xstate.comp[i].xfd    = curr_xstate & xfd_mask;
     }
 }
 
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -1837,6 +1837,12 @@ void paravirt_ctxt_switch_to(struct vcpu
 
     if ( cpu_has_msr_tsc_aux )
         wrmsr_tsc_aux(v->arch.msrs->tsc_aux);
+
+    if ( v->domain->arch.cpuid->xstate.xfd )
+    {
+        wrmsrl(MSR_XFD, v->arch.msrs->xfd);
+        wrmsrl(MSR_XFD_ERR, v->arch.msrs->xfd_err);
+    }
 }
 
 /* Update per-VCPU guest runstate shared memory area (if registered). */
--- a/xen/arch/x86/domctl.c
+++ b/xen/arch/x86/domctl.c
@@ -1101,6 +1101,8 @@ long arch_do_domctl(
         static const uint32_t msrs_to_send[] = {
             MSR_SPEC_CTRL,
             MSR_INTEL_MISC_FEATURES_ENABLES,
+            MSR_XFD,
+            MSR_XFD_ERR,
             MSR_TSC_AUX,
             MSR_AMD64_DR0_ADDRESS_MASK,
             MSR_AMD64_DR1_ADDRESS_MASK,
--- a/xen/arch/x86/hvm/vmx/vmcs.c
+++ b/xen/arch/x86/hvm/vmx/vmcs.c
@@ -1178,6 +1178,12 @@ static int construct_vmcs(struct vcpu *v
         if ( (vmexit_ctl & VM_EXIT_CLEAR_BNDCFGS) &&
              (vmentry_ctl & VM_ENTRY_LOAD_BNDCFGS) )
             vmx_clear_msr_intercept(v, MSR_IA32_BNDCFGS, VMX_MSR_RW);
+
+        if ( d->arch.cpuid->xstate.xfd )
+        {
+            vmx_clear_msr_intercept(v, MSR_XFD, VMX_MSR_RW);
+            vmx_clear_msr_intercept(v, MSR_XFD_ERR, VMX_MSR_RW);
+        }
     }
 
     /* I/O access bitmap. */
--- a/xen/arch/x86/hvm/vmx/vmx.c
+++ b/xen/arch/x86/hvm/vmx/vmx.c
@@ -539,6 +539,12 @@ static void vmx_save_guest_msrs(struct v
      */
     v->arch.hvm.vmx.shadow_gs = read_gs_shadow();
 
+    if ( v->domain->arch.cpuid->xstate.xfd )
+    {
+        rdmsrl(MSR_XFD, msrs->xfd);
+        rdmsrl(MSR_XFD_ERR, msrs->xfd_err);
+    }
+
     if ( v->arch.hvm.vmx.ipt_active )
     {
         rdmsrl(MSR_RTIT_OUTPUT_MASK, msrs->rtit.output_mask);
@@ -558,6 +564,12 @@ static void vmx_restore_guest_msrs(struc
     if ( cpu_has_msr_tsc_aux )
         wrmsr_tsc_aux(msrs->tsc_aux);
 
+    if ( v->domain->arch.cpuid->xstate.xfd )
+    {
+        wrmsrl(MSR_XFD, msrs->xfd);
+        wrmsrl(MSR_XFD_ERR, msrs->xfd_err);
+    }
+
     if ( v->arch.hvm.vmx.ipt_active )
     {
         wrmsrl(MSR_RTIT_OUTPUT_BASE, page_to_maddr(v->vmtrace.pg));
--- a/xen/arch/x86/msr.c
+++ b/xen/arch/x86/msr.c
@@ -29,6 +29,7 @@
 #include <asm/hvm/viridian.h>
 #include <asm/msr.h>
 #include <asm/setup.h>
+#include <asm/xstate.h>
 
 #include <public/hvm/params.h>
 
@@ -267,6 +268,18 @@ int guest_rdmsr(struct vcpu *v, uint32_t
         *val = msrs->misc_features_enables.raw;
         break;
 
+    case MSR_XFD:
+    case MSR_XFD_ERR:
+        if ( !cp->xstate.xfd )
+            goto gp_fault;
+        if ( v == curr && is_hvm_domain(d) )
+            rdmsrl(msr, *val);
+        else if ( msr == MSR_XFD )
+            *val = msrs->xfd;
+        else
+            *val = msrs->xfd_err;
+        break;
+
     case MSR_IA32_MCG_CAP     ... MSR_IA32_MCG_CTL:      /* 0x179 -> 0x17b */
     case MSR_IA32_MCx_CTL2(0) ... MSR_IA32_MCx_CTL2(31): /* 0x280 -> 0x29f */
     case MSR_IA32_MCx_CTL(0)  ... MSR_IA32_MCx_MISC(31): /* 0x400 -> 0x47f */
@@ -523,6 +536,19 @@ int guest_wrmsr(struct vcpu *v, uint32_t
         break;
     }
 
+    case MSR_XFD:
+    case MSR_XFD_ERR:
+        if ( !cp->xstate.xfd ||
+             (val & ~(cpuid_policy_xstates(cp) & xfd_mask)) )
+            goto gp_fault;
+        if ( v == curr )
+            wrmsrl(msr, val);
+        if ( msr == MSR_XFD )
+            msrs->xfd = val;
+        else
+            msrs->xfd_err = val;
+        break;
+
     case MSR_IA32_MCG_CAP     ... MSR_IA32_MCG_CTL:      /* 0x179 -> 0x17b */
     case MSR_IA32_MCx_CTL2(0) ... MSR_IA32_MCx_CTL2(31): /* 0x280 -> 0x29f */
     case MSR_IA32_MCx_CTL(0)  ... MSR_IA32_MCx_MISC(31): /* 0x400 -> 0x47f */
--- a/xen/arch/x86/xstate.c
+++ b/xen/arch/x86/xstate.c
@@ -29,6 +29,10 @@
 u64 __read_mostly xfeature_mask;
 
 u64 __read_mostly xstate_align;
+
+/* Mask of XSAVE/XRSTOR features supporting XFD. */
+uint64_t __read_mostly xfd_mask;
+
 static unsigned int __read_mostly xstate_features;
 
 uint32_t __read_mostly mxcsr_mask = 0x0000ffbf;
@@ -106,6 +110,11 @@ static int setup_xstate_features(bool bs
             __set_bit(leaf, &xstate_align);
         else
             BUG_ON(!(ecx & XSTATE_ALIGN64) != !test_bit(leaf, &xstate_align));
+
+        if ( bsp && (ecx & XSTATE_XFD) )
+            __set_bit(leaf, &xfd_mask);
+        else
+            BUG_ON(!(ecx & XSTATE_XFD) != !test_bit(leaf, &xfd_mask));
     }
 
     return 0;
--- a/xen/include/asm-x86/msr-index.h
+++ b/xen/include/asm-x86/msr-index.h
@@ -69,6 +69,10 @@
 #define MSR_MCU_OPT_CTRL                    0x00000123
 #define  MCU_OPT_CTRL_RNGDS_MITG_DIS        (_AC(1, ULL) <<  0)
 
+/* Bits within these two MSRs share positions with XCR0. */
+#define MSR_XFD                             0x000001c4
+#define MSR_XFD_ERR                         0x000001c5
+
 #define MSR_RTIT_OUTPUT_BASE                0x00000560
 #define MSR_RTIT_OUTPUT_MASK                0x00000561
 #define MSR_RTIT_CTL                        0x00000570
--- a/xen/include/asm-x86/msr.h
+++ b/xen/include/asm-x86/msr.h
@@ -343,6 +343,12 @@ struct vcpu_msrs
         uint64_t raw;
     } xss;
 
+    /* 0x000001c4 - MSR_XFD */
+    uint64_t xfd;
+
+    /* 0x000001c5 - MSR_XFD_ERR */
+    uint64_t xfd_err;
+
     /*
      * 0xc0000103 - MSR_TSC_AUX
      *
--- a/xen/include/asm-x86/xstate.h
+++ b/xen/include/asm-x86/xstate.h
@@ -39,9 +39,11 @@ extern uint32_t mxcsr_mask;
 #define XSTATE_COMPACTION_ENABLED  (1ULL << 63)
 
 #define XSTATE_ALIGN64 (1U << 1)
+#define XSTATE_XFD     (1U << 2)
 
 extern u64 xfeature_mask;
 extern u64 xstate_align;
+extern uint64_t xfd_mask;
 
 #define xstate_offset(n) (raw_cpuid_policy.xstate.comp[n].offset)
 #define xstate_size(n)   (raw_cpuid_policy.xstate.comp[n].size)
--- a/xen/include/public/arch-x86/cpufeatureset.h
+++ b/xen/include/public/arch-x86/cpufeatureset.h
@@ -191,6 +191,7 @@ XEN_CPUFEATURE(XSAVEOPT,      4*32+ 0) /
 XEN_CPUFEATURE(XSAVEC,        4*32+ 1) /*A  XSAVEC/XRSTORC instructions */
 XEN_CPUFEATURE(XGETBV1,       4*32+ 2) /*A  XGETBV with %ecx=1 */
 XEN_CPUFEATURE(XSAVES,        4*32+ 3) /*S  XSAVES/XRSTORS instructions */
+XEN_CPUFEATURE(XFD,           4*32+ 4) /*   XFD / XFD_ERR MSRs */
 
 /* Intel-defined CPU features, CPUID level 0x00000007:0.ebx, word 5 */
 XEN_CPUFEATURE(FSGSBASE,      5*32+ 0) /*A  {RD,WR}{FS,GS}BASE instructions */
--- a/xen/include/xen/lib/x86/cpuid.h
+++ b/xen/include/xen/lib/x86/cpuid.h
@@ -222,7 +222,7 @@ struct cpuid_policy
         /* Per-component common state.  Valid for i >= 2. */
         struct {
             uint32_t size, offset;
-            bool xss:1, align:1;
+            bool xss:1, align:1, xfd:1;
             uint32_t _res_d;
         } comp[CPUID_GUEST_NR_XSTATE];
     } xstate;
--- a/xen/tools/gen-cpuid.py
+++ b/xen/tools/gen-cpuid.py
@@ -221,7 +221,7 @@ def crunch_numbers(state):
         # are instructions built on top of base XSAVE, while others are new
         # instruction groups which are specified to require XSAVE for state
         # management.
-        XSAVE: [XSAVEOPT, XSAVEC, XGETBV1, XSAVES,
+        XSAVE: [XSAVEOPT, XSAVEC, XGETBV1, XSAVES, XFD,
                 AVX, MPX, PKU, AMX_TILE, LWP],
 
         # AVX is taken to mean hardware support for 256bit registers (which in



^ permalink raw reply	[flat|nested] 40+ messages in thread

* [PATCH v3 14/22] x86emul: introduce X86EMUL_FPU_{tilecfg,tile}
  2021-04-22 14:38 [PATCH v3 00/22] xvmalloc() / x86 xstate area / x86 CPUID / AMX+XFD Jan Beulich
                   ` (12 preceding siblings ...)
  2021-04-22 14:52 ` [PATCH v3 13/22] x86: XFD enabling Jan Beulich
@ 2021-04-22 14:53 ` Jan Beulich
  2021-04-22 14:53 ` [PATCH v3 15/22] x86emul: support TILERELEASE Jan Beulich
                   ` (7 subsequent siblings)
  21 siblings, 0 replies; 40+ messages in thread
From: Jan Beulich @ 2021-04-22 14:53 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, George Dunlap, Wei Liu, Roger Pau Monné

These will be used by AMX insns. They're not sensitive to CR0.TS, but
instead some are sensitive to XFD.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v3: Separate X86EMUL_FPU_tilecfg. Use XFD alternative logic instead of
    checking CR0.TS.
v2: New.

--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -1420,6 +1420,13 @@ static int _get_fpu(
             return X86EMUL_UNHANDLEABLE;
         break;
 
+    case X86EMUL_FPU_tilecfg:
+    case X86EMUL_FPU_tile:
+        ASSERT(mode_64bit());
+        if ( !(xcr0 & X86_XCR0_TILECFG) || !(xcr0 & X86_XCR0_TILEDATA) )
+            return X86EMUL_UNHANDLEABLE;
+        break;
+
     default:
         break;
     }
@@ -1429,6 +1436,7 @@ static int _get_fpu(
     if ( rc == X86EMUL_OKAY )
     {
         unsigned long cr0;
+        uint64_t xcr0_needed = 0;
 
         fail_if(type == X86EMUL_FPU_fpu && !ops->put_fpu);
 
@@ -1453,15 +1461,45 @@ static int _get_fpu(
             /* Should be unreachable if VEX decoding is working correctly. */
             ASSERT((cr0 & X86_CR0_PE) && !(ctxt->regs->eflags & X86_EFLAGS_VM));
         }
-        if ( cr0 & X86_CR0_EM )
+
+        switch ( type )
+        {
+        default:
+            if ( cr0 & X86_CR0_EM )
+            {
+                generate_exception_if(type == X86EMUL_FPU_fpu, EXC_NM);
+                generate_exception_if(type == X86EMUL_FPU_mmx, EXC_UD);
+                generate_exception_if(type == X86EMUL_FPU_xmm, EXC_UD);
+            }
+            generate_exception_if((cr0 & X86_CR0_TS) &&
+                                  (type != X86EMUL_FPU_wait ||
+                                   (cr0 & X86_CR0_MP)),
+                                  EXC_NM);
+            break;
+
+        case X86EMUL_FPU_tilecfg:
+            break;
+
+        case X86EMUL_FPU_tile:
+            xcr0_needed = X86_XCR0_TILEDATA;
+            break;
+        }
+
+        if ( xcr0_needed && ctxt->cpuid->xstate.xfd )
         {
-            generate_exception_if(type == X86EMUL_FPU_fpu, EXC_NM);
-            generate_exception_if(type == X86EMUL_FPU_mmx, EXC_UD);
-            generate_exception_if(type == X86EMUL_FPU_xmm, EXC_UD);
+            uint64_t xfd;
+
+            fail_if(!ops->read_msr);
+            rc = ops->read_msr(MSR_XFD, &xfd, ctxt);
+            if ( rc == X86EMUL_OKAY && (xfd & xcr0_needed) )
+            {
+                fail_if(!ops->write_msr);
+                rc = ops->read_msr(MSR_XFD_ERR, &xfd, ctxt);
+                if ( rc == X86EMUL_OKAY )
+                    rc = ops->write_msr(MSR_XFD_ERR, xfd | xcr0_needed, ctxt);
+                generate_exception_if(rc == X86EMUL_OKAY, EXC_NM);
+            }
         }
-        generate_exception_if((cr0 & X86_CR0_TS) &&
-                              (type != X86EMUL_FPU_wait || (cr0 & X86_CR0_MP)),
-                              EXC_NM);
     }
 
  done:
--- a/xen/arch/x86/x86_emulate/x86_emulate.h
+++ b/xen/arch/x86/x86_emulate/x86_emulate.h
@@ -172,6 +172,8 @@ enum x86_emulate_fpu_type {
     X86EMUL_FPU_ymm, /* AVX/XOP instruction set (%ymm0-%ymm7/15) */
     X86EMUL_FPU_opmask, /* AVX512 opmask instruction set (%k0-%k7) */
     X86EMUL_FPU_zmm, /* AVX512 instruction set (%zmm0-%zmm7/31) */
+    X86EMUL_FPU_tilecfg, /* AMX configuration (tilecfg) */
+    X86EMUL_FPU_tile, /* AMX instruction set (%tmm0-%tmmN) */
     /* This sentinel will never be passed to ->get_fpu(). */
     X86EMUL_FPU_none
 };



^ permalink raw reply	[flat|nested] 40+ messages in thread

* [PATCH v3 15/22] x86emul: support TILERELEASE
  2021-04-22 14:38 [PATCH v3 00/22] xvmalloc() / x86 xstate area / x86 CPUID / AMX+XFD Jan Beulich
                   ` (13 preceding siblings ...)
  2021-04-22 14:53 ` [PATCH v3 14/22] x86emul: introduce X86EMUL_FPU_{tilecfg,tile} Jan Beulich
@ 2021-04-22 14:53 ` Jan Beulich
  2021-04-22 14:53 ` [PATCH v3 16/22] x86: introduce struct for TILECFG register Jan Beulich
                   ` (6 subsequent siblings)
  21 siblings, 0 replies; 40+ messages in thread
From: Jan Beulich @ 2021-04-22 14:53 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, George Dunlap, Wei Liu, Roger Pau Monné

This is relatively straightforward, and hence best suited to introduce a
few other general pieces.

Testing of this will be added once a sensible test can be put together,
i.e. when support for other insns is also there.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v2: New.

--- a/tools/tests/x86_emulator/predicates.c
+++ b/tools/tests/x86_emulator/predicates.c
@@ -1335,6 +1335,7 @@ static const struct vex {
     { { 0x45 }, 2, T, R, pfx_66, Wn, Ln }, /* vpsrlv{d,q} */
     { { 0x46 }, 2, T, R, pfx_66, W0, Ln }, /* vpsravd */
     { { 0x47 }, 2, T, R, pfx_66, Wn, Ln }, /* vpsllv{d,q} */
+    { { 0x49, 0xc0 }, 2, F, N, pfx_no, W0, L0 }, /* tilerelease */
     { { 0x50 }, 2, T, R, pfx_66, W0, Ln }, /* vpdpbusd */
     { { 0x51 }, 2, T, R, pfx_66, W0, Ln }, /* vpdpbusds */
     { { 0x52 }, 2, T, R, pfx_66, W0, Ln }, /* vpdpwssd */
--- a/tools/tests/x86_emulator/x86-emulate.c
+++ b/tools/tests/x86_emulator/x86-emulate.c
@@ -247,6 +247,10 @@ int emul_test_get_fpu(
             break;
     default:
         return X86EMUL_UNHANDLEABLE;
+
+    case X86EMUL_FPU_tilecfg:
+    case X86EMUL_FPU_tile:
+        return cpu_has_amx_tile ? X86EMUL_OKAY : X86EMUL_UNHANDLEABLE;
     }
     return X86EMUL_OKAY;
 }
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -475,6 +475,7 @@ static const struct ext0f38_table {
     [0x43] = { .simd_size = simd_scalar_vexw, .d8s = d8s_dq },
     [0x44] = { .simd_size = simd_packed_int, .two_op = 1, .d8s = d8s_vl },
     [0x45 ... 0x47] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
+    [0x49] = { .simd_size = simd_other, .two_op = 1 },
     [0x4c] = { .simd_size = simd_packed_fp, .two_op = 1, .d8s = d8s_vl },
     [0x4d] = { .simd_size = simd_scalar_vexw, .d8s = d8s_dq },
     [0x4e] = { .simd_size = simd_packed_fp, .two_op = 1, .d8s = d8s_vl },
@@ -2046,6 +2047,7 @@ amd_like(const struct x86_emulate_ctxt *
 #define vcpu_has_avx512_4fmaps() (ctxt->cpuid->feat.avx512_4fmaps)
 #define vcpu_has_avx512_vp2intersect() (ctxt->cpuid->feat.avx512_vp2intersect)
 #define vcpu_has_serialize()   (ctxt->cpuid->feat.serialize)
+#define vcpu_has_amx_tile()    (ctxt->cpuid->feat.amx_tile)
 #define vcpu_has_avx_vnni()    (ctxt->cpuid->feat.avx_vnni)
 #define vcpu_has_avx512_bf16() (ctxt->cpuid->feat.avx512_bf16)
 
@@ -9500,6 +9502,24 @@ x86_emulate(
         generate_exception_if(vex.l, EXC_UD);
         goto simd_0f_avx;
 
+    case X86EMUL_OPC_VEX(0x0f38, 0x49):
+        generate_exception_if(!mode_64bit() || vex.l || vex.w, EXC_UD);
+        if ( ea.type == OP_REG )
+        {
+            switch ( modrm )
+            {
+            case 0xc0: /* tilerelease */
+                host_and_vcpu_must_have(amx_tile);
+                get_fpu(X86EMUL_FPU_tilecfg);
+                op_bytes = 1; /* fake */
+                goto simd_0f_common;
+
+            default:
+                goto unrecognized_insn;
+            }
+        }
+        goto unimplemented_insn;
+
     case X86EMUL_OPC_VEX_66(0x0f38, 0x50): /* vpdpbusd [xy]mm/mem,[xy]mm,[xy]mm */
     case X86EMUL_OPC_VEX_66(0x0f38, 0x51): /* vpdpbusds [xy]mm/mem,[xy]mm,[xy]mm */
     case X86EMUL_OPC_VEX_66(0x0f38, 0x52): /* vpdpwssd [xy]mm/mem,[xy]mm,[xy]mm */
--- a/xen/include/asm-x86/cpufeature.h
+++ b/xen/include/asm-x86/cpufeature.h
@@ -133,6 +133,7 @@
 #define cpu_has_avx512_vp2intersect boot_cpu_has(X86_FEATURE_AVX512_VP2INTERSECT)
 #define cpu_has_tsx_force_abort boot_cpu_has(X86_FEATURE_TSX_FORCE_ABORT)
 #define cpu_has_serialize       boot_cpu_has(X86_FEATURE_SERIALIZE)
+#define cpu_has_amx_tile        boot_cpu_has(X86_FEATURE_AMX_TILE)
 
 /* CPUID level 0x00000007:1.eax */
 #define cpu_has_avx_vnni        boot_cpu_has(X86_FEATURE_AVX_VNNI)



^ permalink raw reply	[flat|nested] 40+ messages in thread

* [PATCH v3 16/22] x86: introduce struct for TILECFG register
  2021-04-22 14:38 [PATCH v3 00/22] xvmalloc() / x86 xstate area / x86 CPUID / AMX+XFD Jan Beulich
                   ` (14 preceding siblings ...)
  2021-04-22 14:53 ` [PATCH v3 15/22] x86emul: support TILERELEASE Jan Beulich
@ 2021-04-22 14:53 ` Jan Beulich
  2021-04-22 14:54 ` [PATCH v3 17/22] x86emul: support {LD,ST}TILECFG Jan Beulich
                   ` (5 subsequent siblings)
  21 siblings, 0 replies; 40+ messages in thread
From: Jan Beulich @ 2021-04-22 14:53 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, George Dunlap, Wei Liu, Roger Pau Monné

Introduce a new x86-types.h to hold various architectural type
definitions, the TILECFG register layout being the first. Arrange for
the insn emulator to include this header.
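
As a usage illustration (sketch only, along the lines of the test
harness data added later in the series), a minimal configuration
describing two 4-row by 8-byte tiles in palette 1 would be:

    struct x86_tilecfg cfg = {
        .palette = 1,
        .colsb = { 8, 8 },   /* bytes per row of %tmm0 and %tmm1 */
        .rows  = { 4, 4 },   /* rows of %tmm0 and %tmm1 */
        /* start_row, res[], and all remaining entries stay zero */
    };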

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v3: New.

--- a/tools/fuzz/x86_instruction_emulator/Makefile
+++ b/tools/fuzz/x86_instruction_emulator/Makefile
@@ -26,7 +26,7 @@ GCOV_FLAGS := --coverage
 	$(CC) -c $(CFLAGS) $(GCOV_FLAGS) $< -o $@
 
 x86.h := $(addprefix $(XEN_ROOT)/tools/include/xen/asm/,\
-                     x86-vendors.h x86-defns.h msr-index.h) \
+                     x86-vendors.h x86-types.h x86-defns.h msr-index.h) \
          $(addprefix $(XEN_ROOT)/tools/include/xen/lib/x86/, \
                      cpuid.h cpuid-autogen.h)
 x86_emulate.h := x86-emulate.h x86_emulate/x86_emulate.h $(x86.h)
--- a/tools/tests/x86_emulator/Makefile
+++ b/tools/tests/x86_emulator/Makefile
@@ -284,7 +284,7 @@ $(call cc-option-add,HOSTCFLAGS-x86_64,H
 HOSTCFLAGS += $(CFLAGS_xeninclude) -I. $(HOSTCFLAGS-$(XEN_COMPILE_ARCH))
 
 x86.h := $(addprefix $(XEN_ROOT)/tools/include/xen/asm/,\
-                     x86-vendors.h x86-defns.h msr-index.h) \
+                     x86-vendors.h x86-types.h x86-defns.h msr-index.h) \
          $(addprefix $(XEN_ROOT)/tools/include/xen/lib/x86/, \
                      cpuid.h cpuid-autogen.h)
 x86_emulate.h := x86-emulate.h x86_emulate/x86_emulate.h $(x86.h)
--- a/tools/tests/x86_emulator/x86-emulate.h
+++ b/tools/tests/x86_emulator/x86-emulate.h
@@ -37,6 +37,7 @@
 
 #include <xen/asm/msr-index.h>
 #include <xen/asm/x86-defns.h>
+#include <xen/asm/x86-types.h>
 #include <xen/asm/x86-vendors.h>
 
 #include <xen-tools/libs.h>
--- a/xen/arch/x86/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate.c
@@ -17,6 +17,7 @@
 #include <asm/xstate.h>
 #include <asm/amd.h> /* cpu_has_amd_erratum() */
 #include <asm/debugreg.h>
+#include <asm/x86-types.h>
 
 /* Avoid namespace pollution. */
 #undef cmpxchg
--- /dev/null
+++ b/xen/include/asm-x86/x86-types.h
@@ -0,0 +1,14 @@
+#ifndef __XEN_X86_TYPES_H__
+#define __XEN_X86_TYPES_H__
+
+/*
+ * TILECFG register
+ */
+struct x86_tilecfg {
+    uint8_t palette, start_row;
+    uint8_t res[14];
+    uint16_t colsb[16];
+    uint8_t rows[16];
+};
+
+#endif	/* __XEN_X86_TYPES_H__ */



^ permalink raw reply	[flat|nested] 40+ messages in thread

* [PATCH v3 17/22] x86emul: support {LD,ST}TILECFG
  2021-04-22 14:38 [PATCH v3 00/22] xvmalloc() / x86 xstate area / x86 CPUID / AMX+XFD Jan Beulich
                   ` (15 preceding siblings ...)
  2021-04-22 14:53 ` [PATCH v3 16/22] x86: introduce struct for TILECFG register Jan Beulich
@ 2021-04-22 14:54 ` Jan Beulich
  2021-04-22 14:55 ` [PATCH v3 18/22] x86emul: support TILEZERO Jan Beulich
                   ` (4 subsequent siblings)
  21 siblings, 0 replies; 40+ messages in thread
From: Jan Beulich @ 2021-04-22 14:54 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, George Dunlap, Wei Liu, Roger Pau Monné

While version 043 of the ISA extensions document also specifies
xcr0_supports_palette() returning false as one of the #GP(0) reasons
for LDTILECFG, the earlier #UD / #GP conditions appear to make this
case unreachable.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v3: Rebase over struct x86_tilecfg introduction.
v2: New.
---
SDE: -spr

--- a/tools/tests/x86_emulator/predicates.c
+++ b/tools/tests/x86_emulator/predicates.c
@@ -1335,6 +1335,8 @@ static const struct vex {
     { { 0x45 }, 2, T, R, pfx_66, Wn, Ln }, /* vpsrlv{d,q} */
     { { 0x46 }, 2, T, R, pfx_66, W0, Ln }, /* vpsravd */
     { { 0x47 }, 2, T, R, pfx_66, Wn, Ln }, /* vpsllv{d,q} */
+    { { 0x49, 0x00 }, 2, F, R, pfx_no, W0, L0 }, /* ldtilecfg */
+    { { 0x49, 0x00 }, 2, F, W, pfx_66, W0, L0 }, /* sttilecfg */
     { { 0x49, 0xc0 }, 2, F, N, pfx_no, W0, L0 }, /* tilerelease */
     { { 0x50 }, 2, T, R, pfx_66, W0, Ln }, /* vpdpbusd */
     { { 0x51 }, 2, T, R, pfx_66, W0, Ln }, /* vpdpbusds */
--- a/tools/tests/x86_emulator/test_x86_emulator.c
+++ b/tools/tests/x86_emulator/test_x86_emulator.c
@@ -898,6 +898,11 @@ int main(int argc, char **argv)
     int rc;
 #ifdef __x86_64__
     unsigned int vendor_native;
+    static const struct x86_tilecfg tilecfg = {
+        .palette = 1,
+        .colsb = { 2, 4, 5, 3 },
+        .rows = { 2, 4, 3, 5 },
+    };
 #else
     unsigned int bcdres_native, bcdres_emul;
 #endif
@@ -4463,6 +4468,74 @@ int main(int argc, char **argv)
         printf("skipped\n");
 
 #ifdef __x86_64__
+    printf("%-40s", "Testing tilerelease;sttilecfg 4(%rcx)...");
+    if ( stack_exec && cpu_has_amx_tile )
+    {
+        decl_insn(tilerelease);
+
+        asm volatile ( put_insn(tilerelease,
+                                /* tilerelease */
+                                ".byte 0xC4, 0xE2, 0x78, 0x49, 0xC0;"
+                                /* sttilecfg 4(%0) */
+                                ".byte 0xC4, 0xE2, 0x79, 0x49, 0x41, 0x04")
+                                :: "c" (NULL) );
+
+        memset(res, ~0, 72);
+        set_insn(tilerelease);
+        regs.ecx = (unsigned long)res;
+        rc = x86_emulate(&ctxt, &emulops);
+        if ( rc == X86EMUL_OKAY )
+            rc = x86_emulate(&ctxt, &emulops);
+        if ( rc != X86EMUL_OKAY || !check_eip(tilerelease) ||
+             ~res[0] || ~res[17] || memchr_inv(res + 1, 0, 64) )
+            goto fail;
+        printf("okay\n");
+    }
+    else
+        printf("skipped\n");
+
+    printf("%-40s", "Testing ldtilecfg (%rdx)...");
+    if ( stack_exec && cpu_has_amx_tile )
+    {
+        decl_insn(ldtilecfg);
+
+        asm volatile ( put_insn(ldtilecfg,
+                                /* ldtilecfg (%0) */
+                                ".byte 0xC4, 0xE2, 0x78, 0x49, 0x02")
+                                :: "d" (NULL) );
+
+        set_insn(ldtilecfg);
+        regs.edx = (unsigned long)&tilecfg;
+        rc = x86_emulate(&ctxt, &emulops);
+        if ( rc != X86EMUL_OKAY || !check_eip(ldtilecfg) )
+            goto fail;
+        printf("pending\n");
+    }
+    else
+        printf("skipped\n");
+
+    printf("%-40s", "Testing sttilecfg -4(%rcx)...");
+    if ( stack_exec && cpu_has_amx_tile )
+    {
+        decl_insn(sttilecfg);
+
+        asm volatile ( put_insn(sttilecfg,
+                                /* sttilecfg -4(%0) */
+                                ".byte 0xC4, 0xE2, 0x79, 0x49, 0x41, 0xFC")
+                                :: "c" (NULL) );
+
+        memset(res, ~0, 72);
+        set_insn(sttilecfg);
+        regs.ecx = (unsigned long)(res + 2);
+        rc = x86_emulate(&ctxt, &emulops);
+        if ( rc != X86EMUL_OKAY || !check_eip(sttilecfg) ||
+             ~res[0] || ~res[17] || memcmp(res + 1, &tilecfg, 64) )
+            goto fail;
+        printf("okay\n");
+    }
+    else
+        printf("skipped\n");
+
     printf("%-40s", "Testing vzeroupper (compat)...");
     if ( cpu_has_avx )
     {
--- a/tools/tests/x86_emulator/x86-emulate.h
+++ b/tools/tests/x86_emulator/x86-emulate.h
@@ -68,6 +68,17 @@
 
 #define is_canonical_address(x) (((int64_t)(x) >> 47) == ((int64_t)(x) >> 63))
 
+static inline void *memchr_inv(const void *s, int c, size_t n)
+{
+    const unsigned char *p = s;
+
+    while ( n-- )
+        if ( (unsigned char)c != *p++ )
+            return (void *)(p - 1);
+
+    return NULL;
+}
+
 extern uint32_t mxcsr_mask;
 extern struct cpuid_policy cp;
 
@@ -171,6 +182,8 @@ static inline bool xcr0_mask(uint64_t ma
 #define cpu_has_avx512_4fmaps (cp.feat.avx512_4fmaps && xcr0_mask(0xe6))
 #define cpu_has_avx512_vp2intersect (cp.feat.avx512_vp2intersect && xcr0_mask(0xe6))
 #define cpu_has_serialize  cp.feat.serialize
+#define cpu_has_amx_tile   (cp.feat.amx_tile && \
+                            xcr0_mask(X86_XCR0_TILECFG | X86_XCR0_TILEDATA))
 #define cpu_has_avx_vnni   (cp.feat.avx_vnni && xcr0_mask(6))
 #define cpu_has_avx512_bf16 (cp.feat.avx512_bf16 && xcr0_mask(0xe6))
 
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -957,6 +957,7 @@ typedef union {
     uint64_t __attribute__ ((aligned(16))) xmm[2];
     uint64_t __attribute__ ((aligned(32))) ymm[4];
     uint64_t __attribute__ ((aligned(64))) zmm[8];
+    struct x86_tilecfg tilecfg;
     uint32_t data32[16];
 } mmval_t;
 
@@ -2880,6 +2881,10 @@ x86_decode_0f38(
         state->simd_size = simd_scalar_vexw;
         break;
 
+    case X86EMUL_OPC_VEX_66(0, 0x49): /* sttilecfg */
+        state->desc = DstMem | SrcImplicit | Mov;
+        break;
+
     case X86EMUL_OPC_EVEX_66(0, 0x7a): /* vpbroadcastb */
     case X86EMUL_OPC_EVEX_66(0, 0x7b): /* vpbroadcastw */
     case X86EMUL_OPC_EVEX_66(0, 0x7c): /* vpbroadcast{d,q} */
@@ -9518,7 +9523,66 @@ x86_emulate(
                 goto unrecognized_insn;
             }
         }
-        goto unimplemented_insn;
+
+        switch ( modrm_reg & 7 )
+        {
+        case 0: /* ldtilecfg mem */
+            generate_exception_if(vex.reg != 0xf, EXC_UD);
+            host_and_vcpu_must_have(amx_tile);
+            get_fpu(X86EMUL_FPU_tilecfg);
+            rc = ops->read(ea.mem.seg, ea.mem.off, mmvalp, 64, ctxt);
+            if ( rc != X86EMUL_OKAY )
+                goto done;
+            generate_exception_if((mmvalp->tilecfg.palette >
+                                   ctxt->cpuid->tile.max_palette),
+                                  EXC_GP, 0);
+            if ( mmvalp->tilecfg.palette )
+            {
+                const typeof(*ctxt->cpuid->tile.palette) *palette;
+
+                generate_exception_if(memchr_inv(mmvalp->tilecfg.res, 0,
+                                                 sizeof(mmvalp->tilecfg.res)),
+                                      EXC_GP, 0);
+
+                /*
+                 * Parameters for valid registers must be within bounds, or
+                 * both be zero at the same time.
+                 */
+                palette = &ctxt->cpuid->tile.palette[mmvalp->tilecfg.palette];
+                for ( i = 0; i < palette->num_regs; ++i )
+                    generate_exception_if(((mmvalp->tilecfg.colsb[i] >
+                                            palette->bytes_per_row) ||
+                                           (mmvalp->tilecfg.rows[i] >
+                                            palette->max_rows) ||
+                                           (!mmvalp->tilecfg.colsb[i] !=
+                                            !mmvalp->tilecfg.rows[i])),
+                                          EXC_GP, 0);
+
+                /* All remaining entries must be zero. */
+                for ( ; i < 16; ++i )
+                    generate_exception_if((mmvalp->tilecfg.colsb[i] ||
+                                           mmvalp->tilecfg.rows[i]),
+                                          EXC_GP, 0);
+            }
+            op_bytes = 64;
+            goto simd_0f_common;
+        }
+        goto unrecognized_insn;
+
+    case X86EMUL_OPC_VEX_66(0x0f38, 0x49):
+        generate_exception_if(!mode_64bit() || vex.l || vex.w, EXC_UD);
+        if ( ea.type == OP_REG )
+            goto unrecognized_insn;
+
+        switch ( modrm_reg & 7 )
+        {
+        case 0: /* sttilecfg mem */
+            host_and_vcpu_must_have(amx_tile);
+            get_fpu(X86EMUL_FPU_tilecfg);
+            op_bytes = 64;
+            goto simd_0f_common;
+        }
+        goto unrecognized_insn;
 
     case X86EMUL_OPC_VEX_66(0x0f38, 0x50): /* vpdpbusd [xy]mm/mem,[xy]mm,[xy]mm */
     case X86EMUL_OPC_VEX_66(0x0f38, 0x51): /* vpdpbusds [xy]mm/mem,[xy]mm,[xy]mm */



^ permalink raw reply	[flat|nested] 40+ messages in thread

* [PATCH v3 18/22] x86emul: support TILEZERO
  2021-04-22 14:38 [PATCH v3 00/22] xvmalloc() / x86 xstate area / x86 CPUID / AMX+XFD Jan Beulich
                   ` (16 preceding siblings ...)
  2021-04-22 14:54 ` [PATCH v3 17/22] x86emul: support {LD,ST}TILECFG Jan Beulich
@ 2021-04-22 14:55 ` Jan Beulich
  2021-04-22 14:55 ` [PATCH v3 19/22] x86emul: support TILELOADD{,T1} and TILESTORE Jan Beulich
                   ` (3 subsequent siblings)
  21 siblings, 0 replies; 40+ messages in thread
From: Jan Beulich @ 2021-04-22 14:55 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, George Dunlap, Wei Liu, Roger Pau Monné

This is relatively straightforward, and hence best suited to introduce
a few other more widely used pieces.

Testing of this will be added once a sensible test can be put together,
i.e. when support for at least TILELOADD (to allow loading non-zero
values in the first place) is also there.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v3: New.

--- a/tools/tests/x86_emulator/predicates.c
+++ b/tools/tests/x86_emulator/predicates.c
@@ -1338,6 +1338,7 @@ static const struct vex {
     { { 0x49, 0x00 }, 2, F, R, pfx_no, W0, L0 }, /* ldtilecfg */
     { { 0x49, 0x00 }, 2, F, W, pfx_66, W0, L0 }, /* sttilecfg */
     { { 0x49, 0xc0 }, 2, F, N, pfx_no, W0, L0 }, /* tilerelease */
+    { { 0x49, 0xc0 }, 2, F, N, pfx_f2, W0, L0 }, /* tilezero */
     { { 0x50 }, 2, T, R, pfx_66, W0, Ln }, /* vpdpbusd */
     { { 0x51 }, 2, T, R, pfx_66, W0, Ln }, /* vpdpbusds */
     { { 0x52 }, 2, T, R, pfx_66, W0, Ln }, /* vpdpwssd */
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -3725,6 +3725,31 @@ x86_decode(
 #undef insn_fetch_bytes
 #undef insn_fetch_type
 
+#ifndef X86EMUL_NO_SIMD
+
+static void sttilecfg(struct x86_tilecfg *tilecfg)
+{
+    /* sttilecfg (%rdi) */
+    asm volatile ( ".byte 0xc4, 0xe2, 0x79, 0x49, 0x07"
+                   : "=m" (*tilecfg) : "D" (tilecfg) );
+}
+
+static bool tiles_configured(const struct x86_tilecfg *tilecfg)
+{
+    return tilecfg->palette;
+}
+
+static bool tile_valid(unsigned int tile, const struct x86_tilecfg *tilecfg)
+{
+    /*
+     * Considering the checking LDTILECFG does, checking either would in
+     * principle be sufficient.
+     */
+    return tilecfg->colsb[tile] && tilecfg->rows[tile];
+}
+
+#endif /* X86EMUL_NO_SIMD */
+
 /* Undo DEBUG wrapper. */
 #undef x86_emulate
 
@@ -9584,6 +9609,29 @@ x86_emulate(
         }
         goto unrecognized_insn;
 
+    case X86EMUL_OPC_VEX_F2(0x0f38, 0x49):
+        generate_exception_if(!mode_64bit() || vex.l || vex.w, EXC_UD);
+        if ( ea.type == OP_REG )
+        {
+            switch ( modrm_rm & 7 )
+            {
+            case 0: /* tilezero */
+                host_and_vcpu_must_have(amx_tile);
+                get_fpu(X86EMUL_FPU_tile);
+                sttilecfg(&mmvalp->tilecfg);
+                generate_exception_if(!tiles_configured(&mmvalp->tilecfg),
+                                      EXC_UD);
+                generate_exception_if(!tile_valid(modrm_reg, &mmvalp->tilecfg),
+                                      EXC_UD);
+                op_bytes = 1; /* fake */
+                goto simd_0f_common;
+
+            default:
+                goto unrecognized_insn;
+            }
+        }
+        goto unrecognized_insn;
+
     case X86EMUL_OPC_VEX_66(0x0f38, 0x50): /* vpdpbusd [xy]mm/mem,[xy]mm,[xy]mm */
     case X86EMUL_OPC_VEX_66(0x0f38, 0x51): /* vpdpbusds [xy]mm/mem,[xy]mm,[xy]mm */
     case X86EMUL_OPC_VEX_66(0x0f38, 0x52): /* vpdpwssd [xy]mm/mem,[xy]mm,[xy]mm */



^ permalink raw reply	[flat|nested] 40+ messages in thread

* [PATCH v3 19/22] x86emul: support TILELOADD{,T1} and TILESTORE
  2021-04-22 14:38 [PATCH v3 00/22] xvmalloc() / x86 xstate area / x86 CPUID / AMX+XFD Jan Beulich
                   ` (17 preceding siblings ...)
  2021-04-22 14:55 ` [PATCH v3 18/22] x86emul: support TILEZERO Jan Beulich
@ 2021-04-22 14:55 ` Jan Beulich
  2021-04-22 15:06   ` Jan Beulich
  2021-04-22 14:56 ` [PATCH v3 20/22] x86emul: support tile multiplication insns Jan Beulich
                   ` (2 subsequent siblings)
  21 siblings, 1 reply; 40+ messages in thread
From: Jan Beulich @ 2021-04-22 14:55 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, George Dunlap, Wei Liu, Roger Pau Monné

In order to be flexible about future growth of tile dimensions, use a
separately allocated scratch buffer to read/write individual rows of a
tile.
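
Conceptually the per-row handling amounts to the loop below
(illustrative pseudo-C only, not the actual emulator code):

    /* tileloadd: one ops->read() per configured row of the tile */
    for ( row = tilecfg.start_row; row < rows; ++row )
    {
        rc = ops->read(ea.mem.seg, base + row * stride, row_buf, colsb, ctxt);
        if ( rc != X86EMUL_OKAY )
            break; /* start_row tracks progress, allowing a restart */
        /* copy row_buf into row 'row' of the destination tile register */
    }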

Note that the separate write_tilecfg() function is needed because
LDTILECFG wouldn't serve the purpose: It clears various state, unlike
XRSTOR / XRSTORS. To keep things simple, the test harness variant of it
leverages the state save/restore around library calls. To make sure the
actual data gets restored (and not init state put in place), an extra
override for the XSTATE_BV field is introduced.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v3: New.
---
TBD: ISA extensions document 043 doesn't clarify either way whether
     duplicate stores to the same (linear) memory ranges may get
     squashed by TILESTORED just like is documented for AVX-512
     scatters. For now similar behavior is assumed (or else
     hvm/emulate.c's respective checking would need altering, or we'd
     need to exit back to guest context after every individual row,
     albeit that would mean allowing interrupts to occur in the middle
     of an insn); an inquiry on the public forum [1] was left
     unanswered so far.

[1] https://community.intel.com/t5/Intel-ISA-Extensions/TILESTORED-to-overlapping-addresses/td-p/1226953

--- a/tools/tests/x86_emulator/predicates.c
+++ b/tools/tests/x86_emulator/predicates.c
@@ -1339,6 +1339,9 @@ static const struct vex {
     { { 0x49, 0x00 }, 2, F, W, pfx_66, W0, L0 }, /* sttilecfg */
     { { 0x49, 0xc0 }, 2, F, N, pfx_no, W0, L0 }, /* tilerelease */
     { { 0x49, 0xc0 }, 2, F, N, pfx_f2, W0, L0 }, /* tilezero */
+    { { 0x4b, VSIB(0) }, 3, F, R, pfx_66, W0, L0 }, /* tileloaddt1 */
+    { { 0x4b, VSIB(0) }, 3, F, W, pfx_f3, W0, L0 }, /* tilestored */
+    { { 0x4b, VSIB(0) }, 3, F, R, pfx_f2, W0, L0 }, /* tileloadd */
     { { 0x50 }, 2, T, R, pfx_66, W0, Ln }, /* vpdpbusd */
     { { 0x51 }, 2, T, R, pfx_66, W0, Ln }, /* vpdpbusds */
     { { 0x52 }, 2, T, R, pfx_66, W0, Ln }, /* vpdpwssd */
--- a/tools/tests/x86_emulator/x86-emulate.c
+++ b/tools/tests/x86_emulator/x86-emulate.c
@@ -35,6 +35,8 @@ struct cpuid_policy cp;
 
 static char fpu_save_area[0x4000] __attribute__((__aligned__((64))));
 static bool use_xsave;
+/* This will get OR-ed into xstate_bv prior to restoring state. */
+static uint64_t xstate_bv_or_mask;
 
 /*
  * Re-use the area above also as scratch space for the emulator itself.
@@ -57,13 +59,19 @@ void emul_restore_fpu_state(void)
 {
     /* Older gcc can't deal with "m" array inputs; make them outputs instead. */
     if ( use_xsave )
+    {
+        *(uint64_t *)&fpu_save_area[0x200] |= xstate_bv_or_mask;
+        xstate_bv_or_mask = 0;
         asm volatile ( "xrstor %[ptr]"
                        : [ptr] "+m" (fpu_save_area)
                        : "a" (~0ul), "d" (~0ul) );
+    }
     else
         asm volatile ( "fxrstor %0" : "+m" (fpu_save_area) );
 }
 
+static void *tile_row;
+
 bool emul_test_init(void)
 {
     union {
@@ -121,6 +129,19 @@ bool emul_test_init(void)
     if ( fxs->mxcsr_mask )
         mxcsr_mask = fxs->mxcsr_mask;
 
+    if ( cpu_has_amx_tile )
+    {
+        unsigned int i, max_bytes = 0;
+
+        for ( i = 1; i <= cp.tile.max_palette; ++i )
+            if ( cp.tile.palette[i].bytes_per_row > max_bytes )
+                max_bytes = cp.tile.palette[i].bytes_per_row;
+
+        if ( !cp.xstate.comp[X86_XCR0_TILECFG_POS].offset ||
+             !max_bytes || !(tile_row = malloc(max_bytes)) )
+            cp.feat.amx_tile = false;
+    }
+
     /*
      * Mark the entire stack executable so that the stub executions
      * don't fault
@@ -263,4 +284,22 @@ void emul_test_put_fpu(
     /* TBD */
 }
 
+static void *get_tile_row_buf(void)
+{
+    return tile_row;
+}
+
+WRAPPER(memcpy);
+
+static void write_tilecfg(const struct x86_tilecfg *tilecfg)
+{
+    /*
+     * Leverage the fact that the wrapper (saves and) restores all extended
+     * state around the actual library call.
+     */
+    xstate_bv_or_mask = X86_XCR0_TILECFG;
+    emul_memcpy(fpu_save_area + cp.xstate.comp[X86_XCR0_TILECFG_POS].offset,
+                tilecfg, sizeof(*tilecfg));
+}
+
 #include "x86_emulate/x86_emulate.c"
--- a/tools/tests/x86_emulator/x86-emulate.h
+++ b/tools/tests/x86_emulator/x86-emulate.h
@@ -98,8 +98,9 @@ void emul_restore_fpu_state(void);
 # if 0 /* This only works for explicit calls, not for compiler generated ones. */
 #  define WRAP(x) typeof(x) x asm("emul_" #x)
 # else
-# define WRAP(x) asm(".equ " #x ", emul_" #x)
+#  define WRAP(x) asm(".equ " #x ", emul_" #x)
 # endif
+# define WRAPPER(x) typeof(x) emul_##x
 #endif
 
 WRAP(fwrite);
--- a/xen/arch/x86/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate.c
@@ -9,6 +9,7 @@
  *    Keir Fraser <keir@xen.org>
  */
 
+#include <xen/cpu.h>
 #include <xen/domain_page.h>
 #include <xen/err.h>
 #include <xen/event.h>
@@ -52,6 +53,71 @@
 # define X86EMUL_NO_SIMD
 #endif
 
+#ifndef X86EMUL_NO_SIMD
+
+static DEFINE_PER_CPU_READ_MOSTLY(void *, tile_row);
+static unsigned int __read_mostly tile_row_max_bytes;
+
+static int amx_cpu_init(struct notifier_block *nfb,
+                        unsigned long action, void *hcpu)
+{
+    int rc = 0;
+    void **prow = &per_cpu(tile_row, (unsigned long)hcpu);
+
+    switch ( action )
+    {
+    case CPU_UP_PREPARE:
+        *prow = xmalloc_array(uint8_t, tile_row_max_bytes);
+        if ( !*prow )
+            rc = -ENOMEM;
+        break;
+
+    case CPU_UP_CANCELED:
+    case CPU_DEAD:
+        XFREE(*prow);
+        break;
+    }
+
+    return !rc ? NOTIFY_DONE : notifier_from_errno(rc);
+}
+
+static struct notifier_block amx_cpu_nfb = {
+    .notifier_call = amx_cpu_init,
+};
+
+static int __init amx_init(void)
+{
+    const struct cpuid_policy *p = &host_cpuid_policy;
+    unsigned int i;
+
+    if ( !cpu_has_amx_tile )
+        return 0;
+
+    for ( i = 1; i <= p->tile.max_palette; ++i )
+        if ( p->tile.palette[i].bytes_per_row > tile_row_max_bytes )
+            tile_row_max_bytes = p->tile.palette[i].bytes_per_row;
+
+    if ( !tile_row_max_bytes )
+    {
+        setup_clear_cpu_cap(X86_FEATURE_AMX_TILE);
+        return 0;
+    }
+
+    amx_cpu_init(&amx_cpu_nfb, CPU_UP_PREPARE,
+                 (void *)(unsigned long)smp_processor_id());
+    register_cpu_notifier(&amx_cpu_nfb);
+
+    return 0;
+}
+presmp_initcall(amx_init);
+
+static void *get_tile_row_buf(void)
+{
+    return this_cpu(tile_row);
+}
+
+#endif /* X86EMUL_NO_SIMD */
+
 #include "x86_emulate/x86_emulate.c"
 
 int x86emul_read_xcr(unsigned int reg, uint64_t *val,
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -476,6 +476,7 @@ static const struct ext0f38_table {
     [0x44] = { .simd_size = simd_packed_int, .two_op = 1, .d8s = d8s_vl },
     [0x45 ... 0x47] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x49] = { .simd_size = simd_other, .two_op = 1 },
+    [0x4b] = { .simd_size = simd_other, .two_op = 1, .vsib = 1 },
     [0x4c] = { .simd_size = simd_packed_fp, .two_op = 1, .d8s = d8s_vl },
     [0x4d] = { .simd_size = simd_scalar_vexw, .d8s = d8s_dq },
     [0x4e] = { .simd_size = simd_packed_fp, .two_op = 1, .d8s = d8s_vl },
@@ -2882,6 +2883,7 @@ x86_decode_0f38(
         break;
 
     case X86EMUL_OPC_VEX_66(0, 0x49): /* sttilecfg */
+    case X86EMUL_OPC_VEX_F3(0, 0x4b): /* tilestored */
         state->desc = DstMem | SrcImplicit | Mov;
         break;
 
@@ -9632,6 +9634,99 @@ x86_emulate(
         }
         goto unrecognized_insn;
 
+    case X86EMUL_OPC_VEX_66(0x0f38, 0x4b): /* tileloaddt1 mem,tmm */
+    case X86EMUL_OPC_VEX_F3(0x0f38, 0x4b): /* tilestored tmm,mem */
+    case X86EMUL_OPC_VEX_F2(0x0f38, 0x4b): /* tileloadd mem,tmm */
+    {
+        struct x86_tilecfg *cfg = &mmvalp->tilecfg;
+        void *row;
+
+        generate_exception_if(!mode_64bit() || vex.l || vex.w || vex.reg != 0xf,
+                              EXC_UD);
+        host_and_vcpu_must_have(amx_tile);
+        get_fpu(X86EMUL_FPU_tile);
+        sttilecfg(cfg);
+        generate_exception_if(!tiles_configured(cfg), EXC_UD);
+        generate_exception_if(!tile_valid(modrm_reg, cfg), EXC_UD);
+        generate_exception_if(cfg->colsb[modrm_reg] & 3, EXC_UD);
+        i = cfg->start_row;
+        n = cfg->rows[modrm_reg];
+        generate_exception_if(i >= n, EXC_UD);
+
+        /* Calculate stride. */
+        if ( state->sib_index != 4 )
+            ea.val = truncate_ea(*decode_gpr(state->regs,
+                                             state->sib_index) <<
+                                 state->sib_scale);
+
+        if ( vex.pfx == vex_f3 )
+        {
+            /*
+             * hvmemul_linear_mmio_access() will find a cache slot based on
+             * linear address. hvmemul_phys_mmio_access() will crash the
+             * domain if observing varying data getting written to the same
+             * cache slot. Assume that squashing earlier writes to fully
+             * overlapping addresses is permitted (like for AVX-512 scatters).
+             */
+            if ( !ea.val )
+                i = n - 1;
+            else if ( !(ea.val &
+                        ((1ul << ((ad_bytes - sizeof(*cfg->rows)) * 8)) - 1)) )
+            {
+                unsigned int clr = __builtin_ffsl(ea.val) - 1;
+                unsigned int iter = 1u << (ad_bytes * 8 - clr);
+
+                if ( iter < n - i )
+                    i = n - iter;
+            }
+        }
+
+        row = get_tile_row_buf();
+        memset(row, -1,
+               ctxt->cpuid->tile.palette[cfg->palette].bytes_per_row);
+
+        /* Set up stub. */
+        opc = init_prefixes(stub);
+        opc[0] = b;
+        /* Convert memory operand to (%rax,%riz,1) */
+        vex.b = 1;
+        vex.x = 1;
+        opc[1] = modrm & 0x3f;
+        opc[2] = 0x20;
+        opc[3] = 0xc3;
+        copy_VEX(opc, vex);
+
+        do {
+            /* Limit rows to just as many to cover the next one to access. */
+            cfg->start_row = i;
+            cfg->rows[modrm_reg] = i + 1;
+            write_tilecfg(cfg);
+
+            if ( vex.pfx != vex_f3 )
+                rc = ops->read(ea.mem.seg,
+                               truncate_ea(ea.mem.off + i * ea.val),
+                               row, cfg->colsb[modrm_reg], ctxt);
+
+            invoke_stub("", "", "=m" (dummy) : "a" (row));
+
+            if ( vex.pfx == vex_f3 )
+                rc = ops->write(ea.mem.seg,
+                                truncate_ea(ea.mem.off + i * ea.val),
+                                row, cfg->colsb[modrm_reg], ctxt);
+        } while ( rc == X86EMUL_OKAY && ++i < n );
+
+        put_stub(stub);
+
+        if ( rc == X86EMUL_OKAY )
+            cfg->start_row = 0;
+        cfg->rows[modrm_reg] = n;
+        write_tilecfg(cfg);
+
+        state->simd_size = simd_none;
+        dst.type = OP_NONE;
+        break;
+    }
+
     case X86EMUL_OPC_VEX_66(0x0f38, 0x50): /* vpdpbusd [xy]mm/mem,[xy]mm,[xy]mm */
     case X86EMUL_OPC_VEX_66(0x0f38, 0x51): /* vpdpbusds [xy]mm/mem,[xy]mm,[xy]mm */
     case X86EMUL_OPC_VEX_66(0x0f38, 0x52): /* vpdpwssd [xy]mm/mem,[xy]mm,[xy]mm */
--- a/xen/arch/x86/xstate.c
+++ b/xen/arch/x86/xstate.c
@@ -16,6 +16,7 @@
 #include <asm/i387.h>
 #include <asm/xstate.h>
 #include <asm/asm_defns.h>
+#include <asm/x86-types.h>
 
 /*
  * Maximum size (in byte) of the XSAVE/XRSTOR save area required by all
@@ -840,6 +841,37 @@ uint64_t read_bndcfgu(void)
     return xstate->xsave_hdr.xstate_bv & X86_XCR0_BNDCSR ? bndcsr->bndcfgu : 0;
 }
 
+void write_tilecfg(const struct x86_tilecfg *tilecfg)
+{
+    unsigned long cr0 = read_cr0();
+    struct xsave_struct *xstate
+        = idle_vcpu[smp_processor_id()]->arch.xsave_area;
+
+    ASSERT(cpu_has_amx_tile);
+    clts();
+
+    memset(&xstate->xsave_hdr, 0, sizeof(xstate->xsave_hdr));
+    xstate->xsave_hdr.xstate_bv = X86_XCR0_TILECFG;
+
+    if ( cpu_has_xsavec )
+    {
+        xstate->xsave_hdr.xcomp_bv = XSTATE_COMPACTION_ENABLED |
+                                     X86_XCR0_TILECFG;
+
+        memcpy(xstate + 1, tilecfg, sizeof(*tilecfg));
+    }
+    else
+        memcpy((void *)xstate + xstate_offset(X86_XCR0_TILECFG_POS),
+               tilecfg, sizeof(*tilecfg));
+
+    asm volatile ( ".byte 0x0f,0xae,0x2f\n" /* xrstor */
+                   :: "m" (*xstate), "D" (xstate),
+                      "a" (X86_XCR0_TILECFG), "d" (0) );
+
+    if ( cr0 & X86_CR0_TS )
+        write_cr0(cr0);
+}
+
 void xstate_set_init(uint64_t mask)
 {
     unsigned long cr0 = read_cr0();
--- a/xen/include/asm-x86/xstate.h
+++ b/xen/include/asm-x86/xstate.h
@@ -93,6 +93,8 @@ uint64_t get_xcr0(void);
 void set_msr_xss(u64 xss);
 uint64_t get_msr_xss(void);
 uint64_t read_bndcfgu(void);
+struct x86_tilecfg;
+void write_tilecfg(const struct x86_tilecfg *tilecfg);
 void xsave(struct vcpu *v, uint64_t mask);
 void xrstor(struct vcpu *v, uint64_t mask);
 void xstate_set_init(uint64_t mask);



^ permalink raw reply	[flat|nested] 40+ messages in thread

* [PATCH v3 20/22] x86emul: support tile multiplication insns
  2021-04-22 14:38 [PATCH v3 00/22] xvmalloc() / x86 xstate area / x86 CPUID / AMX+XFD Jan Beulich
                   ` (18 preceding siblings ...)
  2021-04-22 14:55 ` [PATCH v3 19/22] x86emul: support TILELOADD{,T1} and TILESTORE Jan Beulich
@ 2021-04-22 14:56 ` Jan Beulich
  2021-04-22 14:57 ` [PATCH v3 21/22] x86emul: test AMX insns Jan Beulich
  2021-04-22 14:57 ` [PATCH v3 22/22] x86: permit guests to use AMX and XFD Jan Beulich
  21 siblings, 0 replies; 40+ messages in thread
From: Jan Beulich @ 2021-04-22 14:56 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, George Dunlap, Wei Liu, Roger Pau Monné

Since these don't allow for memory operands, the main thing to do here
is to check the large set of #UD conditions.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v3: New.

--- a/tools/tests/x86_emulator/predicates.c
+++ b/tools/tests/x86_emulator/predicates.c
@@ -1349,6 +1349,11 @@ static const struct vex {
     { { 0x58 }, 2, T, R, pfx_66, W0, Ln }, /* vpbroadcastd */
     { { 0x59 }, 2, T, R, pfx_66, W0, Ln }, /* vpbroadcastq */
     { { 0x5a }, 2, F, R, pfx_66, W0, L1 }, /* vbroadcasti128 */
+    { { 0x5c, 0xc0 }, 2, F, N, pfx_f3, W0, L0 }, /* tdpbf16ps */
+    { { 0x5e, 0xc0 }, 2, F, N, pfx_no, W0, L0 }, /* tdpbuud */
+    { { 0x5e, 0xc0 }, 2, F, N, pfx_66, W0, L0 }, /* tdpbusd */
+    { { 0x5e, 0xc0 }, 2, F, N, pfx_f3, W0, L0 }, /* tdpbsud */
+    { { 0x5e, 0xc0 }, 2, F, N, pfx_f2, W0, L0 }, /* tdpbssd */
     { { 0x78 }, 2, T, R, pfx_66, W0, Ln }, /* vpbroadcastb */
     { { 0x79 }, 2, T, R, pfx_66, W0, Ln }, /* vpbroadcastw */
     { { 0x8c }, 2, F, R, pfx_66, Wn, Ln }, /* vpmaskmov{d,q} */
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -487,6 +487,8 @@ static const struct ext0f38_table {
     [0x59] = { .simd_size = simd_other, .two_op = 1, .d8s = 3 },
     [0x5a] = { .simd_size = simd_128, .two_op = 1, .d8s = 4 },
     [0x5b] = { .simd_size = simd_256, .two_op = 1, .d8s = d8s_vl_by_2 },
+    [0x5c] = { .simd_size = simd_other },
+    [0x5e] = { .simd_size = simd_other },
     [0x62] = { .simd_size = simd_packed_int, .two_op = 1, .d8s = d8s_bw },
     [0x63] = { .simd_size = simd_packed_int, .to_mem = 1, .two_op = 1, .d8s = d8s_bw },
     [0x64 ... 0x66] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
@@ -2049,7 +2051,9 @@ amd_like(const struct x86_emulate_ctxt *
 #define vcpu_has_avx512_4fmaps() (ctxt->cpuid->feat.avx512_4fmaps)
 #define vcpu_has_avx512_vp2intersect() (ctxt->cpuid->feat.avx512_vp2intersect)
 #define vcpu_has_serialize()   (ctxt->cpuid->feat.serialize)
+#define vcpu_has_amx_bf16()    (ctxt->cpuid->feat.amx_bf16)
 #define vcpu_has_amx_tile()    (ctxt->cpuid->feat.amx_tile)
+#define vcpu_has_amx_int8()    (ctxt->cpuid->feat.amx_int8)
 #define vcpu_has_avx_vnni()    (ctxt->cpuid->feat.avx_vnni)
 #define vcpu_has_avx512_bf16() (ctxt->cpuid->feat.avx512_bf16)
 
@@ -9799,6 +9803,59 @@ x86_emulate(
         generate_exception_if(ea.type != OP_MEM || !vex.l || vex.w, EXC_UD);
         goto simd_0f_avx2;
 
+    case X86EMUL_OPC_VEX_F3(0x0f38, 0x5c): /* tdpbf16ps tmm,tmm,tmm */
+    case X86EMUL_OPC_VEX(0x0f38, 0x5e):    /* tdpbuud tmm,tmm,tmm */
+    case X86EMUL_OPC_VEX_66(0x0f38, 0x5e): /* tdpbusd tmm,tmm,tmm */
+    case X86EMUL_OPC_VEX_F3(0x0f38, 0x5e): /* tdpbsud tmm,tmm,tmm */
+    case X86EMUL_OPC_VEX_F2(0x0f38, 0x5e): /* tdpbssd tmm,tmm,tmm */
+    {
+        unsigned int vreg = vex.reg ^ 0xf;
+
+        if ( ea.type != OP_REG )
+            goto unimplemented_insn;
+        generate_exception_if(!mode_64bit() || vex.l || vex.w, EXC_UD);
+        if ( b == 0x5c )
+            host_and_vcpu_must_have(amx_bf16);
+        else
+            host_and_vcpu_must_have(amx_int8);
+        generate_exception_if(modrm_reg == modrm_rm, EXC_UD);
+        generate_exception_if(modrm_reg == vreg, EXC_UD);
+        generate_exception_if(modrm_rm == vreg, EXC_UD);
+
+        get_fpu(X86EMUL_FPU_tile);
+        sttilecfg(&mmvalp->tilecfg);
+        generate_exception_if(!tiles_configured(&mmvalp->tilecfg), EXC_UD);
+
+        /* accum: modrm_reg */
+        generate_exception_if(!tile_valid(modrm_reg, &mmvalp->tilecfg), EXC_UD);
+        /* src1: modrm_rm */
+        generate_exception_if(!tile_valid(modrm_rm, &mmvalp->tilecfg), EXC_UD);
+        /* src2: vreg */
+        generate_exception_if(!tile_valid(vreg, &mmvalp->tilecfg), EXC_UD);
+
+        generate_exception_if(mmvalp->tilecfg.colsb[modrm_reg] & 3, EXC_UD);
+        /*
+         * These are redundant with the check just below.
+        generate_exception_if(mmvalp->tilecfg.colsb[modrm_rm] & 3, EXC_UD);
+        generate_exception_if(mmvalp->tilecfg.colsb[vreg] & 3, EXC_UD);
+         */
+
+        generate_exception_if(mmvalp->tilecfg.rows[modrm_reg] !=
+                              mmvalp->tilecfg.rows[modrm_rm], EXC_UD);
+        generate_exception_if(mmvalp->tilecfg.colsb[modrm_reg] !=
+                              mmvalp->tilecfg.colsb[vreg], EXC_UD);
+        generate_exception_if(mmvalp->tilecfg.colsb[modrm_rm] !=
+                              mmvalp->tilecfg.rows[vreg] * 4, EXC_UD);
+
+        generate_exception_if(mmvalp->tilecfg.colsb[vreg] >
+                              ctxt->cpuid->tmul.maxn, EXC_UD);
+        generate_exception_if(mmvalp->tilecfg.rows[vreg] >
+                              ctxt->cpuid->tmul.maxk, EXC_UD);
+
+        op_bytes = 1; /* fake */
+        goto simd_0f_common;
+    }
+
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x62): /* vpexpand{b,w} [xyz]mm/mem,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x63): /* vpcompress{b,w} [xyz]mm,[xyz]mm/mem{k} */
         host_and_vcpu_must_have(avx512_vbmi2);
--- a/xen/include/asm-x86/cpufeature.h
+++ b/xen/include/asm-x86/cpufeature.h
@@ -133,7 +133,9 @@
 #define cpu_has_avx512_vp2intersect boot_cpu_has(X86_FEATURE_AVX512_VP2INTERSECT)
 #define cpu_has_tsx_force_abort boot_cpu_has(X86_FEATURE_TSX_FORCE_ABORT)
 #define cpu_has_serialize       boot_cpu_has(X86_FEATURE_SERIALIZE)
+#define cpu_has_amx_bf16        boot_cpu_has(X86_FEATURE_AMX_BF16)
 #define cpu_has_amx_tile        boot_cpu_has(X86_FEATURE_AMX_TILE)
+#define cpu_has_amx_int8        boot_cpu_has(X86_FEATURE_AMX_INT8)
 
 /* CPUID level 0x00000007:1.eax */
 #define cpu_has_avx_vnni        boot_cpu_has(X86_FEATURE_AVX_VNNI)



^ permalink raw reply	[flat|nested] 40+ messages in thread

* [PATCH v3 21/22] x86emul: test AMX insns
  2021-04-22 14:38 [PATCH v3 00/22] xvmalloc() / x86 xstate area / x86 CPUID / AMX+XFD Jan Beulich
                   ` (19 preceding siblings ...)
  2021-04-22 14:56 ` [PATCH v3 20/22] x86emul: support tile multiplication insns Jan Beulich
@ 2021-04-22 14:57 ` Jan Beulich
  2021-04-22 14:57 ` [PATCH v3 22/22] x86: permit guests to use AMX and XFD Jan Beulich
  21 siblings, 0 replies; 40+ messages in thread
From: Jan Beulich @ 2021-04-22 14:57 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, George Dunlap, Wei Liu, Roger Pau Monné

Carry out some basic matrix operations on 2x2, 3x3, and 4x4 matrices.

To also make use of a non-square matrix, additionally transpose matrices of said
square formats via linearization and multiplication by the respective
transposition permutation matrix. To generate the latter, introduce a
small helper tool. This is mainly to avoid creating / populating a
rather large matrix (up to 16x16) in a stack variable.
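
For illustration, the N=2 instance of the permutation matrix, in the
elem_t layout used by matrix.c, should come out of the helper as

    { { 1 }, { 0 }, { 0 }, { 0 } },
    { { 0 }, { 0 }, { 1 }, { 0 } },
    { { 0 }, { 1 }, { 0 }, { 0 } },
    { { 0 }, { 0 }, { 0 }, { 1 } },

i.e. multiplying the 1x4 row-major linearization (a b c d) of a 2x2
matrix by it from the right yields (a c b d), the linearization of the
transpose.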

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v3: New.

--- a/tools/tests/x86_emulator/Makefile
+++ b/tools/tests/x86_emulator/Makefile
@@ -25,6 +25,12 @@ SHA := sse4-sha avx-sha avx512f-sha
 GF := sse2-gf avx2-gf avx512bw-gf
 TESTCASES := blowfish $(SIMD) $(FMA) $(SG) $(AES) $(CLMUL) $(SHA) $(GF)
 
+MATRIX := amx-bf16 amx-int8
+
+ifeq ($(XEN_COMPILE_ARCH),x86_64)
+TESTCASES += $(MATRIX)
+endif
+
 OPMASK := avx512f avx512dq avx512bw
 
 ifeq ($(origin XEN_COMPILE_ARCH),override)
@@ -96,6 +102,13 @@ avx512f-opmask-vecs := 2
 avx512dq-opmask-vecs := 1 2
 avx512bw-opmask-vecs := 4 8
 
+amx-bf16-dims := 2 3 4
+amx-bf16-ints :=
+amx-bf16-flts := 2
+amx-int8-dims := 2 3 4
+amx-int8-ints := 1
+amx-int8-flts :=
+
 # Suppress building by default of the harness if the compiler can't deal
 # with some of the extensions used.  Don't alter the "run" target dependencies
 # though, as this target needs to be specified manually, and things may work
@@ -170,6 +183,18 @@ endef
 define opmask-defs
 $(1)-opmask-cflags := $(foreach vec,$($(1)-opmask-vecs), "-D_$(vec) -m$(1) -Os -DSIZE=$(vec)")
 endef
+amx-cflags-common = $(CFLAGS_xeninclude) -Os -DN=$(1) -DTPM_H=tpm-$(1)x$(1).h
+define matrix-defs
+$(1).h: $(foreach dim,$($(1)-dims),tpm-$(dim)x$(dim).h)
+$(1)-cflags :=
+$(1)-cflags-x86_64 := \
+	$(foreach dim,$($(1)-dims), \
+	  $(foreach flt,$($(1)-flts), \
+	    "-D_$(dim)x$(dim) -DFLOAT_SIZE=$(flt) $(call amx-cflags-common,$(dim))") \
+	  $(foreach int,$($(1)-ints), \
+	    "-Di_$(dim)x$(dim) -DINT_SIZE=$(int) $(call amx-cflags-common,$(dim))" \
+	    "-Du_$(dim)x$(dim) -DUINT_SIZE=$(int) $(call amx-cflags-common,$(dim))"))
+endef
 
 $(foreach flavor,$(SIMD) $(FMA),$(eval $(call simd-defs,$(flavor))))
 $(foreach flavor,$(SG),$(eval $(call simd-sg-defs,$(flavor))))
@@ -178,6 +203,7 @@ $(foreach flavor,$(CLMUL),$(eval $(call
 $(foreach flavor,$(SHA),$(eval $(call simd-sha-defs,$(flavor))))
 $(foreach flavor,$(GF),$(eval $(call simd-gf-defs,$(flavor))))
 $(foreach flavor,$(OPMASK),$(eval $(call opmask-defs,$(flavor))))
+$(foreach flavor,$(MATRIX),$(eval $(call matrix-defs,$(flavor))))
 
 first-string = $(shell for s in $(1); do echo "$$s"; break; done)
 
@@ -248,6 +274,9 @@ $(addsuffix .h,$(SIMD) $(FMA) $(SG) $(AE
 
 xop.h avx512f.h: simd-fma.c
 
+$(addsuffix .c,$(MATRIX)):
+	ln -sf matrix.c $@
+
 endif # 32-bit override
 
 $(TARGET): x86-emulate.o cpuid.o test_x86_emulator.o evex-disp8.o predicates.o wrappers.o
@@ -295,6 +324,12 @@ x86-emulate.o cpuid.o test_x86_emulator.
 x86-emulate.o: x86_emulate/x86_emulate.c
 x86-emulate.o: HOSTCFLAGS += -D__XEN_TOOLS__
 
+tpm-%.h: mktpm Makefile
+	set -x; ./$< $(subst x,$(space),$*) >$@
+
+mktpm: mktpm.c
+	$(HOSTCC) $(HOSTCFLAGS) -o $@ $<
+
 # In order for our custom .type assembler directives to reliably land after
 # gcc's, we need to keep it from re-ordering top-level constructs.
 $(call cc-option-add,HOSTCFLAGS-toplevel,HOSTCC,-fno-toplevel-reorder)
--- /dev/null
+++ b/tools/tests/x86_emulator/matrix.c
@@ -0,0 +1,229 @@
+#include <stdbool.h>
+
+typedef unsigned int __attribute__((mode(QI))) uint8_t;
+typedef unsigned int __attribute__((mode(HI))) uint16_t;
+
+#define stringify_(x...) #x
+#define stringify(x...)  stringify_(x)
+
+#include <xen/asm/x86-types.h>
+
+asm ( "\t.text\n"
+      "\t.globl _start\n"
+      "_start:\n"
+      "\tjmp matrix_test" );
+
+/*
+ * For the purposes here we consider the 32-bit elements to hold just a single
+ * value, with the other slots zero-filled. This way the 2- or 4-way dot
+ * products really end up as simple multiplications, allowing us to treat the
+ * underlying insns as simple matrix multiply-and-accumulate ones. With
+ * suitably in-range numbers, this also allows us to have the compiler deal
+ * with, in particular, the bf16 fields without it actually knowing of such a
+ * type.
+ *
+ * Notation in comments:
+ * I  - identity matrix (all ones on the main diagonal)
+ * AI - all ones on the antidiagonal
+ */
+
+typedef union {
+#ifdef FLOAT_SIZE
+# define MACC "tdpbf16ps"
+    float val;
+    float res;
+    struct {
+        unsigned int zero:16;
+        unsigned int bf16:16;
+    };
+#else
+# ifdef INT_SIZE
+#  define SIGNED signed
+#  define MACC "tdpbssd"
+# else
+#  define MACC "tdpbuud"
+#  define SIGNED unsigned
+# endif
+    SIGNED int res;
+    struct {
+        SIGNED   int val :8;
+        unsigned int zero:24;
+    };
+#endif
+} elem_t;
+
+typedef elem_t tile_t[N][N];
+
+static void ldtilecfg(const struct x86_tilecfg *cfg)
+{
+    asm volatile ( "ldtilecfg %0" :: "m" (*cfg) );
+}
+
+#define load_diag(r, v) ({ \
+    struct { \
+        elem_t arr[2 * N - 1]; \
+    } in = { .arr[N - 1].val = (v) }; \
+    asm volatile ( "tileloadd -%c[scale](%[base],%[stride],%c[scale]), %%" #r \
+                   :: [base] "r" (&in.arr[N]), \
+                      [stride] "r" (-1L), \
+                      [scale] "i" (sizeof(elem_t)), \
+                      "m" (in) ); \
+})
+
+#define load_antidiag(r, v) ({ \
+    struct { \
+        elem_t arr[2 * N - 1]; \
+    } in = { .arr[N - 1].val = (v) }; \
+    asm volatile ( "tileloadd (%[base],%[stride]), %%" #r \
+                   :: [base] "r" (&in.arr), \
+                      [stride] "r" (sizeof(elem_t)), \
+                      "m" (in) ); \
+})
+
+#define load_linear(r, t) ({ \
+    (void)((t) == (const tile_t *)0); \
+    asm volatile ( "tileloadd (%[base]), %%" #r \
+                   :: [base] "r" (t), \
+                      "m" (*(t)) ); \
+})
+
+static const elem_t tpm[N * N][N * N] = {
+#include stringify(TPM_H)
+};
+
+#define load_tpm(r) \
+    asm volatile ( "tileloadd (%[base],%[stride],%c[scale]), %%" #r \
+                   :: [base] "r" (&tpm), \
+                      [stride] "r" (N * N * 1L), \
+                      [scale] "i" (sizeof(elem_t)), \
+                      "m" (tpm) ); \
+
+#define store(t, r) ({ \
+    (void)((t) == (tile_t *)0); \
+    asm volatile ( "tilestored %%" #r ", (%[base],%[stride],%c[scale])" \
+                   /* "+m" to keep the compiler from eliminating fill(). */ \
+                   : "+m" (*(t)) \
+                   : [base] "r" (t), \
+                     [stride] "r" (N * 1L), \
+                     [scale] "i" (sizeof(elem_t)) ); \
+})
+
+#define macc(srcdst, src1, src2) \
+    asm volatile ( MACC " %" #src2 ", %" #src1 ", %" #srcdst )
+
+#define mul(dst, src1, src2) ({ \
+    asm volatile ( "tilezero %" #dst ); \
+    macc(dst, src1, src2); \
+})
+
+#define add(dst, src1, src2, scratch) ({ \
+    load_diag(scratch, 1); \
+    mul(dst, src1, scratch); \
+    macc(dst, scratch, src2); \
+})
+
+static inline void fill(tile_t *t)
+{
+    unsigned int cnt = N * N;
+
+    asm ( "repe stosl"
+          : "=m" (*t), "+D" (t), "+c" (cnt)
+          : "a" (~0) );
+}
+
+static inline bool zero(const tile_t *t)
+{
+    unsigned int cnt = N * N;
+    bool zf;
+
+    asm ( "repe scasl"
+          : "=@ccz" (zf), "+D" (t), "+c" (cnt)
+          : "m" (*t), "a" (0) );
+
+    return zf;
+}
+
+#define C(cols) ((cols) * sizeof(elem_t))
+#define R(rows) (rows)
+
+int matrix_test(void)
+{
+    struct x86_tilecfg cfg = {
+        .palette = 1,
+        .colsb   = { C(N), C(N), C(N), C(N), 0, C(N * N), C(N * N), C(N * N) },
+        .rows    = { R(N), R(N), R(N), R(N), 0, R(1),     R(1),     R(N * N) },
+    };
+    tile_t x;
+    unsigned int i, j;
+
+    ldtilecfg(&cfg);
+
+    fill(&x);
+    store(&x, tmm0);
+    if ( !zero(&x) ) return __LINE__;
+
+    /* Load and store I. */
+    fill(&x);
+    load_diag(tmm0, 1);
+    store(&x, tmm0);
+    for ( i = 0; i < N; ++i )
+        for ( j = 0; j < N; ++j )
+            if ( x[i][j].res != (i == j) )
+                return __LINE__;
+
+    /* I + AI */
+    fill(&x);
+    load_antidiag(tmm1, 1);
+    add(tmm2, tmm0, tmm1, tmm3);
+    store(&x, tmm2);
+    for ( i = 0; i < N; ++i )
+        for ( j = 0; j < N; ++j )
+            if ( i == j && i + j == N - 1 )
+            {
+                if ( x[i][j].res != 2 )
+                    return __LINE__;
+            }
+            else if ( i == j || i + j == N - 1 )
+            {
+                if ( x[i][j].res != 1 )
+                    return __LINE__;
+            }
+            else if ( x[i][j].res )
+                return __LINE__;
+
+#ifndef UINT_SIZE
+    /* I + AI * -AI == 0 */
+    fill(&x);
+    load_antidiag(tmm2, -1);
+    macc(tmm0, tmm1, tmm2);
+    store(&x, tmm0);
+    if ( !zero(&x) ) return __LINE__;
+#endif
+
+    /*
+     * Transpose a matrix via linearization and multiplication by the
+     * respective transposition permutation matrix. Note that linearization
+     * merely requires a different tile layout (see the initializer of cfg
+     * above).
+     */
+#ifdef UINT_SIZE
+# define VAL(r, c) ((c) < (r) ? (c) : (r) + (c) )
+#else
+# define VAL(r, c) ((c) < (r) ? -(r) : (r) + (c) )
+#endif
+    for ( i = 0; i < N; ++i )
+        for ( j = 0; j < N; ++j )
+            x[i][j].val = VAL(i, j);
+    load_linear(tmm6, &x);
+    load_tpm(tmm7);
+    mul(tmm5, tmm6, tmm7);
+    /* There's just a single row, so re-use plain store() here. */
+    store(&x, tmm5);
+    for ( i = 0; i < N; ++i )
+        for ( j = 0; j < N; ++j )
+            if ( x[i][j].res != VAL(j, i) )
+                return __LINE__;
+#undef VAL
+
+    return 0;
+}
--- /dev/null
+++ b/tools/tests/x86_emulator/mktpm.c
@@ -0,0 +1,41 @@
+/* make Transposition Permutation Matrix */
+
+#include <stdio.h>
+#include <stdlib.h>
+
+static void line(unsigned one, unsigned cols)
+{
+    unsigned i;
+
+    printf("    { ");
+    for ( i = 0; i < cols - 1; ++i )
+        printf("{ %d }, ", i == one);
+    printf("{ %d } },\n", i == one);
+}
+
+int main(int argc, char *argv[])
+{
+    unsigned i, j, m, n;
+
+    switch ( argc )
+    {
+    default:
+        fprintf(stderr, "Usage: %s <rows> [<cols>]\n", argv[0]);
+        return argc != 1;
+
+    case 3:
+        n = strtoul(argv[2], NULL, 0);
+        /* fall-through */
+    case 2:
+        m = strtoul(argv[1], NULL, 0);
+        if ( argc == 2 )
+            n = m;
+        break;
+    }
+
+    for ( i = 0; i < m * n; )
+        for ( j = i / n; j < m * n; j += m, ++i )
+            line(j, m * n);
+
+    return 0;
+}
--- a/tools/tests/x86_emulator/test_x86_emulator.c
+++ b/tools/tests/x86_emulator/test_x86_emulator.c
@@ -44,6 +44,11 @@ asm ( ".pushsection .test, \"ax\", @prog
 #include "avx512vbmi.h"
 #include "avx512vbmi2-vpclmulqdq.h"
 
+#ifdef __x86_64__
+#include "amx-bf16.h"
+#include "amx-int8.h"
+#endif
+
 #define verbose false /* Switch to true for far more logging. */
 
 static void blowfish_set_regs(struct cpu_user_regs *regs)
@@ -263,6 +268,33 @@ static bool simd_check_regs(const struct
     return false;
 }
 
+#ifdef __x86_64__
+
+static bool amx_check_bf16(void)
+{
+    return cp.feat.amx_bf16;
+}
+
+static bool amx_check_int8(void)
+{
+    return cp.feat.amx_int8;
+}
+
+static void amx_set_regs(struct cpu_user_regs *regs)
+{
+}
+
+static bool amx_check_regs(const struct cpu_user_regs *regs)
+{
+    asm volatile ( ".byte 0xc4, 0xe2, 0x78, 0x49, 0xc0" ); /* tilerelease */
+    if ( !regs->eax )
+        return true;
+    printf("[line %u] ", (unsigned int)regs->eax);
+    return false;
+}
+
+#endif
+
 static const struct {
     const void *code;
     size_t size;
@@ -534,6 +566,25 @@ static const struct {
 #undef AVX512VL
 #undef SIMD_
 #undef SIMD
+#ifdef __x86_64__
+# define AMX(desc, feat, t, dim)                                              \
+    { .code = amx_ ## feat ## _x86_64_D ## t ## _ ## dim ## x ## dim,         \
+      .size = sizeof(amx_ ## feat ## _x86_64_D ## t ## _ ## dim ## x ## dim), \
+      .bitness = 64, .name = "AMX-" #desc " (" #t #dim "x" #dim ")",          \
+      .check_cpu = amx_check_ ## feat,                                        \
+      .set_regs = amx_set_regs,                                               \
+      .check_regs = amx_check_regs }
+    AMX(BF16, bf16, , 2),
+    AMX(BF16, bf16, , 3),
+    AMX(BF16, bf16, , 4),
+    AMX(INT8, int8, i, 2),
+    AMX(INT8, int8, i, 3),
+    AMX(INT8, int8, i, 4),
+    AMX(INT8, int8, u, 2),
+    AMX(INT8, int8, u, 3),
+    AMX(INT8, int8, u, 4),
+# undef AMX
+#endif
 };
 
 static unsigned int bytes_read;



^ permalink raw reply	[flat|nested] 40+ messages in thread

* [PATCH v3 22/22] x86: permit guests to use AMX and XFD
  2021-04-22 14:38 [PATCH v3 00/22] xvmalloc() / x86 xstate area / x86 CPUID / AMX+XFD Jan Beulich
                   ` (20 preceding siblings ...)
  2021-04-22 14:57 ` [PATCH v3 21/22] x86emul: test AMX insns Jan Beulich
@ 2021-04-22 14:57 ` Jan Beulich
  21 siblings, 0 replies; 40+ messages in thread
From: Jan Beulich @ 2021-04-22 14:57 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, George Dunlap, Wei Liu, Roger Pau Monné

These features are marked experimental (as only parts of the code have
actually been tested so far, while other parts require the respective
hardware) and opt-in for guests.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v3: New.

--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -6,6 +6,9 @@ The format is based on [Keep a Changelog
 
 ## [unstable UNRELEASED](https://xenbits.xen.org/gitweb/?p=xen.git;a=shortlog;h=staging) - TBD
 
+### Added / support upgraded
+ - x86 AMX and XFD (Experimental)
+
 ## [4.15.0 UNRELEASED](https://xenbits.xen.org/gitweb/?p=xen.git;a=shortlog;h=RELEASE-4.15.0) - TBD
 
 ### Added / support upgraded
--- a/SUPPORT.md
+++ b/SUPPORT.md
@@ -61,6 +61,10 @@ For the Cortex A57 r0p0 - r1p1, see Erra
 
     Status: Tech Preview
 
+### x86 AMX and XFD
+
+    Status: Experimental
+
 ### IOMMU
 
     Status, AMD IOMMU: Supported
--- a/xen/include/public/arch-x86/cpufeatureset.h
+++ b/xen/include/public/arch-x86/cpufeatureset.h
@@ -191,7 +191,7 @@ XEN_CPUFEATURE(XSAVEOPT,      4*32+ 0) /
 XEN_CPUFEATURE(XSAVEC,        4*32+ 1) /*A  XSAVEC/XRSTORC instructions */
 XEN_CPUFEATURE(XGETBV1,       4*32+ 2) /*A  XGETBV with %ecx=1 */
 XEN_CPUFEATURE(XSAVES,        4*32+ 3) /*S  XSAVES/XRSTORS instructions */
-XEN_CPUFEATURE(XFD,           4*32+ 4) /*   XFD / XFD_ERR MSRs */
+XEN_CPUFEATURE(XFD,           4*32+ 4) /*a  XFD / XFD_ERR MSRs */
 
 /* Intel-defined CPU features, CPUID level 0x00000007:0.ebx, word 5 */
 XEN_CPUFEATURE(FSGSBASE,      5*32+ 0) /*A  {RD,WR}{FS,GS}BASE instructions */
@@ -269,9 +269,9 @@ XEN_CPUFEATURE(MD_CLEAR,      9*32+10) /
 XEN_CPUFEATURE(TSX_FORCE_ABORT, 9*32+13) /* MSR_TSX_FORCE_ABORT.RTM_ABORT */
 XEN_CPUFEATURE(SERIALIZE,     9*32+14) /*a  SERIALIZE insn */
 XEN_CPUFEATURE(CET_IBT,       9*32+20) /*   CET - Indirect Branch Tracking */
-XEN_CPUFEATURE(AMX_BF16,      9*32+22) /*   AMX BFloat16 instructions */
-XEN_CPUFEATURE(AMX_TILE,      9*32+24) /*   AMX tile architecture */
-XEN_CPUFEATURE(AMX_INT8,      9*32+25) /*   AMX 8-bit integer instructions */
+XEN_CPUFEATURE(AMX_BF16,      9*32+22) /*a  AMX BFloat16 instructions */
+XEN_CPUFEATURE(AMX_TILE,      9*32+24) /*a  AMX tile architecture */
+XEN_CPUFEATURE(AMX_INT8,      9*32+25) /*a  AMX 8-bit integer instructions */
 XEN_CPUFEATURE(IBRSB,         9*32+26) /*A  IBRS and IBPB support (used by Intel) */
 XEN_CPUFEATURE(STIBP,         9*32+27) /*A  STIBP */
 XEN_CPUFEATURE(L1D_FLUSH,     9*32+28) /*S  MSR_FLUSH_CMD and L1D flush. */



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v3 19/22] x86emul: support TILELOADD{,T1} and TILESTORE
  2021-04-22 14:55 ` [PATCH v3 19/22] x86emul: support TILELOADD{,T1} and TILESTORE Jan Beulich
@ 2021-04-22 15:06   ` Jan Beulich
  2021-04-22 15:11     ` Jan Beulich
  0 siblings, 1 reply; 40+ messages in thread
From: Jan Beulich @ 2021-04-22 15:06 UTC (permalink / raw)
  To: Paul Durrant
  Cc: Andrew Cooper, George Dunlap, Wei Liu, Roger Pau Monné, xen-devel

Paul,

On 22.04.2021 16:55, Jan Beulich wrote:
> +        do {
> +            /* Limit rows to just as many to cover the next one to access. */
> +            cfg->start_row = i;
> +            cfg->rows[modrm_reg] = i + 1;
> +            write_tilecfg(cfg);
> +
> +            if ( vex.pfx != vex_f3 )
> +                rc = ops->read(ea.mem.seg,
> +                               truncate_ea(ea.mem.off + i * ea.val),
> +                               row, cfg->colsb[modrm_reg], ctxt);
> +
> +            invoke_stub("", "", "=m" (dummy) : "a" (row));
> +
> +            if ( vex.pfx == vex_f3 )
> +                rc = ops->write(ea.mem.seg,
> +                                truncate_ea(ea.mem.off + i * ea.val),
> +                                row, cfg->colsb[modrm_reg], ctxt);
> +        } while ( rc == X86EMUL_OKAY && ++i < n );

in principle tiles could have rows larger than 64 bytes without any
separate CPUID feature flag qualifying this. struct hvm_mmio_cache,
otoh, has a fixed-size 64-byte buffer right now. Therefore I'm
wondering whether we'd want to switch to dynamically allocating that
to the minimum of 64 bytes and the size of a tile row, just as a
precautionary measure.
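
As a rough sketch of what I mean (illustrative only - it assumes the
buffer[] member gets turned into a flexible array member, and that a
suitable bound, here called tile_row_max_bytes, is available to
hvm/emulate.c):

    cache = xzalloc_flex_struct(struct hvm_mmio_cache, buffer,
                                max(64, tile_row_max_bytes));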

Jan


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v3 19/22] x86emul: support TILELOADD{,T1} and TILESTORE
  2021-04-22 15:06   ` Jan Beulich
@ 2021-04-22 15:11     ` Jan Beulich
  2021-04-26  7:12       ` Paul Durrant
  0 siblings, 1 reply; 40+ messages in thread
From: Jan Beulich @ 2021-04-22 15:11 UTC (permalink / raw)
  To: Paul Durrant
  Cc: Andrew Cooper, George Dunlap, Wei Liu, Roger Pau Monné, xen-devel

On 22.04.2021 17:06, Jan Beulich wrote:
> On 22.04.2021 16:55, Jan Beulich wrote:
>> +        do {
>> +            /* Limit rows to just as many to cover the next one to access. */
>> +            cfg->start_row = i;
>> +            cfg->rows[modrm_reg] = i + 1;
>> +            write_tilecfg(cfg);
>> +
>> +            if ( vex.pfx != vex_f3 )
>> +                rc = ops->read(ea.mem.seg,
>> +                               truncate_ea(ea.mem.off + i * ea.val),
>> +                               row, cfg->colsb[modrm_reg], ctxt);
>> +
>> +            invoke_stub("", "", "=m" (dummy) : "a" (row));
>> +
>> +            if ( vex.pfx == vex_f3 )
>> +                rc = ops->write(ea.mem.seg,
>> +                                truncate_ea(ea.mem.off + i * ea.val),
>> +                                row, cfg->colsb[modrm_reg], ctxt);
>> +        } while ( rc == X86EMUL_OKAY && ++i < n );
> 
> in principle tiles could have rows larger than 64 bytes without any
> separate CPUID feature flag qualifying this. struct hvm_mmio_cache,
> otoh, has a fixed-size 64-byte buffer right now. Therefore I'm
> wondering whether we'd want to switch to dynamically allocating that
> to the minimum of 64 bytes and the size of a tile row, just as a
> precautionary measure.

Actually, as it occurred to me only after sending, enlarging tile size
would under almost all circumstances require a new XSTATE component,
which we'd need to enable first. I consider it less likely that they'd
permit a wider range of layouts without increasing tile size. But we
might still want to play safe.

Jan


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v3 19/22] x86emul: support TILELOADD{,T1} and TILESTORE
  2021-04-22 15:11     ` Jan Beulich
@ 2021-04-26  7:12       ` Paul Durrant
  2021-04-29  9:40         ` Jan Beulich
  0 siblings, 1 reply; 40+ messages in thread
From: Paul Durrant @ 2021-04-26  7:12 UTC (permalink / raw)
  To: xen-devel

On 22/04/2021 16:11, Jan Beulich wrote:
> On 22.04.2021 17:06, Jan Beulich wrote:
>> On 22.04.2021 16:55, Jan Beulich wrote:
>>> +        do {
>>> +            /* Limit rows to just as many to cover the next one to access. */
>>> +            cfg->start_row = i;
>>> +            cfg->rows[modrm_reg] = i + 1;
>>> +            write_tilecfg(cfg);
>>> +
>>> +            if ( vex.pfx != vex_f3 )
>>> +                rc = ops->read(ea.mem.seg,
>>> +                               truncate_ea(ea.mem.off + i * ea.val),
>>> +                               row, cfg->colsb[modrm_reg], ctxt);
>>> +
>>> +            invoke_stub("", "", "=m" (dummy) : "a" (row));
>>> +
>>> +            if ( vex.pfx == vex_f3 )
>>> +                rc = ops->write(ea.mem.seg,
>>> +                                truncate_ea(ea.mem.off + i * ea.val),
>>> +                                row, cfg->colsb[modrm_reg], ctxt);
>>> +        } while ( rc == X86EMUL_OKAY && ++i < n );
>>
>> in principle tiles could have rows larger than 64 bytes without any
>> separate CPUID feature flag qualifying this. struct hvm_mmio_cache,
>> otoh, has a fixed-size 64-byte buffer right now. Therefore I'm
>> wondering whether we'd want to switch to dynamically allocating that
>> to the minimum of 64 bytes and the size of a tile row, just as a
>> precautionary measure.
> 
> Actually, as it occurred to me only after sending, enlarging tile size
> would under almost all circumstances require a new XSTATE component,
> which we'd need to enable first. I consider it less likely that they'd
> permit a wider range of layouts without increasing tile size. But we
> might still want to play safe.
> 

I guess on-demand reallocation to a larger size would be fine. Certainly 
we want to be sure we don't overflow.

   Paul

> Jan
> 



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v3 19/22] x86emul: support TILELOADD{,T1} and TILESTORE
  2021-04-26  7:12       ` Paul Durrant
@ 2021-04-29  9:40         ` Jan Beulich
  0 siblings, 0 replies; 40+ messages in thread
From: Jan Beulich @ 2021-04-29  9:40 UTC (permalink / raw)
  To: paul; +Cc: xen-devel

On 26.04.2021 09:12, Paul Durrant wrote:
> On 22/04/2021 16:11, Jan Beulich wrote:
>> On 22.04.2021 17:06, Jan Beulich wrote:
>>> On 22.04.2021 16:55, Jan Beulich wrote:
>>>> +        do {
>>>> +            /* Limit rows to just as many to cover the next one to access. */
>>>> +            cfg->start_row = i;
>>>> +            cfg->rows[modrm_reg] = i + 1;
>>>> +            write_tilecfg(cfg);
>>>> +
>>>> +            if ( vex.pfx != vex_f3 )
>>>> +                rc = ops->read(ea.mem.seg,
>>>> +                               truncate_ea(ea.mem.off + i * ea.val),
>>>> +                               row, cfg->colsb[modrm_reg], ctxt);
>>>> +
>>>> +            invoke_stub("", "", "=m" (dummy) : "a" (row));
>>>> +
>>>> +            if ( vex.pfx == vex_f3 )
>>>> +                rc = ops->write(ea.mem.seg,
>>>> +                                truncate_ea(ea.mem.off + i * ea.val),
>>>> +                                row, cfg->colsb[modrm_reg], ctxt);
>>>> +        } while ( rc == X86EMUL_OKAY && ++i < n );
>>>
>>> in principle tiles could have rows larger than 64 bytes without any
>>> separate CPUID feature flag qualifying this. struct hvm_mmio_cache,
>>> otoh, has a fixed-size 64-byte buffer right now. Therefore I'm
>>> wondering whether we'd want to switch to dynamically allocating that
>>> to the minimum of 64 bytes and the size of a tile row, just as a
>>> precautionary measure.
>>
>> Actually, as it occurred to me only after sending, enlarging tile size
>> would under almost all circumstances require a new XSTATE component,
>> which we'd need to enable first. I consider it less likely that they'd
>> permit a wider range of layouts without increasing tile size. But we
>> might still want to play safe.
>>
> 
> I guess on-demand reallocation to a larger size would be fine. Certainly 
> we want to be sure we don't overflow.

Okay, I've added a patch doing not just this, but (perhaps even more
importantly) also increasing struct hvmemul_cache's capacity on such
hardware.

Jan


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v3 01/22] mm: introduce xvmalloc() et al and use for grant table allocations
  2021-04-22 14:43 ` [PATCH v3 01/22] mm: introduce xvmalloc() et al and use for grant table allocations Jan Beulich
@ 2021-05-03 11:31   ` Roger Pau Monné
  2021-05-03 13:50     ` Jan Beulich
  0 siblings, 1 reply; 40+ messages in thread
From: Roger Pau Monné @ 2021-05-03 11:31 UTC (permalink / raw)
  To: Jan Beulich
  Cc: xen-devel, Andrew Cooper, George Dunlap, Ian Jackson,
	Julien Grall, Stefano Stabellini, Wei Liu

On Thu, Apr 22, 2021 at 04:43:39PM +0200, Jan Beulich wrote:
> All of the array allocations in grant_table_init() can exceed a page's
> worth of memory, which xmalloc()-based interfaces aren't really suitable
> for after boot. We also don't need any of these allocations to be
> physically contiguous.. Introduce interfaces dynamically switching
> between xmalloc() et al and vmalloc() et al, based on requested size,
> and use them instead.
> 
> All the wrappers in the new header get cloned mostly verbatim from
> xmalloc.h, with the sole adjustment to switch unsigned long to size_t
> for sizes and to unsigned int for alignments.

We seem to be growing a non-trivial amount of memory allocation
families of functions: xmalloc, vmalloc and now xvmalloc.

I think from a consumer PoV it would make sense to only have two of
those: one for allocations that require to be physically contiguous,
and one for allocation that don't require it.

Even then, requesting for physically contiguous allocations could be
done by passing a flag to the same interface that's used for
non-contiguous allocations.
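
Purely as a hypothetical sketch of what I mean (no such interface
exists anywhere right now):

    void *xalloc(size_t size, unsigned int align, unsigned int flags);
    #define XALLOC_PHYS_CONTIG (1u << 0)

where passing the flag would request physically contiguous memory.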

Maybe another option would be to expand the existing
v{malloc,realloc,...} set of functions to have your proposed behaviour
for xv{malloc,realloc,...}?

> --- /dev/null
> +++ b/xen/include/xen/xvmalloc.h
> @@ -0,0 +1,73 @@
> +
> +#ifndef __XVMALLOC_H__
> +#define __XVMALLOC_H__
> +
> +#include <xen/cache.h>
> +#include <xen/types.h>
> +
> +/*
> + * Xen malloc/free-style interface for allocations possibly exceeding a page's
> + * worth of memory, as long as there's no need to have physically contiguous
> + * memory allocated.  These should be used in preference to xmalloc() et al
> + * whenever the size is not known to be constrained to at most a single page.

Even when it's known that size <= PAGE_SIZE these helpers are
appropriate as they would end up using xmalloc, so I think it's fine to
recommend them universally as long as there's no need to alloc
physically contiguous memory?

Granted there's a bit more overhead from the logic to decide between
using xmalloc or vmalloc &c, but that's IMO not that big of a deal in
order to not recommend this interface globally for non-contiguous
alloc.
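
AIUI the dispatch boils down to just a size check, along the lines of
(sketch only, not the actual patch code):

    void *_xvmalloc(size_t size, unsigned int align)
    {
        return size <= PAGE_SIZE ? _xmalloc(size, align) : vmalloc(size);
    }

so for sizes up to a page the result is the same as calling xmalloc()
directly.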

> + */
> +
> +/* Allocate space for typed object. */
> +#define xvmalloc(_type) ((_type *)_xvmalloc(sizeof(_type), __alignof__(_type)))
> +#define xvzalloc(_type) ((_type *)_xvzalloc(sizeof(_type), __alignof__(_type)))
> +
> +/* Allocate space for array of typed objects. */
> +#define xvmalloc_array(_type, _num) \
> +    ((_type *)_xvmalloc_array(sizeof(_type), __alignof__(_type), _num))
> +#define xvzalloc_array(_type, _num) \
> +    ((_type *)_xvzalloc_array(sizeof(_type), __alignof__(_type), _num))
> +
> +/* Allocate space for a structure with a flexible array of typed objects. */
> +#define xvzalloc_flex_struct(type, field, nr) \
> +    ((type *)_xvzalloc(offsetof(type, field[nr]), __alignof__(type)))
> +
> +#define xvmalloc_flex_struct(type, field, nr) \
> +    ((type *)_xvmalloc(offsetof(type, field[nr]), __alignof__(type)))
> +
> +/* Re-allocate space for a structure with a flexible array of typed objects. */
> +#define xvrealloc_flex_struct(ptr, field, nr)                          \
> +    ((typeof(ptr))_xvrealloc(ptr, offsetof(typeof(*(ptr)), field[nr]), \
> +                             __alignof__(typeof(*(ptr)))))
> +
> +/* Allocate untyped storage. */
> +#define xvmalloc_bytes(_bytes) _xvmalloc(_bytes, SMP_CACHE_BYTES)
> +#define xvzalloc_bytes(_bytes) _xvzalloc(_bytes, SMP_CACHE_BYTES)

I see xmalloc does the same, wouldn't it be enough to align to a lower
value? Seems quite wasteful to align to 128 on x86 by default?

> +
> +/* Free any of the above. */
> +extern void xvfree(void *);
> +
> +/* Free an allocation, and zero the pointer to it. */
> +#define XVFREE(p) do { \
> +    xvfree(p);         \
> +    (p) = NULL;        \
> +} while ( false )
> +
> +/* Underlying functions */
> +extern void *_xvmalloc(size_t size, unsigned int align);
> +extern void *_xvzalloc(size_t size, unsigned int align);
> +extern void *_xvrealloc(void *ptr, size_t size, unsigned int align);

Nit: I would drop the 'extern' keyword from the function prototypes.

Thanks, Roger.


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v3 04/22] x86/xstate: re-use valid_xcr0() for boot-time checks
  2021-04-22 14:45 ` [PATCH v3 04/22] x86/xstate: re-use valid_xcr0() for boot-time checks Jan Beulich
@ 2021-05-03 11:53   ` Andrew Cooper
  0 siblings, 0 replies; 40+ messages in thread
From: Andrew Cooper @ 2021-05-03 11:53 UTC (permalink / raw)
  To: Jan Beulich, xen-devel; +Cc: George Dunlap, Wei Liu, Roger Pau Monné

On 22/04/2021 15:45, Jan Beulich wrote:
> Instead of (just partially) open-coding it, re-use the function after
> suitably moving it up.
>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v3 01/22] mm: introduce xvmalloc() et al and use for grant table allocations
  2021-05-03 11:31   ` Roger Pau Monné
@ 2021-05-03 13:50     ` Jan Beulich
  2021-05-03 14:54       ` Roger Pau Monné
  0 siblings, 1 reply; 40+ messages in thread
From: Jan Beulich @ 2021-05-03 13:50 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: xen-devel, Andrew Cooper, George Dunlap, Ian Jackson,
	Julien Grall, Stefano Stabellini, Wei Liu

On 03.05.2021 13:31, Roger Pau Monné wrote:
> On Thu, Apr 22, 2021 at 04:43:39PM +0200, Jan Beulich wrote:
>> All of the array allocations in grant_table_init() can exceed a page's
>> worth of memory, which xmalloc()-based interfaces aren't really suitable
>> for after boot. We also don't need any of these allocations to be
>> physically contiguous.. Introduce interfaces dynamically switching
>> between xmalloc() et al and vmalloc() et al, based on requested size,
>> and use them instead.
>>
>> All the wrappers in the new header get cloned mostly verbatim from
>> xmalloc.h, with the sole adjustment to switch unsigned long to size_t
>> for sizes and to unsigned int for alignments.
> 
> We seem to be growing a non-trivial amount of memory allocation
> families of functions: xmalloc, vmalloc and now xvmalloc.
> 
> I think from a consumer PoV it would make sense to only have two of
> those: one for allocations that require to be physically contiguous,
> and one for allocation that don't require it.
> 
> Even then, requesting for physically contiguous allocations could be
> done by passing a flag to the same interface that's used for
> non-contiguous allocations.
> 
> Maybe another option would be to expand the existing
> v{malloc,realloc,...} set of functions to have your proposed behaviour
> for xv{malloc,realloc,...}?

All of this and some of your remarks further down have already been
discussed. A working group has been formed. No progress since. Yes,
a smaller set of interfaces may be the way to go. Controlling
behavior via flags, otoh, is very much not malloc()-like. Making
existing functions have the intended new behavior is a no-go without
auditing all present uses, to find those few which actually may need
physically contiguous allocations.

Having seen similar naming elsewhere, I did propose xnew() /
xdelete() (plus array and flex-struct counterparts) as the single
new recommended interface; didn't hear back yet. But we'd switch to
that gradually, so intermediately there would still be a larger set
of interfaces.

I'm not convinced we should continue to have byte-granular allocation
functions producing physically contiguous memory. I think the page
allocator should be used directly in such cases.

>> --- /dev/null
>> +++ b/xen/include/xen/xvmalloc.h
>> @@ -0,0 +1,73 @@
>> +
>> +#ifndef __XVMALLOC_H__
>> +#define __XVMALLOC_H__
>> +
>> +#include <xen/cache.h>
>> +#include <xen/types.h>
>> +
>> +/*
>> + * Xen malloc/free-style interface for allocations possibly exceeding a page's
>> + * worth of memory, as long as there's no need to have physically contiguous
>> + * memory allocated.  These should be used in preference to xmalloc() et al
>> + * whenever the size is not known to be constrained to at most a single page.
> 
> Even when it's known that size <= PAGE_SIZE these helpers are
> appropriate as they would end up using xmalloc, so I think it's fine to
> recommend them universally as long as there's no need to alloc
> physically contiguous memory?
> 
> Granted there's a bit more overhead from the logic to decide between
> using xmalloc or vmalloc &c, but that's IMO not that big of a deal in
> order to not recommend this interface globally for non-contiguous
> alloc.

As long as xmalloc() and vmalloc() are meant to stay around as separate
interfaces, I wouldn't want to "forbid" their use when it's sufficiently
clear that they would be chosen by the new function anyway. Otoh, if the
new function became more powerful in terms of falling back to the
respectively other lower level function, that might be an argument in
favor of always using the new interfaces.

>> + */
>> +
>> +/* Allocate space for typed object. */
>> +#define xvmalloc(_type) ((_type *)_xvmalloc(sizeof(_type), __alignof__(_type)))
>> +#define xvzalloc(_type) ((_type *)_xvzalloc(sizeof(_type), __alignof__(_type)))
>> +
>> +/* Allocate space for array of typed objects. */
>> +#define xvmalloc_array(_type, _num) \
>> +    ((_type *)_xvmalloc_array(sizeof(_type), __alignof__(_type), _num))
>> +#define xvzalloc_array(_type, _num) \
>> +    ((_type *)_xvzalloc_array(sizeof(_type), __alignof__(_type), _num))
>> +
>> +/* Allocate space for a structure with a flexible array of typed objects. */
>> +#define xvzalloc_flex_struct(type, field, nr) \
>> +    ((type *)_xvzalloc(offsetof(type, field[nr]), __alignof__(type)))
>> +
>> +#define xvmalloc_flex_struct(type, field, nr) \
>> +    ((type *)_xvmalloc(offsetof(type, field[nr]), __alignof__(type)))
>> +
>> +/* Re-allocate space for a structure with a flexible array of typed objects. */
>> +#define xvrealloc_flex_struct(ptr, field, nr)                          \
>> +    ((typeof(ptr))_xvrealloc(ptr, offsetof(typeof(*(ptr)), field[nr]), \
>> +                             __alignof__(typeof(*(ptr)))))
>> +
>> +/* Allocate untyped storage. */
>> +#define xvmalloc_bytes(_bytes) _xvmalloc(_bytes, SMP_CACHE_BYTES)
>> +#define xvzalloc_bytes(_bytes) _xvzalloc(_bytes, SMP_CACHE_BYTES)
> 
> I see xmalloc does the same, wouldn't it be enough to align to a lower
> value? Seems quite wasteful to align to 128 on x86 by default?

Yes, it would. Personally (see "[PATCH v2 0/8] assorted replacement of
x[mz]alloc_bytes()") I think these ..._bytes() wrappers should all go
away. Hence I don't think it's very important how exactly they behave,
and in turn it's then best to have them match x[mz]alloc_bytes().

>> +
>> +/* Free any of the above. */
>> +extern void xvfree(void *);
>> +
>> +/* Free an allocation, and zero the pointer to it. */
>> +#define XVFREE(p) do { \
>> +    xvfree(p);         \
>> +    (p) = NULL;        \
>> +} while ( false )
>> +
>> +/* Underlying functions */
>> +extern void *_xvmalloc(size_t size, unsigned int align);
>> +extern void *_xvzalloc(size_t size, unsigned int align);
>> +extern void *_xvrealloc(void *ptr, size_t size, unsigned int align);
> 
> Nit: I would drop the 'extern' keyword from the function prototypes.

Ah yes, will do. Simply a result of taking the other header as basis.

Jan


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v3 03/22] x86/xstate: re-size save area when CPUID policy changes
  2021-04-22 14:44 ` [PATCH v3 03/22] x86/xstate: re-size save area when CPUID policy changes Jan Beulich
@ 2021-05-03 13:57   ` Andrew Cooper
  2021-05-03 14:22     ` Jan Beulich
  0 siblings, 1 reply; 40+ messages in thread
From: Andrew Cooper @ 2021-05-03 13:57 UTC (permalink / raw)
  To: Jan Beulich, xen-devel
  Cc: George Dunlap, Ian Jackson, Julien Grall, Stefano Stabellini,
	Wei Liu, Roger Pau Monné

On 22/04/2021 15:44, Jan Beulich wrote:
> vCPU-s get maximum size areas allocated initially. Hidden (and in
> particular default-off) features may allow for a smaller size area to
> suffice.
>
> Suggested-by: Andrew Cooper <andrew.cooper3@citrix.com>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> ---
> v2: Use 1ul instead of 1ull. Re-base.
> ---
> This could be further shrunk if we used XSAVEC / if we really used
> XSAVES, as then we don't need to also cover the holes. But since we
> currently use neither of the two in reality, this would require more
> work than just adding the alternative size calculation here.
>
> Seeing that both vcpu_init_fpu() and cpuid_policy_updated() get called
> from arch_vcpu_create(), I'm not sure we really need this two-stage
> approach - the slightly longer period of time during which
> v->arch.xsave_area would remain NULL doesn't look all that problematic.
> But since xstate_alloc_save_area() gets called for idle vCPU-s, it has
> to stay anyway in some form, so the extra code churn may not be worth
> it.
>
> Instead of cpuid_policy_xcr0_max(), cpuid_policy_xstates() may be the
> interface to use here. But it remains to be determined whether the
> xcr0_accum field is meant to be inclusive of XSS (in which case it would
> better be renamed) or exclusive. Right now there's no difference as we
> don't support any XSS-controlled features.

I've been figuring out what we need to do for supervisor states.  The
current code is not in a good shape, but I also think some of the
changes in this series take us in an unhelpful direction.

I've got a cleanup series which I will post shortly.  It interacts
textually although not fundamentally with this series, but does fix
several issues.

For supervisor states, we need to use XSAVES unilaterally, even for PV. 
This is because XSS_CET_S needs to be the HVM kernel's context, or Xen's
in PV context (specifically, MSR_PL0_SSP which is the shstk equivalent
of TSS.RSP0).


A consequence is that Xen's data handling shall use the compressed
format, and include supervisor states.  (While in principle we could
manage CET_S, CET_U, and potentially PT when vmtrace gets expanded, each
WRMSR there is a similar order of magnitude to an XSAVES/XRSTORS
instruction.)

I'm planning a host xss setting, similar to mmu_cr4_features, which
shall be the setting in context for everything other than HVM vcpus
(which need the guest setting in context, and/or the VT-x bodge to
support host-only states).  Amongst other things, all context switch
paths, including from-HVM, need to step XSS up to the host setting to
let XSAVES function correctly.
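
Roughly, every switch-out path would then do something like this (sketch
only - host_xss doesn't exist yet, and the helper names here are merely
placeholders for whatever the real save path ends up being):

    /* Sketch: step XSS up to the host value before saving with XSAVES. */
    static void vcpu_save_xstate(struct vcpu *v)
    {
        if ( cpu_has_xsaves && get_msr_xss() != host_xss )
            set_msr_xss(host_xss);

        /* Save both guest-visible and host-only (supervisor) states. */
        xsave(v, v->arch.xcr0_accum | host_xss);
    }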

However, a consequence of this is that the size of the xsave area needs
deriving from host, as well as guest-max state.  i.e. even if some VMs
aren't using CET, we still need space in the xsave areas to function
correctly when a single VM is using CET.

Another consequence is that we need to rethink our hypercall behaviour. 
There is no such thing as supervisor states in an uncompressed XSAVE
image, which means we can't continue with that being the ABI.

I've also found some substantial issues with how we handle
xcr0/xcr0_accum and plan to address these.  There is no such thing as
xcr0 without the bottom bit set, ever, and xcr0_accum needs to default
to X87|SSE seeing as that's how we use it anyway.  However, in a context
switch, I expect we'll still be using xcr0_accum | host_xss when it
comes to the context switch path.

In terms of actual context switching, we want to be using XSAVES/XRSTORS
whenever it is available, even if we're not using supervisor states. 
XSAVES has both the inuse and modified optimisations, without the broken
consequence of XSAVEOPT (which is firmly in the "don't ever use this"
bucket now).

There's no point ever using XSAVEC.  There is no hardware where it
exists in the absence of XSAVES, nor can there be even in theoretical
circumstances, due to (perhaps unintentional) linkage of the CPUID data. 
XSAVEC also doesn't use the modified optimisation, and is therefore
strictly worse than XSAVES, even when MSR_XSS is 0.

Therefore, our choice of instruction wants to be XSAVES, or XSAVE, or
FXSAVE, depending on hardware capability.
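
Put as code, the choice boils down to the following (sketch only - the
eventual code would presumably use alternatives patching rather than
runtime checks, and the helper names below are placeholders):

    static void fpu_xsave_dispatch(struct vcpu *v)
    {
        if ( cpu_has_xsaves )
            save_xsaves(v);  /* compressed; in-use + modified optimisations */
        else if ( cpu_has_xsave )
            save_xsave(v);   /* uncompressed; deliberately not XSAVEOPT */
        else
            save_fxsave(v);  /* legacy x87/SSE state only */
    }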

~Andrew



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v3 03/22] x86/xstate: re-size save area when CPUID policy changes
  2021-05-03 13:57   ` Andrew Cooper
@ 2021-05-03 14:22     ` Jan Beulich
  2021-05-11 16:41       ` Andrew Cooper
  0 siblings, 1 reply; 40+ messages in thread
From: Jan Beulich @ 2021-05-03 14:22 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: George Dunlap, Ian Jackson, Julien Grall, Stefano Stabellini,
	Wei Liu, Roger Pau Monné,
	xen-devel

On 03.05.2021 15:57, Andrew Cooper wrote:
> On 22/04/2021 15:44, Jan Beulich wrote:
>> vCPU-s get maximum size areas allocated initially. Hidden (and in
>> particular default-off) features may allow for a smaller size area to
>> suffice.
>>
>> Suggested-by: Andrew Cooper <andrew.cooper3@citrix.com>
>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>> ---
>> v2: Use 1ul instead of 1ull. Re-base.
>> ---
>> This could be further shrunk if we used XSAVEC / if we really used
>> XSAVES, as then we don't need to also cover the holes. But since we
>> currently use neither of the two in reality, this would require more
>> work than just adding the alternative size calculation here.
>>
>> Seeing that both vcpu_init_fpu() and cpuid_policy_updated() get called
>> from arch_vcpu_create(), I'm not sure we really need this two-stage
>> approach - the slightly longer period of time during which
>> v->arch.xsave_area would remain NULL doesn't look all that problematic.
>> But since xstate_alloc_save_area() gets called for idle vCPU-s, it has
>> to stay anyway in some form, so the extra code churn may not be worth
>> it.
>>
>> Instead of cpuid_policy_xcr0_max(), cpuid_policy_xstates() may be the
>> interface to use here. But it remains to be determined whether the
>> xcr0_accum field is meant to be inclusive of XSS (in which case it would
>> better be renamed) or exclusive. Right now there's no difference as we
>> don't support any XSS-controlled features.
> 
> I've been figuring out what we need to do for supervisor states.  The
> current code is not in a good shape, but I also think some of the
> changes in this series take us in an unhelpful direction.

From reading through the rest of your reply I'm not sure I see what you
mean. ORing in host_xss at certain points shouldn't be a big deal.

> I've got a cleanup series which I will post shortly.  It interacts
> textually although not fundamentally with this series, but does fix
> several issues.
> 
> For supervisor states, we need to use XSAVES unilaterally, even for PV. 
> This is because XSS_CET_S needs to be the HVM kernel's context, or Xen's
> in PV context (specifically, MSR_PL0_SSP which is the shstk equivalent
> of TSS.RSP0).
> 
> 
> A consequence is that Xen's data handling shall use the compressed
> format, and include supervisor states.  (While in principle we could
> manage CET_S, CET_U, and potentially PT when vmtrace gets expanded, each
> WRMSR there is a similar order of magnitude to an XSAVES/XRSTORS
> instruction.)

I agree.

> I'm planning a host xss setting, similar to mmu_cr4_features, which
> shall be the setting in context for everything other than HVM vcpus
> (which need the guest setting in context, and/or the VT-x bodge to
> support host-only states).  Amongst other things, all context switch
> paths, including from-HVM, need to step XSS up to the host setting to
> let XSAVES function correctly.
> 
> However, a consequence of this is that the size of the xsave area needs
> deriving from host, as well as guest-max state.  i.e. even if some VMs
> aren't using CET, we still need space in the xsave areas to function
> correctly when a single VM is using CET.

Right - as said above, taking this into consideration here shouldn't
be overly problematic.

> Another consequence is that we need to rethink our hypercall behaviour. 
> There is no such thing as supervisor states in an uncompressed XSAVE
> image, which means we can't continue with that being the ABI.

I don't think the hypercall input / output blob needs to follow any
specific hardware layout.

> I've also found some substantial issues with how we handle
> xcr0/xcr0_accum and plan to address these.  There is no such thing as
> xcr0 without the bottom bit set, ever, and xcr0_accum needs to default
> to X87|SSE seeing as that's how we use it anyway.  However, in a context
> switch, I expect we'll still be using xcr0_accum | host_xss when it
> comes to the context switch path.

Right, and to avoid confusion I think we also want to move from
xcr0_accum to e.g. xstate_accum, covering both XCR0 and XSS parts
all in one go.

> In terms of actual context switching, we want to be using XSAVES/XRSTORS
> whenever it is available, even if we're not using supervisor states. 
> XSAVES has both the inuse and modified optimisations, without the broken
> consequence of XSAVEOPT (which is firmly in the "don't ever use this"
> bucket now).

The XSAVEOPT anomaly is affecting user mode only, isn't it? Or are
you talking of something I have forgotten about?

> There's no point ever using XSAVEC.  There is no hardware where it
> exists in the absence of XSAVES, nor can there be even in theoretical
> circumstances, due to (perhaps unintentional) linkage of the CPUID data. 
> XSAVEC also doesn't use the modified optimisation, and is therefore
> strictly worse than XSAVES, even when MSR_XSS is 0.
> 
> Therefore, our choice of instruction wants to be XSAVES, or XSAVE, or
> FXSAVE, depending on hardware capability.

Makes sense to me (perhaps - see above - minus your omission of
XSAVEOPT here).

Jan


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v3 01/22] mm: introduce xvmalloc() et al and use for grant table allocations
  2021-05-03 13:50     ` Jan Beulich
@ 2021-05-03 14:54       ` Roger Pau Monné
  2021-05-03 15:21         ` Jan Beulich
  0 siblings, 1 reply; 40+ messages in thread
From: Roger Pau Monné @ 2021-05-03 14:54 UTC (permalink / raw)
  To: Jan Beulich
  Cc: xen-devel, Andrew Cooper, George Dunlap, Ian Jackson,
	Julien Grall, Stefano Stabellini, Wei Liu

On Mon, May 03, 2021 at 03:50:48PM +0200, Jan Beulich wrote:
> On 03.05.2021 13:31, Roger Pau Monné wrote:
> > On Thu, Apr 22, 2021 at 04:43:39PM +0200, Jan Beulich wrote:
> >> All of the array allocations in grant_table_init() can exceed a page's
> >> worth of memory, which xmalloc()-based interfaces aren't really suitable
> >> for after boot. We also don't need any of these allocations to be
> >> physically contiguous.. Introduce interfaces dynamically switching
> >> between xmalloc() et al and vmalloc() et al, based on requested size,
> >> and use them instead.
> >>
> >> All the wrappers in the new header get cloned mostly verbatim from
> >> xmalloc.h, with the sole adjustment to switch unsigned long to size_t
> >> for sizes and to unsigned int for alignments.
> > 
> > We seem to be growing a non-trivial amount of memory allocation
> > families of functions: xmalloc, vmalloc and now xvmalloc.
> > 
> > I think from a consumer PoV it would make sense to only have two of
> > those: one for allocations that require to be physically contiguous,
> > and one for allocation that don't require it.
> > 
> > Even then, requesting for physically contiguous allocations could be
> > done by passing a flag to the same interface that's used for
> > non-contiguous allocations.
> > 
> > Maybe another option would be to expand the existing
> > v{malloc,realloc,...} set of functions to have your proposed behaviour
> > for xv{malloc,realloc,...}?
> 
> All of this and some of your remarks further down have already been
> discussed. A working group has been formed. No progress since. Yes,
> a smaller set of interfaces may be the way to go. Controlling
> behavior via flags, otoh, is very much not malloc()-like. Making
> existing functions have the intended new behavior is a no-go without
> auditing all present uses, to find those few which actually may need
> physically contiguous allocations.

But you could make your proposed xvmalloc logic the implementation
behind vmalloc, as that would still be perfectly fine and safe? (ie:
existing users of vmalloc already expect non-physically contiguous
memory). You would just optimize the size < PAGE_SIZE case for that
interface?

> Having seen similar naming elsewhere, I did propose xnew() /
> xdelete() (plus array and flex-struct counterparts) as the single
> new recommended interface; didn't hear back yet. But we'd switch to
> that gradually, so intermediately there would still be a larger set
> of interfaces.
> 
> I'm not convinced we should continue to have byte-granular allocation
> functions producing physically contiguous memory. I think the page
> allocator should be used directly in such cases.
> 
> >> --- /dev/null
> >> +++ b/xen/include/xen/xvmalloc.h
> >> @@ -0,0 +1,73 @@
> >> +
> >> +#ifndef __XVMALLOC_H__
> >> +#define __XVMALLOC_H__
> >> +
> >> +#include <xen/cache.h>
> >> +#include <xen/types.h>
> >> +
> >> +/*
> >> + * Xen malloc/free-style interface for allocations possibly exceeding a page's
> >> + * worth of memory, as long as there's no need to have physically contiguous
> >> + * memory allocated.  These should be used in preference to xmalloc() et al
> >> + * whenever the size is not known to be constrained to at most a single page.
> > 
> > Even when it's known that size <= PAGE_SIZE these helpers are
> > appropriate as they would end up using xmalloc, so I think it's fine to
> > recommend them universally as long as there's no need to alloc
> > physically contiguous memory?
> > 
> > Granted there's a bit more overhead from the logic to decide between
> > using xmalloc or vmalloc &c, but that's IMO not that big of a deal in
> > order to not recommend this interface globally for non-contiguous
> > alloc.
> 
> As long as xmalloc() and vmalloc() are meant to stay around as separate
> interfaces, I wouldn't want to "forbid" their use when it's sufficiently
> clear that they would be chosen by the new function anyway. Otoh, if the
> new function became more powerful in terms of falling back to the

What do you mean by 'more powerful' here?

Thanks, Roger.


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v3 01/22] mm: introduce xvmalloc() et al and use for grant table allocations
  2021-05-03 14:54       ` Roger Pau Monné
@ 2021-05-03 15:21         ` Jan Beulich
  2021-05-03 16:39           ` Roger Pau Monné
  0 siblings, 1 reply; 40+ messages in thread
From: Jan Beulich @ 2021-05-03 15:21 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: xen-devel, Andrew Cooper, George Dunlap, Ian Jackson,
	Julien Grall, Stefano Stabellini, Wei Liu

On 03.05.2021 16:54, Roger Pau Monné wrote:
> On Mon, May 03, 2021 at 03:50:48PM +0200, Jan Beulich wrote:
>> On 03.05.2021 13:31, Roger Pau Monné wrote:
>>> On Thu, Apr 22, 2021 at 04:43:39PM +0200, Jan Beulich wrote:
>>>> All of the array allocations in grant_table_init() can exceed a page's
>>>> worth of memory, which xmalloc()-based interfaces aren't really suitable
>>>> for after boot. We also don't need any of these allocations to be
>>>> physically contiguous.. Introduce interfaces dynamically switching
>>>> between xmalloc() et al and vmalloc() et al, based on requested size,
>>>> and use them instead.
>>>>
>>>> All the wrappers in the new header get cloned mostly verbatim from
>>>> xmalloc.h, with the sole adjustment to switch unsigned long to size_t
>>>> for sizes and to unsigned int for alignments.
>>>
>>> We seem to be growing a non-trivial amount of memory allocation
>>> families of functions: xmalloc, vmalloc and now xvmalloc.
>>>
>>> I think from a consumer PoV it would make sense to only have two of
>>> those: one for allocations that require to be physically contiguous,
>>> and one for allocations that don't require it.
>>>
>>> Even then, requesting for physically contiguous allocations could be
>>> done by passing a flag to the same interface that's used for
>>> non-contiguous allocations.
>>>
>>> Maybe another option would be to expand the existing
>>> v{malloc,realloc,...} set of functions to have your proposed behaviour
>>> for xv{malloc,realloc,...}?
>>
>> All of this and some of your remarks further down have already been
>> discussed. A working group has been formed. No progress since. Yes,
>> a smaller set of interfaces may be the way to go. Controlling
>> behavior via flags, otoh, is very much not malloc()-like. Making
>> existing functions have the intended new behavior is a no-go without
>> auditing all present uses, to find those few which actually may need
>> physically contiguous allocations.
> 
> But you could make your proposed xvmalloc logic the implementation
> behind vmalloc, as that would still be perfectly fine and safe? (ie:
> existing users of vmalloc already expect non-physically contiguous
> memory). You would just optimize the size < PAGE_SIZE case for that
> interface?

Existing callers of vmalloc() may expect page alignment of the
returned address.

>>>> --- /dev/null
>>>> +++ b/xen/include/xen/xvmalloc.h
>>>> @@ -0,0 +1,73 @@
>>>> +
>>>> +#ifndef __XVMALLOC_H__
>>>> +#define __XVMALLOC_H__
>>>> +
>>>> +#include <xen/cache.h>
>>>> +#include <xen/types.h>
>>>> +
>>>> +/*
>>>> + * Xen malloc/free-style interface for allocations possibly exceeding a page's
>>>> + * worth of memory, as long as there's no need to have physically contiguous
>>>> + * memory allocated.  These should be used in preference to xmalloc() et al
>>>> + * whenever the size is not known to be constrained to at most a single page.
>>>
>>> Even when it's known that size <= PAGE_SIZE these helpers are
>>> appropriate as they would end up using xmalloc, so I think it's fine to
>>> recommend them universally as long as there's no need to alloc
>>> physically contiguous memory?
>>>
>>> Granted there's a bit more overhead from the logic to decide between
>>> using xmalloc or vmalloc &c, but that's IMO not that big of a deal in
>>> order to not recommend this interface globally for non-contiguous
>>> alloc.
>>
>> As long as xmalloc() and vmalloc() are meant to stay around as separate
>> interfaces, I wouldn't want to "forbid" their use when it's sufficiently
>> clear that they would be chosen by the new function anyway. Otoh, if the
>> new function became more powerful in terms of falling back to the
> 
> What do you mean by 'more powerful' here?

Well, right now the function is very simplistic, looking just at the size
and doing no fallback attempts at all. Linux's kvmalloc() goes a little
farther. What I see as an option is for either form of allocation to fall
back to the other form in case the first attempt fails. This would cover
- out of memory Xen heap for small allocs,
- out of VA space for large allocs.
And of course, like Linux does (or at least did at the time I looked at
their code), the choice which of the backing functions to call could also
become more sophisticated over time.
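
As a sketch of that fallback idea (explicitly not what patch 01 does
right now - vmalloc()'s page alignment also makes it a safe fallback for
small requests):

    /* Fallback sketch only; the patch currently just switches on size. */
    void *_xvmalloc(size_t size, unsigned int align)
    {
        void *ptr = NULL;

        if ( size <= PAGE_SIZE )
            ptr = _xmalloc(size, align);  /* small: prefer the Xen heap */
        if ( !ptr )
            ptr = vmalloc(size);          /* large, or Xen heap exhausted */
        if ( !ptr && size > PAGE_SIZE )
            ptr = _xmalloc(size, align);  /* VA space exhausted */

        return ptr;
    }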

Jan


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v3 05/22] x86/xstate: drop xstate_offsets[] and xstate_sizes[]
  2021-04-22 14:45 ` [PATCH v3 05/22] x86/xstate: drop xstate_offsets[] and xstate_sizes[] Jan Beulich
@ 2021-05-03 16:10   ` Andrew Cooper
  2021-05-04  7:57     ` Jan Beulich
  0 siblings, 1 reply; 40+ messages in thread
From: Andrew Cooper @ 2021-05-03 16:10 UTC (permalink / raw)
  To: Jan Beulich, xen-devel; +Cc: George Dunlap, Wei Liu, Roger Pau Monné

On 22/04/2021 15:45, Jan Beulich wrote:
> They're redundant with respective fields from the raw CPUID policy; no
> need to keep two copies of the same data.

So before I read this patch of yours, I had a separate cleanup patch
turning the two arrays into static const.

> This also breaks
> recalculate_xstate()'s dependency on xstate_init(),

It doesn't, because you've retained the reference to xstate_align, which
is calculated in xstate_init().  I've posted "[PATCH 4/5] x86/cpuid:
Simplify recalculate_xstate()" which goes rather further.

xstate_align, and xstate_xfd as you've got later in the series, don't
need to be variables.  They're constants, just like the offset/size
information, because they're all a description of the XSAVE ISA
instruction behaviour.

We never turn on states we don't understand, which means we don't
actually need to refer to any component subleaf, other than to cross-check.

I'm still on the fence as to whether it is better to compile in the
constants, or to just use the raw policy.  Absolutely nothing good will
come of the constants changing, and one of my backup plans for dealing
with the size of cpuid_policy if it becomes a problem was to not store
these leaves, and generate them dynamically on request.


> allowing host CPUID
> policy calculation to be moved together with that of the raw one (which
> a subsequent change will require anyway).

While breaking up the host/raw calculations from the rest, we really
need to group the MSR policy calculations with their CPUID counterparts.

~Andrew



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v3 01/22] mm: introduce xvmalloc() et al and use for grant table allocations
  2021-05-03 15:21         ` Jan Beulich
@ 2021-05-03 16:39           ` Roger Pau Monné
  0 siblings, 0 replies; 40+ messages in thread
From: Roger Pau Monné @ 2021-05-03 16:39 UTC (permalink / raw)
  To: Jan Beulich
  Cc: xen-devel, Andrew Cooper, George Dunlap, Ian Jackson,
	Julien Grall, Stefano Stabellini, Wei Liu

On Mon, May 03, 2021 at 05:21:37PM +0200, Jan Beulich wrote:
> On 03.05.2021 16:54, Roger Pau Monné wrote:
> > On Mon, May 03, 2021 at 03:50:48PM +0200, Jan Beulich wrote:
> >> On 03.05.2021 13:31, Roger Pau Monné wrote:
> >>> On Thu, Apr 22, 2021 at 04:43:39PM +0200, Jan Beulich wrote:
> >>>> All of the array allocations in grant_table_init() can exceed a page's
> >>>> worth of memory, which xmalloc()-based interfaces aren't really suitable
> >>>> for after boot. We also don't need any of these allocations to be
> >>>> physically contiguous.. Introduce interfaces dynamically switching
> >>>> between xmalloc() et al and vmalloc() et al, based on requested size,
> >>>> and use them instead.
> >>>>
> >>>> All the wrappers in the new header get cloned mostly verbatim from
> >>>> xmalloc.h, with the sole adjustment to switch unsigned long to size_t
> >>>> for sizes and to unsigned int for alignments.
> >>>
> >>> We seem to be growing a non-trivial amount of memory allocation
> >>> families of functions: xmalloc, vmalloc and now xvmalloc.
> >>>
> >>> I think from a consumer PoV it would make sense to only have two of
> >>> those: one for allocations that require to be physically contiguous,
> >>> and one for allocations that don't require it.
> >>>
> >>> Even then, requesting for physically contiguous allocations could be
> >>> done by passing a flag to the same interface that's used for
> >>> non-contiguous allocations.
> >>>
> >>> Maybe another option would be to expand the existing
> >>> v{malloc,realloc,...} set of functions to have your proposed behaviour
> >>> for xv{malloc,realloc,...}?
> >>
> >> All of this and some of your remarks further down have already been
> >> discussed. A working group has been formed. No progress since. Yes,
> >> a smaller set of interfaces may be the way to go. Controlling
> >> behavior via flags, otoh, is very much not malloc()-like. Making
> >> existing functions have the intended new behavior is a no-go without
> >> auditing all present uses, to find those few which actually may need
> >> physically contiguous allocations.
> > 
> > But you could make your proposed xvmalloc logic the implementation
> > behind vmalloc, as that would still be perfectly fine and safe? (ie:
> > existing users of vmalloc already expect non-physically contiguous
> > memory). You would just optimize the size < PAGE_SIZE case for that
> > interface?
> 
> Existing callers of vmalloc() may expect page alignment of the
> returned address.

Right - I just looked, and the interface is also different from
x{v}malloc, so you would have to fix up the callers.

> >>>> --- /dev/null
> >>>> +++ b/xen/include/xen/xvmalloc.h
> >>>> @@ -0,0 +1,73 @@
> >>>> +
> >>>> +#ifndef __XVMALLOC_H__
> >>>> +#define __XVMALLOC_H__
> >>>> +
> >>>> +#include <xen/cache.h>
> >>>> +#include <xen/types.h>
> >>>> +
> >>>> +/*
> >>>> + * Xen malloc/free-style interface for allocations possibly exceeding a page's
> >>>> + * worth of memory, as long as there's no need to have physically contiguous
> >>>> + * memory allocated.  These should be used in preference to xmalloc() et al
> >>>> + * whenever the size is not known to be constrained to at most a single page.
> >>>
> >>> Even when it's known that size <= PAGE_SIZE these helpers are
> >>> appropriate as they would end up using xmalloc, so I think it's fine to
> >>> recommend them universally as long as there's no need to alloc
> >>> physically contiguous memory?
> >>>
> >>> Granted there's a bit more overhead from the logic to decide between
> >>> using xmalloc or vmalloc &c, but that's IMO not that big of a deal in
> >>> order to not recommend this interface globally for non-contiguous
> >>> alloc.
> >>
> >> As long as xmalloc() and vmalloc() are meant to stay around as separate
> >> interfaces, I wouldn't want to "forbid" their use when it's sufficiently
> >> clear that they would be chosen by the new function anyway. Otoh, if the
> >> new function became more powerful in terms of falling back to the
> > 
> > What do you mean by 'more powerful' here?
> 
> Well, right now the function is very simplistic, looking just at the size
> and doing no fallback attempts at all. Linux's kvmalloc() goes a little
> farther. What I see as an option is for either form of allocation to fall
> back to the other form in case the first attempt fails. This would cover
> - out of memory Xen heap for small allocs,
> - out of VA space for large allocs.
> And of course, like Linux does (or at least did at the time I looked at
> their code), the choice which of the backing functions to call could also
> become more sophisticated over time.

I'm not opposed to any of this, but even your proposed code right now
seems no worse than using either vmalloc or xmalloc, as it's only a
higher level wrapper around those.

What I would prefer is to propose to use function foo for all
allocations that don't require contiguous physical memory, and
function bar for those that do require contiguous physical memory.
It's IMO awkward from a developer PoV to have to select an
allocation function based on the size to be allocated.

I wouldn't mind if you wanted to name this more generic wrapper simply
malloc().

Thanks, Roger.


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v3 05/22] x86/xstate: drop xstate_offsets[] and xstate_sizes[]
  2021-05-03 16:10   ` Andrew Cooper
@ 2021-05-04  7:57     ` Jan Beulich
  0 siblings, 0 replies; 40+ messages in thread
From: Jan Beulich @ 2021-05-04  7:57 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: George Dunlap, Wei Liu, Roger Pau Monné, xen-devel

On 03.05.2021 18:10, Andrew Cooper wrote:
> On 22/04/2021 15:45, Jan Beulich wrote:
>> They're redundant with respective fields from the raw CPUID policy; no
>> need to keep two copies of the same data.
> 
> So before I read this patch of yours, I had a separate cleanup patch
> turning the two arrays into static const.
> 
>> This also breaks
>> recalculate_xstate()'s dependency on xstate_init(),
> 
> It doesn't, because you've retained the reference to xstate_align, which
> is calculated in xstate_init().

Good point - s/breaks/eliminates some of/.

>  I've posted "[PATCH 4/5] x86/cpuid:
> Simplify recalculate_xstate()" which goes rather further.

I'll see about taking a look soonish.

> xstate_align, and xstate_xfd as you've got later in the series, don't
> need to be variables.  They're constants, just like the offset/size
> information, because they're all a description of the XSAVE ISA
> instruction behaviour.

Hmm, I think there are multiple views possible - for xfd_mask even more
than for xstate_align: XFD is, according to my understanding of the
spec, not a prereq feature to AMX. IOW AMX would function fine without
XFD; it's just that lazy allocation of state save space then wouldn't be
possible. And I also can't, in principle, see any reason why largish
components like the AVX512 ones couldn't become XFD-sensitive (in
hardware, we of course can't mimic this in software).

(I could take as proof sde reporting AMX but not XFD with -spr, but I
rather suspect this to be an oversight in their CPUID data. I've posted
a respective question in their forum.)

If there really was a strict static relationship, I'm having trouble
seeing why the information would need expressing in CPUID at all. It
would at least feel like over-engineering then.

> We never turn on states we don't understand, which means we don't
> actually need to refer to any component subleaf, other than to cross-check.
> 
> I'm still on the fence as to whether it is better to compile in the
> constants, or to just use the raw policy.  Absolutely nothing good will
> come of the constants changing, and one of my backup plans for dealing
> with the size of cpuid_policy if it becomes a problem was to not store
> these leaves, and generate them dynamically on request.

Actually it is my understanding that the reason the offsets are
expressed via CPUID is that originally it was meant for them to be
able to vary between implementations (see in particular the placement
of the LWP component, which has resulted in a curious 128-byte gap
ahead of the MPX components) - until it was realized what implications
this would have for migration.

>> allowing host CPUID
>> policy calculation to be moved together with that of the raw one (which
>> a subsequent change will require anyway).
> 
> While breaking up the host/raw calculations from the rest, we really
> need to group the MSR policy calculations with their CPUID counterparts.

But that's orthogonal to the change here, i.e. if done for this series
at all it would be the subject of a separate patch. Plus I have to admit
I'm not sure I see what your plan here would be - cpuid.c and msr.c so
far don't cross-reference one another. And I thought this separation
was intentional.

Jan


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v3 02/22] x86/xstate: use xvzalloc() for save area allocation
  2021-04-22 14:44 ` [PATCH v3 02/22] x86/xstate: use xvzalloc() for save area allocation Jan Beulich
@ 2021-05-05 13:29   ` Roger Pau Monné
  0 siblings, 0 replies; 40+ messages in thread
From: Roger Pau Monné @ 2021-05-05 13:29 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Andrew Cooper, George Dunlap, Wei Liu

On Thu, Apr 22, 2021 at 04:44:36PM +0200, Jan Beulich wrote:
> This is in preparation for the area size exceeding a page's worth of
> space, as will happen with AMX as well as Architectural LBR.
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Acked-by: Roger Pau Monné <roger.pau@citrix.com>

Even if the naming of the new helpers (_xvzalloc) is changed.

Thanks, Roger.


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v3 03/22] x86/xstate: re-size save area when CPUID policy changes
  2021-05-03 14:22     ` Jan Beulich
@ 2021-05-11 16:41       ` Andrew Cooper
  2021-05-17  7:33         ` Jan Beulich
  0 siblings, 1 reply; 40+ messages in thread
From: Andrew Cooper @ 2021-05-11 16:41 UTC (permalink / raw)
  To: Jan Beulich
  Cc: George Dunlap, Ian Jackson, Julien Grall, Stefano Stabellini,
	Wei Liu, Roger Pau Monné,
	xen-devel

On 03/05/2021 15:22, Jan Beulich wrote:
>> Another consequence is that we need to rethink our hypercall behaviour. 
>> There is no such thing as supervisor states in an uncompressed XSAVE
>> image, which means we can't continue with that being the ABI.
> I don't think the hypercall input / output blob needs to follow any
> specific hardware layout.

Currently, the blob is { xcr0, xcr0_accum, uncompressed image }.

As we haven't supported any compressed states yet, we are at liberty to
create a forward-compatible change by logically s/xcr0/xstate/ and
permitting a compressed image.

Irritatingly, we have xcr0=0 as a permitted state and out in the field,
for "no xsave state".  This contributes a substantial quantity of
complexity in our xstate logic, and invalidates the easy fix I had for
not letting the HVM initpath explode.

The first task is to untangle the non-architectural xcr0=0 case, and to
support compressed images.  Size parsing needs to be split into two, as
for compressed images, we need to consume XSTATE_BV and XCOMP_BV to
cross-check the size.
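
I.e. roughly (sketch only - xstate_size() and xstate_aligned() are
placeholders for whichever CPUID policy accessors end up being used):

    /* Expected size of a compacted image described by XCOMP_BV. */
    static unsigned int xsave_compressed_size(uint64_t xcomp_bv)
    {
        unsigned int size = XSTATE_AREA_MIN_SIZE; /* 512 + 64 */
        unsigned int i;

        /*
         * Bits 0/1 (x87, SSE) live in the legacy area; bit 63 is the
         * compaction flag, not a component.
         */
        for ( i = 2; i < 63; ++i )
        {
            if ( !(xcomp_bv & (1ul << i)) )
                continue;
            if ( xstate_aligned(i) )
                size = ROUNDUP(size, 64);
            size += xstate_size(i);
        }

        return size;
    }

The incoming blob's size would then be cross-checked against this, with
XSTATE_BV additionally required to be a subset of XCOMP_BV.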

I think we also want a rule that Xen will always send compressed if it
is using XSAVES (/XSAVEC in the interim?)  We do not want to be working
with uncompressed images at all, now that MPX is a reasonably sized hole
in the middle.

Cleaning this up will then unblock v2 of the existing xstate cleanup
series I posted.

>> In terms of actual context switching, we want to be using XSAVES/XRSTORS
>> whenever it is available, even if we're not using supervisor states. 
>> XSAVES has both the inuse and modified optimisations, without the broken
>> consequence of XSAVEOPT (which is firmly in the "don't ever use this"
>> bucket now).
> The XSAVEOPT anomaly is affecting user mode only, isn't it? Or are
> you talking of something I have forgotten about?

It's not safe to use at all in L1 xen, because the tracking leaks
between non-root contexts.  I can't remember if there are further
problems for an L0 xen, but I have a nagging feeling that there are.

~Andrew



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v3 03/22] x86/xstate: re-size save area when CPUID policy changes
  2021-05-11 16:41       ` Andrew Cooper
@ 2021-05-17  7:33         ` Jan Beulich
  0 siblings, 0 replies; 40+ messages in thread
From: Jan Beulich @ 2021-05-17  7:33 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: George Dunlap, Ian Jackson, Julien Grall, Stefano Stabellini,
	Wei Liu, Roger Pau Monné,
	xen-devel

On 11.05.2021 18:41, Andrew Cooper wrote:
> On 03/05/2021 15:22, Jan Beulich wrote:
>>> Another consequence is that we need to rethink our hypercall behaviour. 
>>> There is no such thing as supervisor states in an uncompressed XSAVE
>>> image, which means we can't continue with that being the ABI.
>> I don't think the hypercall input / output blob needs to follow any
>> specific hardware layout.
> 
> Currently, the blob is { xcr0, xcr0_accum, uncompressed image }.
> 
> As we haven't supported any compressed states yet, we are at liberty to
> create a forward compatible change by logically s/xcr0/xstate/ and
> permitting an uncompressed image.
> 
> Irritatingly, we have xcr0=0 as a permitted state and out in the field,
> for "no xsave state".  This contributes a substantial quantity of
> complexity in our xstate logic, and invalidates the easy fix I had for
> not letting the HVM initpath explode.
> 
> The first task is to untangle the non-architectural xcr0=0 case, and to
> support compressed images.  Size parsing needs to be split into two, as
> for compressed images, we need to consume XSTATE_BV and XCOMP_BV to
> cross-check the size.

Not sure about the need to eliminate the xcr0=0 (or xstates=0) case.
Which isn't to say I'm opposed if you want to do so and it's not
overly intrusive.

> I think we also want a rule that Xen will always send compressed if it
> is using XSAVES (/XSAVEC in the interim?)

If this is sufficiently neutral to tool stack code, why not (albeit
I don't think there needs to be a "rule" - Xen should be free to
provide what it deems best, with consumers in the position to easily
recognize the format; similarly Xen should be consuming whatever it
gets handed, as long as that's valid state). Luckily the layout is
visible just through tool-stack-only interfaces.
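
(Recognising the format is indeed trivial - bit 63 of XCOMP_BV is the
compaction flag; as a sketch, along the lines of what
xsave_area_compressed() already checks, if memory serves:)

    static bool is_compacted(const struct xsave_struct *xsave)
    {
        return xsave->xsave_hdr.xcomp_bv & XSTATE_COMPACTION_ENABLED;
    }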

>  We do not want to be working
> with uncompressed images at all, now that MPX is a reasonable sized hole
> in the middle.

They're together no larger (128 bytes) than the LWP hole right ahead
of them (at 0x340). I agree avoiding the uncompressed format is worthwhile,
but perhaps quite a bit more so on systems where higher-numbered components
follow even bigger unavailable ones.

Jan


^ permalink raw reply	[flat|nested] 40+ messages in thread

end of thread, other threads:[~2021-05-17  7:33 UTC | newest]

Thread overview: 40+ messages
2021-04-22 14:38 [PATCH v3 00/22] xvmalloc() / x86 xstate area / x86 CPUID / AMX+XFD Jan Beulich
2021-04-22 14:43 ` [PATCH v3 01/22] mm: introduce xvmalloc() et al and use for grant table allocations Jan Beulich
2021-05-03 11:31   ` Roger Pau Monné
2021-05-03 13:50     ` Jan Beulich
2021-05-03 14:54       ` Roger Pau Monné
2021-05-03 15:21         ` Jan Beulich
2021-05-03 16:39           ` Roger Pau Monné
2021-04-22 14:44 ` [PATCH v3 02/22] x86/xstate: use xvzalloc() for save area allocation Jan Beulich
2021-05-05 13:29   ` Roger Pau Monné
2021-04-22 14:44 ` [PATCH v3 03/22] x86/xstate: re-size save area when CPUID policy changes Jan Beulich
2021-05-03 13:57   ` Andrew Cooper
2021-05-03 14:22     ` Jan Beulich
2021-05-11 16:41       ` Andrew Cooper
2021-05-17  7:33         ` Jan Beulich
2021-04-22 14:45 ` [PATCH v3 04/22] x86/xstate: re-use valid_xcr0() for boot-time checks Jan Beulich
2021-05-03 11:53   ` Andrew Cooper
2021-04-22 14:45 ` [PATCH v3 05/22] x86/xstate: drop xstate_offsets[] and xstate_sizes[] Jan Beulich
2021-05-03 16:10   ` Andrew Cooper
2021-05-04  7:57     ` Jan Beulich
2021-04-22 14:46 ` [PATCH v3 06/22] x86/xstate: replace xsave_cntxt_size and drop XCNTXT_MASK Jan Beulich
2021-04-22 14:47 ` [PATCH v3 07/22] x86/xstate: avoid accounting for unsupported components Jan Beulich
2021-04-22 14:47 ` [PATCH v3 08/22] x86: use xvmalloc() for extended context buffer allocations Jan Beulich
2021-04-22 14:48 ` [PATCH v3 09/22] x86/xstate: enable AMX components Jan Beulich
2021-04-22 14:50 ` [PATCH v3 10/22] x86/CPUID: adjust extended leaves out of range clearing Jan Beulich
2021-04-22 14:50 ` [PATCH v3 11/22] x86/CPUID: move bounding of max_{,sub}leaf fields to library code Jan Beulich
2021-04-22 14:51 ` [PATCH v3 12/22] x86/CPUID: enable AMX leaves Jan Beulich
2021-04-22 14:52 ` [PATCH v3 13/22] x86: XFD enabling Jan Beulich
2021-04-22 14:53 ` [PATCH v3 14/22] x86emul: introduce X86EMUL_FPU_{tilecfg,tile} Jan Beulich
2021-04-22 14:53 ` [PATCH v3 15/22] x86emul: support TILERELEASE Jan Beulich
2021-04-22 14:53 ` [PATCH v3 16/22] x86: introduce struct for TILECFG register Jan Beulich
2021-04-22 14:54 ` [PATCH v3 17/22] x86emul: support {LD,ST}TILECFG Jan Beulich
2021-04-22 14:55 ` [PATCH v3 18/22] x86emul: support TILEZERO Jan Beulich
2021-04-22 14:55 ` [PATCH v3 19/22] x86emul: support TILELOADD{,T1} and TILESTORE Jan Beulich
2021-04-22 15:06   ` Jan Beulich
2021-04-22 15:11     ` Jan Beulich
2021-04-26  7:12       ` Paul Durrant
2021-04-29  9:40         ` Jan Beulich
2021-04-22 14:56 ` [PATCH v3 20/22] x86emul: support tile multiplication insns Jan Beulich
2021-04-22 14:57 ` [PATCH v3 21/22] x86emul: test AMX insns Jan Beulich
2021-04-22 14:57 ` [PATCH v3 22/22] x86: permit guests to use AMX and XFD Jan Beulich
