From: Jan Beulich <jbeulich@suse.com>
To: "Andrew Cooper" <andrew.cooper3@citrix.com>,
	"Roger Pau Monné" <roger.pau@citrix.com>
Cc: "xen-devel@lists.xenproject.org" <xen-devel@lists.xenproject.org>,
	Wei Liu <wl@xen.org>
Subject: Ping: [PATCH] x86emul: de-duplicate scatters to the same linear address
Date: Fri, 5 Feb 2021 11:10:21 +0100
Message-ID: <a9739f4a-cfc6-673c-f2a7-a21723eda199@suse.com>
In-Reply-To: <6064996d-943f-1be3-9bfd-e872149da2a1@suse.com>

On 10.11.2020 14:26, Jan Beulich wrote:
> The SDM specifically allows earlier writes to fully overlapping ranges
> to be dropped. If a guest issued such overlapping writes,
> hvmemul_phys_mmio_access() would crash it when varying data is written
> to the same address. Detect overlaps early, as doing so in
> hvmemul_{linear,phys}_mmio_access() would be quite a bit more
> difficult.
> 
> Note that due to cache slot use being linear address based, there's no
> similar issue with multiple writes to the same physical address (mapped
> through different linear addresses).
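> 
> For illustration only (not part of the patch): with AVX-512 intrinsics,
> a guest might issue a scatter in which two enabled elements target the
> same linear address; the SDM permits the earlier element's write to be
> dropped in favor of the later one:
> 
>     #include <immintrin.h>
>     #include <stdint.h>
> 
>     int main(void) /* build with -mavx512f, run on AVX-512F hardware */
>     {
>         uint32_t buf[16] = { 0 };
>         /* Elements 0 and 1 both target buf[0]; hardware may drop the
>          * element-0 write and perform only element 1's. */
>         __m512i idx = _mm512_setr_epi32(0, 0, 2, 3, 4, 5, 6, 7, 8, 9,
>                                         10, 11, 12, 13, 14, 15);
>         __m512i val = _mm512_setr_epi32(0x11111111, 0x22222222, 2, 3,
>                                         4, 5, 6, 7, 8, 9, 10, 11, 12,
>                                         13, 14, 15);
> 
>         _mm512_i32scatter_epi32(buf, idx, val, 4);
>         return buf[0] != 0x22222222; /* the later write survives */
>     }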
> 
> Since this requires an adjustment to the EVEX Disp8 scaling test,
> correct a comment there at the same time.
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> ---
> TBD: The SDM isn't entirely clear about the faulting behavior in this
>      case: If a fault would need delivering on the earlier slot despite
>      the write getting squashed, we'd have to call ops->write() with
>      size set to zero for the earlier write(s). However,
>      hvm/emulate.c's handling of zero-byte accesses extends only to the
>      virtual-to-linear address conversions (and raising of the faults
>      involved), so in order to also observe #PF, changes to that logic
>      would be needed as well. Can we live with a possible misbehavior
>      here?
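> 
>      Purely as a sketch of that alternative (hypothetical, not what
>      this patch implements), the squashed element would still be
>      pushed through the access checks via a zero-byte write:
> 
>          /* Hypothetical: deliver a pending fault for a squashed
>           * element without transferring any data. */
>          rc = ops->write(ea.mem.seg, offs, NULL, 0, ctxt);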
> 
> --- a/tools/tests/x86_emulator/evex-disp8.c
> +++ b/tools/tests/x86_emulator/evex-disp8.c
> @@ -647,8 +647,8 @@ static const uint16_t imm0f[16] = {
>  static struct x86_emulate_ops emulops;
>  
>  /*
> - * Access tracking (by granular) is used on the first 64 bytes of address
> - * space. Instructions get encode with a raw Disp8 value of 1, which then
> + * Access tracking (byte granular) is used on the first 64 bytes of address
> + * space. Instructions get encoded with a raw Disp8 value of 1, which then
>   * gets scaled accordingly. Hence accesses below the address <scaling factor>
>   * as well as at or above 2 * <scaling factor> are indications of bugs. To
>   * aid diagnosis / debugging, track all accesses below 3 * <scaling factor>.
> @@ -804,6 +804,31 @@ static void test_one(const struct test *
>  
>      asm volatile ( "kxnorw %k1, %k1, %k1" );
>      asm volatile ( "vxorps %xmm4, %xmm4, %xmm4" );
> +    if ( sg && (test->opc | 3) == 0xa3 )
> +    {
> +        /*
> +         * Non-prefetch scatters need handling specially, due to the
> +         * overlapped write elimination done by the emulator. The order of
> +         * indexes chosen is somewhat arbitrary, within the constraints
> +         * imposed by the various different uses.
> +         */
> +        static const struct {
> +            int32_t _[16];
> +        } off32 = {{ 1, 0, 2, 3, 7, 6, 5, 4, 8, 10, 12, 14, 9, 11, 13, 15 }};
> +
> +        if ( test->opc & 1 )
> +        {
> +            asm volatile ( "vpmovsxdq %0, %%zmm4" :: "m" (off32) );
> +            vsz >>= !evex.w;
> +        }
> +        else
> +            asm volatile ( "vmovdqu32 %0, %%zmm4" :: "m" (off32) );
> +
> +        /* Scale by element size. */
> +        instr[6] |= (evex.w | 2) << 6;
> +
> +        sg = false;
> +    }
>  
>      ctxt->regs->eip = (unsigned long)&instr[0];
>      ctxt->regs->edx = 0;
> --- a/xen/arch/x86/x86_emulate/x86_emulate.c
> +++ b/xen/arch/x86/x86_emulate/x86_emulate.c
> @@ -10032,25 +10032,47 @@ x86_emulate(
>  
>          for ( i = 0; op_mask; ++i )
>          {
> -            long idx = b & 1 ? index.qw[i] : index.dw[i];
> +            long idx = (b & 1 ? index.qw[i]
> +                              : index.dw[i]) * (1 << state->sib_scale);
> +            unsigned long offs = truncate_ea(ea.mem.off + idx);
> +            unsigned int j;
>  
>              if ( !(op_mask & (1 << i)) )
>                  continue;
>  
> -            rc = ops->write(ea.mem.seg,
> -                            truncate_ea(ea.mem.off +
> -                                        idx * (1 << state->sib_scale)),
> -                            (void *)mmvalp + i * op_bytes, op_bytes, ctxt);
> -            if ( rc != X86EMUL_OKAY )
> +            /*
> +             * hvmemul_linear_mmio_access() will find a cache slot based on
> +             * linear address. hvmemul_phys_mmio_access() will crash the
> +             * domain if observing varying data getting written to the same
> +             * address within a cache slot. Utilize that squashing earlier
> +             * writes to fully overlapping addresses is permitted by the spec.
> +             */
> +            for ( j = i + 1; j < n; ++j )
>              {
> -                /* See comment in gather emulation. */
> -                if ( rc != X86EMUL_EXCEPTION && done )
> -                    rc = X86EMUL_RETRY;
> -                break;
> +                long idx2 = (b & 1 ? index.qw[j]
> +                                   : index.dw[j]) * (1 << state->sib_scale);
> +
> +                if ( (op_mask & (1 << j)) &&
> +                     truncate_ea(ea.mem.off + idx2) == offs )
> +                    break;
> +            }
> +
> +            if ( j >= n )
> +            {
> +                rc = ops->write(ea.mem.seg, offs,
> +                                (void *)mmvalp + i * op_bytes, op_bytes, ctxt);
> +                if ( rc != X86EMUL_OKAY )
> +                {
> +                    /* See comment in gather emulation. */
> +                    if ( rc != X86EMUL_EXCEPTION && done )
> +                        rc = X86EMUL_RETRY;
> +                    break;
> +                }
> +
> +                done = true;
>              }
>  
>              op_mask &= ~(1 << i);
> -            done = true;
>  
>  #ifdef __XEN__
>              if ( op_mask && local_events_need_delivery() )
> 
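
For readers without the emulator context, a minimal standalone sketch of
the elimination logic the second hunk adds (hypothetical names,
simplified flat-memory model; only the last enabled element targeting a
given address performs its write):

    #include <stdint.h>
    #include <stdio.h>

    /* Simplified model of the j-scan above: skip element i's write if
     * a later enabled element j targets the same address. */
    static void scatter(uint32_t *mem, const long *idx,
                        const uint32_t *val, unsigned int n,
                        unsigned int op_mask)
    {
        for ( unsigned int i = 0; i < n; ++i )
        {
            unsigned int j;

            if ( !(op_mask & (1u << i)) )
                continue;

            for ( j = i + 1; j < n; ++j )
                if ( (op_mask & (1u << j)) && idx[j] == idx[i] )
                    break;

            if ( j >= n ) /* no later overlap - this write survives */
                mem[idx[i]] = val[i];
        }
    }

    int main(void)
    {
        uint32_t mem[3] = { 0 };
        static const long idx[4] = { 0, 0, 1, 2 };
        static const uint32_t val[4] = { 0x11, 0x22, 0x33, 0x44 };

        scatter(mem, idx, val, 4, 0xf);
        printf("%x %x %x\n", mem[0], mem[1], mem[2]); /* 22 33 44 */
        return 0;
    }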



Thread overview: 6+ messages
2020-11-10 13:26 [PATCH] x86emul: de-duplicate scatters to the same linear address Jan Beulich
2021-02-05 10:10 ` Jan Beulich [this message]
2021-02-05 10:41 ` Andrew Cooper
2021-02-05 11:28   ` Jan Beulich
2021-02-17  8:32     ` Ping: " Jan Beulich
2021-04-01  8:05       ` Ping²: " Jan Beulich
