Re: [PATCH 1/6] x86/vmx: Fix handing of MSR_DEBUGCTL on VMExit

From: Andrew Cooper <andrew.cooper3@citrix.com>
To: Jan Beulich <JBeulich@suse.com>
Cc: Kevin Tian <kevin.tian@intel.com>,
	Xen-devel <xen-devel@lists.xen.org>,
	Wei Liu <wei.liu2@citrix.com>,
	Jun Nakajima <jun.nakajima@intel.com>,
	Roger Pau Monne <roger.pau@citrix.com>
Subject: Re: [PATCH 1/6] x86/vmx: Fix handing of MSR_DEBUGCTL on VMExit
Date: Tue, 29 May 2018 19:08:27 +0100
Message-ID: <d0cc97da-609e-6ef9-2479-d72eb70e6e5d@citrix.com>
In-Reply-To: <5B0D2C6502000078001C685A@prv1-mh.provo.novell.com>

On 29/05/18 11:33, Jan Beulich wrote:
>>>> On 28.05.18 at 16:27, <andrew.cooper3@citrix.com> wrote:
>> Currently, whenever the guest writes a nonzero value to MSR_DEBUGCTL, Xen
>> updates a host MSR load list entry with the current hardware value of
>> MSR_DEBUGCTL.  This is wrong.
> "This is wrong" goes too far for my taste: It is not very efficient to do it that
> way, but it's still correct. Unless, of course, the zeroing of the register
> happens after the processing of the MSR load list (which I doubt it does).

It is functionally broken.  Restoration of Xen's debugging setting must
happen from the first vmexit, not the first vmexit after the guest plays
with MSR_DEBUGCTL.

With the current behaviour, Xen loses its MSR_DEBUGCTL setting on any
pcpu where an HVM guest has been scheduled, and then feeds the current
value (0) into the host load list, even when it was attempting to set a
non-zero value.
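
To make the failure mode concrete, here is a toy model of a single pcpu.
It is only a sketch, not Xen code: struct pcpu_model and the model_*/old_*
helpers are invented for illustration, and the "fixed" case simply assumes
that Xen's intended value is restored on every vmexit, without saying which
mechanism does the restoring.

```c
#include <stdbool.h>
#include <stdint.h>

#define DEBUGCTL_LBR (1ULL << 0)   /* bit 0 of MSR_DEBUGCTL */

/* Hypothetical model of one pcpu; not Xen's real data structures. */
struct pcpu_model {
    uint64_t hw_debugctl;       /* value currently held in MSR_DEBUGCTL */
    bool     host_entry_valid;  /* is there a host MSR load list entry? */
    uint64_t host_entry_val;    /* value that entry would load */
};

/* VMExit: hardware zeroes MSR_DEBUGCTL, then processes the host load list. */
static void model_vmexit(struct pcpu_model *p)
{
    p->hw_debugctl = 0;
    if (p->host_entry_valid)
        p->hw_debugctl = p->host_entry_val;
}

/*
 * Old behaviour: on a nonzero guest write, snapshot whatever is currently
 * in hardware into the host load list.  By this point a vmexit has already
 * zeroed the register, so the snapshot is 0.
 */
static void old_guest_write_debugctl(struct pcpu_model *p)
{
    p->host_entry_val = p->hw_debugctl;
    p->host_entry_valid = true;
}

/* Xen wanted LBR, but ends up with 0 after the guest touches the MSR. */
static uint64_t old_behaviour_result(void)
{
    struct pcpu_model p = { .hw_debugctl = DEBUGCTL_LBR };

    model_vmexit(&p);              /* first vmexit: setting already lost */
    old_guest_write_debugctl(&p);  /* records the current value, i.e. 0 */
    model_vmexit(&p);              /* "restores" 0 */

    return p.hw_debugctl;
}

/* Assumed fix: the intended value is restored from the very first vmexit. */
static uint64_t fixed_behaviour_result(void)
{
    struct pcpu_model p = {
        .hw_debugctl      = DEBUGCTL_LBR,
        .host_entry_valid = true,
        .host_entry_val   = DEBUGCTL_LBR,
    };

    model_vmexit(&p);

    return p.hw_debugctl;
}
```

In this model old_behaviour_result() yields 0 while fixed_behaviour_result()
yields DEBUGCTL_LBR, which is the difference being argued about.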

>
>> Initially, I tried to have a common xen_msr_debugctl variable, but
>> rip-relative addresses don't resolve correctly in alternative blocks.
>> LBR-only has been fine for ages, and I don't see that changing any time 
>> soon.
> The chosen solution is certainly fine, but the issue could have been
> avoided by doing the load from memory ahead of the alternative block
> (accepting that it also happens when the value isn't actually needed).
>
> Another option would be to invert the sense of the feature flag,
> patching NOPs over the register setup plus WRMSR.

I considered both, but until it is necessary, there is little point.

>
>> @@ -1764,17 +1765,6 @@ void do_device_not_available(struct cpu_user_regs *regs)
>>      return;
>>  }
>>  
>> -static void ler_enable(void)
>> -{
>> -    u64 debugctl;
>> -
>> -    if ( !this_cpu(ler_msr) )
>> -        return;
>> -
>> -    rdmsrl(MSR_IA32_DEBUGCTLMSR, debugctl);
>> -    wrmsrl(MSR_IA32_DEBUGCTLMSR, debugctl | IA32_DEBUGCTLMSR_LBR);
>> -}
>> -
>>  void do_debug(struct cpu_user_regs *regs)
>>  {
>>      unsigned long dr6;
>> @@ -1870,13 +1860,13 @@ void do_debug(struct cpu_user_regs *regs)
>>      v->arch.debugreg[6] |= (dr6 & ~X86_DR6_DEFAULT);
>>      v->arch.debugreg[6] &= (dr6 | ~X86_DR6_DEFAULT);
>>  
>> -    ler_enable();
>>      pv_inject_hw_exception(TRAP_debug, X86_EVENT_NO_EC);
>> -    return;
>>  
>>   out:
>> -    ler_enable();
>> -    return;
>> +
>> +    /* #DB automatically disabled LBR.  Reinstate it if debugging Xen. */
>> +    if ( cpu_has_xen_lbr )
>> +        wrmsrl(MSR_IA32_DEBUGCTLMSR, IA32_DEBUGCTLMSR_LBR);
> While I can see that we don't currently need anything more than this one
> bit, it still doesn't feel overly well to not do a read-modify-write cycle here.

We should never be using a RMW cycle.  All that risks doing is
accumulating unexpected debugging controls.

If/when it becomes a variable, the correct code here is:

    if ( xen_debugctl_val & IA32_DEBUGCTLMSR_LBR )
        wrmsrl(MSR_IA32_DEBUGCTLMSR, xen_debugctl_val);

(except that since writing this patch, I've found that BTF is also
cleared on AMD hardware, so that probably wants to be taken into account).
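
A small sketch of why the plain write is preferable to RMW.  This is a
model only: the register is passed around as an ordinary value rather than
accessed with rdmsrl()/wrmsrl(), and the helper names are made up for the
example.

```c
#include <stdint.h>

#define DEBUGCTL_LBR (1ULL << 0)   /* bit 0 of MSR_DEBUGCTL */
#define DEBUGCTL_BTF (1ULL << 1)   /* bit 1 of MSR_DEBUGCTL */

/*
 * Read-modify-write: whatever happens to be in the register survives,
 * including controls Xen never asked for.
 */
static uint64_t reinstate_rmw(uint64_t current)
{
    return current | DEBUGCTL_LBR;
}

/*
 * Plain write of Xen's intended value: the result is exactly xen_val,
 * and nothing is written at all if Xen isn't using LBR.
 */
static uint64_t reinstate_direct(uint64_t current, uint64_t xen_val)
{
    if (xen_val & DEBUGCTL_LBR)
        return xen_val;

    return current;
}
```

With a stray BTF left in the register, reinstate_rmw(DEBUGCTL_BTF) returns
LBR | BTF, whereas reinstate_direct(DEBUGCTL_BTF, DEBUGCTL_LBR) returns just
LBR.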

> In any event, rather than moving the write further towards the end of
> the function, could I ask you to move it further up, so that in the (unlikely)
> event of do_debug() itself triggering an exception we'd get a proper
> indication of the last branch before that?

Ok.  It can move to immediately after resetting %dr6.

>
>> @@ -1920,38 +1910,46 @@ void load_TR(void)
>>          : "=m" (old_gdt) : "rm" (TSS_ENTRY << 3), "m" (tss_gdt) : "memory" );
>>  }
>>  
>> -void percpu_traps_init(void)
>> +static uint32_t calc_ler_msr(void)
> Here and elsewhere "unsigned int" would be more appropriate to use.
> We don't require MSR indexes to be exactly 32 bits wide, but only at
> least as wide.

MSR indices are architecturally 32 bits wide.

>
>> +void percpu_traps_init(void)
>> +{
>> +    subarch_percpu_traps_init();
>> +
>> +    if ( !opt_ler )
>> +        return;
>> +
>> +    if ( !ler_msr && (ler_msr = calc_ler_msr()) )
>> +        setup_force_cpu_cap(X86_FEATURE_XEN_LBR);
> This does not hold up with the promise the description makes: If running
> on an unrecognized model, calc_ler_msr() is going to be called more than
> once. If it really was called just once, it could also become __init. With
> the inverted sense of the feature flag (as suggested above) you could
> check whether the flag bit is set or ler_msr is non-zero.

Hmm - I suppose it doesn't quite match the description, but does it
matter (if I tweak the description)?  It is debugging functionality, and
I don't see any 64-bit models missing from the list.
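
For reference, the repeated-call behaviour can be modelled as below.  This
is a sketch under assumptions: the model_* names, the call counter and
PLACEHOLDER_LER_MSR are invented for the example, and the "recognised"
parameter stands in for the real model lookup.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative nonzero MSR index; the real value depends on the CPU model. */
#define PLACEHOLDER_LER_MSR 0x1ddu

static unsigned int model_calc_calls;  /* how often the lookup ran */
static uint32_t model_ler_msr;         /* 0 means "not (yet) known" */
static bool model_xen_lbr_cap;         /* stands in for X86_FEATURE_XEN_LBR */

/* Returns 0 for an unrecognised model, as the real lookup is described to. */
static uint32_t model_calc_ler_msr(bool recognised)
{
    model_calc_calls++;
    return recognised ? PLACEHOLDER_LER_MSR : 0;
}

/* Same shape as the condition in the quoted hunk. */
static void model_percpu_traps_init(bool opt_ler, bool recognised)
{
    if (!opt_ler)
        return;

    if (!model_ler_msr && (model_ler_msr = model_calc_ler_msr(recognised)))
        model_xen_lbr_cap = true;
}

/* Run the per-CPU init on n CPUs and report how many lookups happened. */
static unsigned int model_calls_for_cpus(unsigned int n, bool recognised)
{
    unsigned int i;

    model_calc_calls = 0;
    model_ler_msr = 0;
    model_xen_lbr_cap = false;

    for (i = 0; i < n; i++)
        model_percpu_traps_init(true, recognised);

    return model_calc_calls;
}
```

On a recognised model the lookup runs once however many CPUs come up; on an
unrecognised one it runs once per CPU, because 0 doubles as the "not yet
calculated" marker.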

~Andrew

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel