From: Mohit Gambhir
Subject: Re: [PATCH] x86/vpmu_intel: Fix hypervisor crash by catching wrmsr fault
Date: Mon, 24 Apr 2017 11:44:16 -0400
Message-ID: <73ddd600-b14c-257f-9322-33a516e050a2@oracle.com>
References: <20170420174942.12913-1-mohit.gambhir@oracle.com> <20170420174942.12913-2-mohit.gambhir@oracle.com> <58F9CD740200007800152B3E@prv-mh.provo.novell.com>
In-Reply-To: <58F9CD740200007800152B3E@prv-mh.provo.novell.com>
To: Jan Beulich
Cc: kevin.tian@intel.com, boris.ostrovsky@oracle.com, jun.nakajima@intel.com, xen-devel@lists.xen.org
List-Id: xen-devel@lists.xenproject.org

On 04/21/2017 03:14 AM, Jan Beulich wrote:
>>>> On 20.04.17 at 19:49, <mohit.gambhir@oracle.com> wrote:
>> This patch changes wrmsrl() calls to write to MSR_P6_EVTSEL register in the
>> VPMU to wrmsr_safe(). There are known (and possibly some unknown) cases where
>> setting certain bits in MSR_P6_EVTSEL reg. can cause a General Protection
>> fault on some machines. Unless we catch this fault when it happens, it will
>> result in a hypervisor crash.
>>
>> For instance, setting Pin Control (PC) bit (19) in MSR_P6_EVNTSEL results
>> in a General Protection Fault on Broadwell machines and thus causes the
>> hypervisor to crash.
> I can't seem to find indication of this in the SDM. Is this removal of a
> valid bit from an architectural MSR documented anywhere?
I see it as an observed hardware bug, so I don't expect it to be found in the SDM. I reached out to
the folks at Intel (CC'd) but they are unaware of it. Ideally it should work the way the SDM says
for all events and on all machines, but it doesn't - not for all the events, and certainly not on all
machines.

Also, from its description in the SDM, the PC flag bit seems very disruptive to me.
The SDM says that if the bit is set then the processor toggles the PMi pin (generating a performance monitoring interrupt?)
every time the event occurs. So if we program the counter to count "unhalted core cycles" and set the PC flag bit,
it will generate an interrupt every cycle!?
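
For illustration, one way to guard against this would be something like the sketch below.
The helper name write_evntsel_checked() and the up-front rejection are hypothetical - only
wrmsr_safe() and MSR_P6_EVNTSEL() come from the existing code, and the bit position is the
PC bit as documented in the SDM:

    #define EVNTSEL_PIN_CONTROL (1ULL << 19)  /* PC, bit 19 of IA32_PERFEVTSELx */

    /* Hypothetical helper: reject or trap faulting event-select writes. */
    static int write_evntsel_checked(unsigned int idx, uint64_t val)
    {
        /* Known-problematic on some machines: refuse Pin Control up front. */
        if ( val & EVNTSEL_PIN_CONTROL )
            return -EINVAL;

        /* wrmsr_safe() catches a #GP raised by the write itself. */
        return wrmsr_safe(MSR_P6_EVNTSEL(idx), val);
    }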

>> @@ -356,7 +356,9 @@ static inline void __core2_vpmu_load(struct vcpu *v)
>>      for ( i = 0; i < arch_pmc_cnt; i++ )
>>      {
>>          wrmsrl(pmc_start + i, xen_pmu_cntr_pair[i].counter);
>> -        wrmsrl(MSR_P6_EVNTSEL(i), xen_pmu_cntr_pair[i].control);
>> +        if ( wrmsr_safe(MSR_P6_EVNTSEL(i), xen_pmu_cntr_pair[i].control) ==
>> +                -EFAULT)
>> +            return -EFAULT;
> If the function already returns an error code, please use it to avoid
> a latent bug when in the future it may also return other than -EFAULT.
Fixed in next version.
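
For reference, the corrected pattern might look something like this (a sketch of the next
version, not the final patch):

    for ( i = 0; i < arch_pmc_cnt; i++ )
    {
        int rc;

        wrmsrl(pmc_start + i, xen_pmu_cntr_pair[i].counter);
        /* Propagate whatever error wrmsr_safe() returns, not just -EFAULT. */
        rc = wrmsr_safe(MSR_P6_EVNTSEL(i), xen_pmu_cntr_pair[i].control);
        if ( rc )
            return rc;
    }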
>> @@ -369,6 +371,7 @@ static inline void __core2_vpmu_load(struct vcpu *v)
>>          core2_vpmu_cxt->global_ovf_ctrl = 0;
>>          wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, core2_vpmu_cxt->global_ctrl);
>>      }
>> +    return 0;
> Blank line above the main return statement of a function please.
Fixed in next version.
>> @@ -461,9 +464,8 @@ static int core2_vpmu_load(struct vcpu *v, bool_t from_guest)
>>
>>      vpmu_set(vpmu, VPMU_CONTEXT_LOADED);
>>
>> -    __core2_vpmu_load(v);
>> +    return __core2_vpmu_load(v);
>>
>> -    return 0;
>>  }
> Please also remove the now stray blank line.

Fixed in next version.
>> @@ -538,7 +540,8 @@ static int core2_vpmu_msr_common_check(u32 msr_index, int *type, int *index)
>>      /* Do the lazy load staff. */
>>      if ( !vpmu_is_set(vpmu, VPMU_CONTEXT_LOADED) )
>>      {
>> -        __core2_vpmu_load(current);
>> +        if ( __core2_vpmu_load(current) )
>> +            return 0;
> Nothing you need to change, but the change here points out that
> the function really needs switching to returning bool.
Agreed.
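
For what it's worth, the bool version might look roughly like this - just a sketch of the
shape, not the full function body (the real function does additional checks that are
omitted here):

    /* Sketch: same check, but returning bool as suggested. */
    static bool core2_vpmu_msr_common_check(u32 msr_index, int *type, int *index)
    {
        struct vpmu_struct *vpmu = vcpu_vpmu(current);

        if ( !is_core2_vpmu_msr(msr_index, type, index) )
            return false;

        /* Lazy load; a faulting MSR write means we cannot claim this access. */
        if ( !vpmu_is_set(vpmu, VPMU_CONTEXT_LOADED) )
        {
            if ( __core2_vpmu_load(current) )
                return false;
            vpmu_set(vpmu, VPMU_CONTEXT_LOADED);
        }

        return true;
    }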
>> @@ -719,8 +722,11 @@ static int core2_vpmu_do_wrmsr(unsigned int msr, uint64_t msr_content,
>>          }
>>      }
>>
>> -    if ( type != MSR_TYPE_GLOBAL )
>> -        wrmsrl(msr, msr_content);
>> +    if ( type != MSR_TYPE_GLOBAL)
>> +    {
>> +        if ( wrmsr_safe(msr, msr_content) == -EFAULT )
>> +            return -EFAULT;
> See the earlier similar comment.
Fixed in next version.
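
(Same fix as above - a sketch returning whatever wrmsr_safe() gives us:)

    if ( type != MSR_TYPE_GLOBAL )
    {
        int rc = wrmsr_safe(msr, msr_content);

        if ( rc )
            return rc;
    }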
> Further a more general question: Why do you do this only for
> MSR_P6_EVNTSEL*, but not consistently for all MSRs which are
> being written with (partially) guest controlled values? I'd namely
> expect all wrmsrl() to be switched over, unless it is clear that
> we're dealing with an erratum here and we therefore can
> assume that normal guarantees for architectural MSRs apply to
> everything else.
I think it is an erratum for one specific MSR, which is MSR_P6_EVNTSEL.
> And finally, in the description you talk about forwarding the
> fault - I can see this being the case for core2_vpmu_do_wrmsr(),
> but how does this match up with __core2_vpmu_load() and its
> callers? Namely you may report an error there after already
> having updated some MSRs. Leaving the overall set of MSRs in
> an inconsistent state is not very likely to be a good idea.
Yes, that's something I did not think about, and it needs to be fixed. There are two issues here -

1. What should we set the MSR values to in case of an error? We cannot revert them, because if this is part of a
context switch then we do not want to leak values from the previous guest to the new guest.
So one possible solution is to wipe all PMU MSRs (set them to 0) before returning the error.

2. __core2_vpmu_load() is called in two scenarios - once from the XENPMU_FLUSH hypercall in do_xenpmu_op(),
and once during a context switch from vpmu_switch_to(). The hypercall seems to handle the error
correctly, but vpmu_switch_to() ignores the error. So to propagate the error up to the guest kernel, maybe we should
inject the fault into the guest using hvm_inject_hw_exception(TRAP_gp_fault, error) /
pv_inject_hw_exception(TRAP_gp_fault, error)? A rough sketch follows below.
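
Something along these lines, perhaps (rough sketch only - core2_vpmu_wipe_all() is a
hypothetical helper that zeroes all the PMU MSRs so stale state can't leak between guests):

    /* Possible error path for the vpmu_switch_to() caller. */
    static void vpmu_load_checked(struct vcpu *v)
    {
        if ( __core2_vpmu_load(v) )
        {
            core2_vpmu_wipe_all(v);      /* hypothetical: zero all PMU MSRs */

            /* Propagate the fault to the guest kernel. */
            if ( is_hvm_vcpu(v) )
                hvm_inject_hw_exception(TRAP_gp_fault, 0);
            else
                pv_inject_hw_exception(TRAP_gp_fault, 0);
        }
    }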

> Jan
>
>
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> https://lists.xen.org/xen-devel
