linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* x86_64 INIT/SIPI Bug
@ 2018-11-08 22:23 Rian Quinn
  2018-11-09  1:16 ` Eric W. Biederman
  2018-11-09 17:49 ` Sean Christopherson
  0 siblings, 2 replies; 6+ messages in thread
From: Rian Quinn @ 2018-11-08 22:23 UTC (permalink / raw)
  To: linux-kernel

I apologize upfront if this is the wrong place to post this, pretty new to this.

We are working on the Bareflank Hypervisor (www.bareflank.org), and we
are passing through the INIT/SIPI process (similar to how a VMX
rootkit from EFI might boot the OS) and we noticed that on Arch Linux,
the INIT/SIPI process stalls, something we are not seeing on Ubuntu.

Turns out, to fix the issue, we had to turn on cpu_init_udelay=10000.
The problem is, once a hypervisor is turned on, even one that is doing
nothing but passing through the instructions, the "quick" that is
detailed below fails as the kernel is not giving the CPU enough time
to perform a VMExit/VMEntry (even through the VMExit is not doing
anything).
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/x86/kernel/smpboot.c?h=v4.20-rc1#n650

You can see our INIT/SIPI code here if you are interested:
https://github.com/rianquinn/extended_apis/blob/hyperkernel_1/bfvmm/src/hve/arch/intel_x64/vmexit/init_signal.cpp

The reason I suggest this is a bug is the manual clearly states that a
wait is required and the "quirk" that turns off this delay prevents
code like this from working. Would it be possible to either:
- Turn this off by default, but still allow someone to turn it on if
they are confident the delay is not needed?
- Provide a generic way to turn this off (maybe if a hypervisor is
detected, it defaults to off)?

I'd be more than happy to provide a patch and test, but I'm not sure
if there is any interest in changing this code.

Thanks,
- Rian Quinn

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: x86_64 INIT/SIPI Bug
  2018-11-08 22:23 x86_64 INIT/SIPI Bug Rian Quinn
@ 2018-11-09  1:16 ` Eric W. Biederman
  2018-11-09 16:13   ` Rian Quinn
  2018-11-09 17:49 ` Sean Christopherson
  1 sibling, 1 reply; 6+ messages in thread
From: Eric W. Biederman @ 2018-11-09  1:16 UTC (permalink / raw)
  To: Rian Quinn; +Cc: linux-kernel, x86

Rian Quinn <rianquinn@gmail.com> writes:

> I apologize upfront if this is the wrong place to post this, pretty new to this.
>
> We are working on the Bareflank Hypervisor (www.bareflank.org), and we
> are passing through the INIT/SIPI process (similar to how a VMX
> rootkit from EFI might boot the OS) and we noticed that on Arch Linux,
> the INIT/SIPI process stalls, something we are not seeing on Ubuntu.
>
> Turns out, to fix the issue, we had to turn on cpu_init_udelay=10000.
> The problem is, once a hypervisor is turned on, even one that is doing
> nothing but passing through the instructions, the "quick" that is
> detailed below fails as the kernel is not giving the CPU enough time
> to perform a VMExit/VMEntry (even through the VMExit is not doing
> anything).
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/x86/kernel/smpboot.c?h=v4.20-rc1#n650
>
> You can see our INIT/SIPI code here if you are interested:
> https://github.com/rianquinn/extended_apis/blob/hyperkernel_1/bfvmm/src/hve/arch/intel_x64/vmexit/init_signal.cpp
>
> The reason I suggest this is a bug is the manual clearly states that a
> wait is required and the "quirk" that turns off this delay prevents
> code like this from working. Would it be possible to either:
> - Turn this off by default, but still allow someone to turn it on if
> they are confident the delay is not needed?
> - Provide a generic way to turn this off (maybe if a hypervisor is
> detected, it defaults to off)?
>
> I'd be more than happy to provide a patch and test, but I'm not sure
> if there is any interest in changing this code.

I would suggest testing either for your hypervisor or simply for being
inside some hypervisor.  As I read the code it is only turning off the
10ms delay on processors where it is know that it is safe to do so.
This is a case where it is not safe to disable the 10ms delay,
so it makes sense not not turn off the delay.

Eric


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: x86_64 INIT/SIPI Bug
  2018-11-09  1:16 ` Eric W. Biederman
@ 2018-11-09 16:13   ` Rian Quinn
  0 siblings, 0 replies; 6+ messages in thread
From: Rian Quinn @ 2018-11-09 16:13 UTC (permalink / raw)
  To: ebiederm; +Cc: linux-kernel, x86

>> I apologize upfront if this is the wrong place to post this, pretty new to this.
>>
>> We are working on the Bareflank Hypervisor (www.bareflank.org), and we
>> are passing through the INIT/SIPI process (similar to how a VMX
>> rootkit from EFI might boot the OS) and we noticed that on Arch Linux,
>> the INIT/SIPI process stalls, something we are not seeing on Ubuntu.
>>
>> Turns out, to fix the issue, we had to turn on cpu_init_udelay=10000.
>> The problem is, once a hypervisor is turned on, even one that is doing
>> nothing but passing through the instructions, the "quick" that is
>> detailed below fails as the kernel is not giving the CPU enough time
>> to perform a VMExit/VMEntry (even through the VMExit is not doing
>> anything).
>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/x86/kernel/smpboot.c?h=v4.20-rc1#n650
>>
>> You can see our INIT/SIPI code here if you are interested:
>> https://github.com/rianquinn/extended_apis/blob/hyperkernel_1/bfvmm/src/hve/arch/intel_x64/vmexit/init_signal.cpp
>>
>> The reason I suggest this is a bug is the manual clearly states that a
>> wait is required and the "quirk" that turns off this delay prevents
>> code like this from working. Would it be possible to either:
>> - Turn this off by default, but still allow someone to turn it on if
>> they are confident the delay is not needed?
>> - Provide a generic way to turn this off (maybe if a hypervisor is
>> detected, it defaults to off)?
>>
>> I'd be more than happy to provide a patch and test, but I'm not sure
>> if there is any interest in changing this code.
>
> I would suggest testing either for your hypervisor or simply for being
> inside some hypervisor.  As I read the code it is only turning off the
> 10ms delay on processors where it is know that it is safe to do so.
> This is a case where it is not safe to disable the 10ms delay,
> so it makes sense not not turn off the delay.

I think the best solution would be to simply check CPUID leaf 0x40000000
as every hypervisor (at least the legit ones) set this leaf to a unique
ID that can be used to see if a hypervisor is running. Specifically, I
would patch the smp_quirk_init_udelay() function to check to see if
CPUID 0x40000000 returns 0. If it does, no hypervisor is running in which case
the quirk can set the delay to 0. If the leaf is not 0, the quirk is ignored
and the delay is set to UDELAY_10MS_DEFAULT as it does already.

Thoughts? I think the patch would look something like:

    static void __init smp_quirk_init_udelay(void)
    {
+       unsigned int eax, ebx, ecx, edx;
+
+    eax = 0x40000000;
+    ecx = 0;
+    native_cpuid(&eax, &ebx, &ecx, &edx);
+
        /* if cmdline changed it from default, leave it alone */
        if (init_udelay != UINT_MAX)
            return;

+       /* ensure a hypervisor is not running */
+       if (ebx == 0) {
+           /* if modern processor, use no delay */
+           if (((boot_cpu_data.x86_vendor == X86_VENDOR_INTEL) &&
(boot_cpu_data.x86 == 6)) ||
+               ((boot_cpu_data.x86_vendor == X86_VENDOR_AMD) &&
(boot_cpu_data.x86 >= 0xF))) {
+               init_udelay = 0;
+               return;
+           }
+       }
+
-       /* if modern processor, use no delay */
-       if (((boot_cpu_data.x86_vendor == X86_VENDOR_INTEL) &&
(boot_cpu_data.x86 == 6)) ||
-           ((boot_cpu_data.x86_vendor == X86_VENDOR_AMD) &&
(boot_cpu_data.x86 >= 0xF))) {
-           init_udelay = 0;
-           return;
-       }
        /* else, use legacy delay */
        init_udelay = UDELAY_10MS_DEFAULT;
    }

On Thu, Nov 8, 2018 at 6:16 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
>
> Rian Quinn <rianquinn@gmail.com> writes:
>
> > I apologize upfront if this is the wrong place to post this, pretty new to this.
> >
> > We are working on the Bareflank Hypervisor (www.bareflank.org), and we
> > are passing through the INIT/SIPI process (similar to how a VMX
> > rootkit from EFI might boot the OS) and we noticed that on Arch Linux,
> > the INIT/SIPI process stalls, something we are not seeing on Ubuntu.
> >
> > Turns out, to fix the issue, we had to turn on cpu_init_udelay=10000.
> > The problem is, once a hypervisor is turned on, even one that is doing
> > nothing but passing through the instructions, the "quick" that is
> > detailed below fails as the kernel is not giving the CPU enough time
> > to perform a VMExit/VMEntry (even through the VMExit is not doing
> > anything).
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/x86/kernel/smpboot.c?h=v4.20-rc1#n650
> >
> > You can see our INIT/SIPI code here if you are interested:
> > https://github.com/rianquinn/extended_apis/blob/hyperkernel_1/bfvmm/src/hve/arch/intel_x64/vmexit/init_signal.cpp
> >
> > The reason I suggest this is a bug is the manual clearly states that a
> > wait is required and the "quirk" that turns off this delay prevents
> > code like this from working. Would it be possible to either:
> > - Turn this off by default, but still allow someone to turn it on if
> > they are confident the delay is not needed?
> > - Provide a generic way to turn this off (maybe if a hypervisor is
> > detected, it defaults to off)?
> >
> > I'd be more than happy to provide a patch and test, but I'm not sure
> > if there is any interest in changing this code.
>
> I would suggest testing either for your hypervisor or simply for being
> inside some hypervisor.  As I read the code it is only turning off the
> 10ms delay on processors where it is know that it is safe to do so.
> This is a case where it is not safe to disable the 10ms delay,
> so it makes sense not not turn off the delay.
>
> Eric
>

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: x86_64 INIT/SIPI Bug
  2018-11-08 22:23 x86_64 INIT/SIPI Bug Rian Quinn
  2018-11-09  1:16 ` Eric W. Biederman
@ 2018-11-09 17:49 ` Sean Christopherson
  2018-11-09 18:04   ` Rian Quinn
  1 sibling, 1 reply; 6+ messages in thread
From: Sean Christopherson @ 2018-11-09 17:49 UTC (permalink / raw)
  To: Rian Quinn; +Cc: linux-kernel

On Thu, Nov 08, 2018 at 03:23:59PM -0700, Rian Quinn wrote:
> I apologize upfront if this is the wrong place to post this, pretty new to this.
> 
> We are working on the Bareflank Hypervisor (www.bareflank.org), and we
> are passing through the INIT/SIPI process (similar to how a VMX
> rootkit from EFI might boot the OS) and we noticed that on Arch Linux,
> the INIT/SIPI process stalls, something we are not seeing on Ubuntu.

I'm confused, INIT is blocked post-VMXON, what are you passing through?

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: x86_64 INIT/SIPI Bug
  2018-11-09 17:49 ` Sean Christopherson
@ 2018-11-09 18:04   ` Rian Quinn
  2018-11-09 18:42     ` Sean Christopherson
  0 siblings, 1 reply; 6+ messages in thread
From: Rian Quinn @ 2018-11-09 18:04 UTC (permalink / raw)
  To: sean.j.christopherson; +Cc: linux-kernel

>> I apologize upfront if this is the wrong place to post this, pretty new to this.
>>
>> We are working on the Bareflank Hypervisor (www.bareflank.org), and we
>> are passing through the INIT/SIPI process (similar to how a VMX
>> rootkit from EFI might boot the OS) and we noticed that on Arch Linux,
>> the INIT/SIPI process stalls, something we are not seeing on Ubuntu.
>
> I'm confused, INIT is blocked post-VMXON, what are you passing through?

You are correct that INIT will track unconditionally, but all we do is set the
activity state to wait-for-sipi and return back, allowing Linux to continue
its boot process. The code can be seen here:
https://github.com/rianquinn/extended_apis/blob/hyperkernel_1/bfvmm/src/hve/arch/intel_x64/vmexit/init_signal.cpp

It should be noted that this works great for Linux and Windows, allowing us
to boot pretty much any OS that we want with the hypervisor running
(kind of like a VMX rootkit as no traps are really occurring except CPUID
once the OS is loaded). We are doing something very similar to Intel's KGT,
and Xen's PVH dom0.

The problem is, as we started working this we noticed that Ubuntu was booting
fine, but Arch wasn't and it turns out that Arch must be compiling the kernel
with this optimization enabled. Once it is enabled, the kernel basically
sends two SIPI commands before the AP has a chance to trap INIT, set
the activity
state, and then reenter, which causes both SIPIs to get dropped by hardware
. In other words, since Linux is not waiting to send the first SIPI
like the manual states, the SIPIs are lost if a hypervisor is enabled, even
if the hypervisor is doing the least possible amount of code (just setting the
activity state and returning). The working solution is either us a Linux
distribution that disables this optimization like Ubuntu, or to provide the
Linux kernel with the boot param to tell it to add the delay.

To root of the issue is the quirk is assuming that the CPU can handle the
INIT/SIPI/SIPI without a delay, but this assumption doesn't hold if the INIT
first has to trap to a hypervisor (regardless of the hypervisor). In this case,
a delay is still needed.
On Fri, Nov 9, 2018 at 10:49 AM Sean Christopherson
<sean.j.christopherson@intel.com> wrote:
>
> On Thu, Nov 08, 2018 at 03:23:59PM -0700, Rian Quinn wrote:
> > I apologize upfront if this is the wrong place to post this, pretty new to this.
> >
> > We are working on the Bareflank Hypervisor (www.bareflank.org), and we
> > are passing through the INIT/SIPI process (similar to how a VMX
> > rootkit from EFI might boot the OS) and we noticed that on Arch Linux,
> > the INIT/SIPI process stalls, something we are not seeing on Ubuntu.
>
> I'm confused, INIT is blocked post-VMXON, what are you passing through?

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: x86_64 INIT/SIPI Bug
  2018-11-09 18:04   ` Rian Quinn
@ 2018-11-09 18:42     ` Sean Christopherson
  0 siblings, 0 replies; 6+ messages in thread
From: Sean Christopherson @ 2018-11-09 18:42 UTC (permalink / raw)
  To: Rian Quinn; +Cc: linux-kernel

On Fri, Nov 09, 2018 at 11:04:59AM -0700, Rian Quinn wrote:
> >> I apologize upfront if this is the wrong place to post this, pretty new to this.
> >>
> >> We are working on the Bareflank Hypervisor (www.bareflank.org), and we
> >> are passing through the INIT/SIPI process (similar to how a VMX
> >> rootkit from EFI might boot the OS) and we noticed that on Arch Linux,
> >> the INIT/SIPI process stalls, something we are not seeing on Ubuntu.
> >
> > I'm confused, INIT is blocked post-VMXON, what are you passing through?
> 
> You are correct that INIT will track unconditionally, but all we do is set the
> activity state to wait-for-sipi and return back, allowing Linux to continue
> its boot process.

That's not pass-through, maybe call it reflection?  I realize I'm being
a bit pedantic, but differentiating between the two matters since true
pass-through gives you bare metal performance whereas reflection obviously
requires a round-trip VMX transition.

Most hypervisors don't need a delay because they don't pass-through the
local APIC and instead emulate INIT/SIPI/SIPI.  In other words, forcing
a delay for all hypervisors is unwarranted.

The correct fix is probably to add a new hook to struct x86_hyper_init
to provide a custom init delay, and add Bareflank as a new hypervisor.

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2018-11-09 18:42 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-11-08 22:23 x86_64 INIT/SIPI Bug Rian Quinn
2018-11-09  1:16 ` Eric W. Biederman
2018-11-09 16:13   ` Rian Quinn
2018-11-09 17:49 ` Sean Christopherson
2018-11-09 18:04   ` Rian Quinn
2018-11-09 18:42     ` Sean Christopherson

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).