PROBLEM: Regression of MMU causing guest VM application errors

* PROBLEM: Regression of MMU causing guest VM application errors
@ 2019-10-16  4:49 Derek Yerger
  2019-10-16  7:28 ` Paolo Bonzini
  2019-10-16 17:28 ` Alex Williamson
  0 siblings, 2 replies; 19+ messages in thread
From: Derek Yerger @ 2019-10-16  4:49 UTC (permalink / raw)
  To: kvm

In at least Linux 5.2.7 via Fedora, up to 5.2.18, applications in the guest OS
repeatedly crash with segfaults. The problem does not occur on 5.1.16.

The host is running Fedora 29 with kernel 5.2.18. The guest OS is Windows 10 with
an AMD Radeon 540 GPU passed through. On 5.2.7 or 5.2.18, specific Windows
applications crash frequently and repeatedly, throwing exceptions in random
libraries. Going back to 5.1.16, the issue does not occur.

The host system is unaffected by the regression.

Keywords: kvm mmu pci passthrough vfio vfio-pci amdgpu

Possibly related: Unmerged [PATCH] KVM: x86/MMU: Zap all when removing memslot 
if VM has assigned device

Workaround: Use 5.1.16 kernel.


-- 
Derek Yerger

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: PROBLEM: Regression of MMU causing guest VM application errors
  2019-10-16  4:49 PROBLEM: Regression of MMU causing guest VM application errors Derek Yerger
@ 2019-10-16  7:28 ` Paolo Bonzini
  2019-10-16 17:28 ` Alex Williamson
  1 sibling, 0 replies; 19+ messages in thread
From: Paolo Bonzini @ 2019-10-16  7:28 UTC (permalink / raw)
  To: Derek Yerger, kvm

On 16/10/19 06:49, Derek Yerger wrote:
> In at least Linux 5.2.7 via Fedora, up to 5.2.18, guest OS applications
> repeatedly crash with segfaults. The problem does not occur on 5.1.16.
> 
> System is running Fedora 29 with kernel 5.2.18. Guest OS is Windows 10
> with an AMD Radeon 540 GPU passthrough. When on 5.2.7 or 5.2.18,
> specific windows applications frequently and repeatedly crash, throwing
> exceptions in random libraries. Going back to 5.1.16, the issue does not
> occur.
> 
> The host system is unaffected by the regression.
> 
> Keywords: kvm mmu pci passthrough vfio vfio-pci amdgpu
> 
> Possibly related: Unmerged [PATCH] KVM: x86/MMU: Zap all when removing
> memslot if VM has assigned device
> 
> Workaround: Use 5.1.16 kernel.

This should have been fixed in 5.2.16 through the following patches:

- "[x86] Revert "KVM: x86/mmu: Zap only the relevant pages when removing
a memslot" (5.2.11)

- "KVM: x86/mmu: Reintroduce fast invalidate/zap for flushing memslot"
(5.2.16).

Paolo



* Re: PROBLEM: Regression of MMU causing guest VM application errors
  2019-10-16  4:49 PROBLEM: Regression of MMU causing guest VM application errors Derek Yerger
  2019-10-16  7:28 ` Paolo Bonzini
@ 2019-10-16 17:28 ` Alex Williamson
  2019-10-16 17:49   ` Sean Christopherson
  1 sibling, 1 reply; 19+ messages in thread
From: Alex Williamson @ 2019-10-16 17:28 UTC (permalink / raw)
  To: Derek Yerger; +Cc: kvm, sean.j.christopherson, Bonzini, Paolo

On Wed, 16 Oct 2019 00:49:51 -0400
Derek Yerger <derek@djy.llc> wrote:

> In at least Linux 5.2.7 via Fedora, up to 5.2.18, guest OS applications 
> repeatedly crash with segfaults. The problem does not occur on 5.1.16.
> 
> System is running Fedora 29 with kernel 5.2.18. Guest OS is Windows 10 with an 
> AMD Radeon 540 GPU passthrough. When on 5.2.7 or 5.2.18, specific windows 
> applications frequently and repeatedly crash, throwing exceptions in random 
> libraries. Going back to 5.1.16, the issue does not occur.
> 
> The host system is unaffected by the regression.
> 
> Keywords: kvm mmu pci passthrough vfio vfio-pci amdgpu
> 
> Possibly related: Unmerged [PATCH] KVM: x86/MMU: Zap all when removing memslot 
> if VM has assigned device

That was never merged because it was superseded by:

d012a06ab1d2 Revert "KVM: x86/mmu: Zap only the relevant pages when removing a memslot"

That revert also induced this commit:

002c5f73c508 KVM: x86/mmu: Reintroduce fast invalidate/zap for flushing memslot

Both of these were merged to stable, showing up in 5.2.11 and 5.2.16
respectively, so seeing these sorts of issues might be considered a
known issue on 5.2.7, but not 5.2.18 afaik.  Do you have a specific
test that reliably reproduces the issue?  Thanks,

Alex


* Re: PROBLEM: Regression of MMU causing guest VM application errors
  2019-10-16 17:28 ` Alex Williamson
@ 2019-10-16 17:49   ` Sean Christopherson
  2019-10-17 23:57     ` Derek Yerger
  0 siblings, 1 reply; 19+ messages in thread
From: Sean Christopherson @ 2019-10-16 17:49 UTC (permalink / raw)
  To: Alex Williamson; +Cc: Derek Yerger, kvm, Bonzini, Paolo

On Wed, Oct 16, 2019 at 11:28:57AM -0600, Alex Williamson wrote:
> On Wed, 16 Oct 2019 00:49:51 -0400
> Derek Yerger <derek@djy.llc> wrote:
> 
> > In at least Linux 5.2.7 via Fedora, up to 5.2.18, guest OS applications 
> > repeatedly crash with segfaults. The problem does not occur on 5.1.16.
> > 
> > System is running Fedora 29 with kernel 5.2.18. Guest OS is Windows 10 with an 
> > AMD Radeon 540 GPU passthrough. When on 5.2.7 or 5.2.18, specific windows 
> > applications frequently and repeatedly crash, throwing exceptions in random 
> > libraries. Going back to 5.1.16, the issue does not occur.
> > 
> > The host system is unaffected by the regression.
> > 
> > Keywords: kvm mmu pci passthrough vfio vfio-pci amdgpu
> > 
> > Possibly related: Unmerged [PATCH] KVM: x86/MMU: Zap all when removing memslot 
> > if VM has assigned device
> 
> That was never merged because it was superseded by:
> 
> d012a06ab1d2 Revert "KVM: x86/mmu: Zap only the relevant pages when removing a memslot"
> 
> That revert also induced this commit:
> 
> 002c5f73c508 KVM: x86/mmu: Reintroduce fast invalidate/zap for flushing memslot
> 
> Both of these were merged to stable, showing up in 5.2.11 and 5.2.16
> respectively, so seeing these sorts of issues might be considered a
> known issue on 5.2.7, but not 5.2.18 afaik.  Do you have a specific
> test that reliably reproduces the issue?  Thanks,

Also, does the failure reproduce on 5.2.1 - 5.2.6?  The memslot debacle
exists on all flavors of 5.2.x, if the errors showed up in 5.2.7 then they
are being caused by something else.


* Re: PROBLEM: Regression of MMU causing guest VM application errors
  2019-10-16 17:49   ` Sean Christopherson
@ 2019-10-17 23:57     ` Derek Yerger
  2019-10-22 20:28       ` Sean Christopherson
  0 siblings, 1 reply; 19+ messages in thread
From: Derek Yerger @ 2019-10-17 23:57 UTC (permalink / raw)
  To: Sean Christopherson, Alex Williamson; +Cc: kvm, Bonzini, Paolo

On 10/16/19 1:49 PM, Sean Christopherson wrote:
> On Wed, Oct 16, 2019 at 11:28:57AM -0600, Alex Williamson wrote:
>> On Wed, 16 Oct 2019 00:49:51 -0400
>> Derek Yerger<derek@djy.llc>  wrote:
>>
>>> In at least Linux 5.2.7 via Fedora, up to 5.2.18, guest OS applications
>>> repeatedly crash with segfaults. The problem does not occur on 5.1.16.
>>>
>>> System is running Fedora 29 with kernel 5.2.18. Guest OS is Windows 10 with an
>>> AMD Radeon 540 GPU passthrough. When on 5.2.7 or 5.2.18, specific windows
>>> applications frequently and repeatedly crash, throwing exceptions in random
>>> libraries. Going back to 5.1.16, the issue does not occur.
>>>
>>> The host system is unaffected by the regression.
>>>
>>> Keywords: kvm mmu pci passthrough vfio vfio-pci amdgpu
>>>
>>> Possibly related: Unmerged [PATCH] KVM: x86/MMU: Zap all when removing memslot
>>> if VM has assigned device
>> That was never merged because it was superseded by:
>>
>> d012a06ab1d2 Revert "KVM: x86/mmu: Zap only the relevant pages when removing a memslot"
>>
>> That revert also induced this commit:
>>
>> 002c5f73c508 KVM: x86/mmu: Reintroduce fast invalidate/zap for flushing memslot
>>
>> Both of these were merged to stable, showing up in 5.2.11 and 5.2.16
>> respectively, so seeing these sorts of issues might be considered a
>> known issue on 5.2.7, but not 5.2.18 afaik.  Do you have a specific
>> test that reliably reproduces the issue?  Thanks,

Test case 1: Kernel 5.2.18, PCI passthrough, Windows 10 guest. Error condition.
Error 1: Application error in Firefox. Restarting Firefox and restoring tabs
reliably crashes the application with a stack overflow error.
Error 2: Guest BSOD by the morning if left idle.
Error 3: Guest BSOD within 1 minute of using SolidWorks CAD software.

Test case 2: Kernel 5.2.18, no PCI passthrough, same environment. Guest BSOD
encountered.

Test case 3: Kernel 5.1.16, no PCI passthrough, same environment. Worked in
SolidWorks for 10 minutes without BSOD. Opened Firefox and restored tabs, no crash.

Test case 4: Kernel 5.1.16, with PCI passthrough, same environment. Worked in
SolidWorks for half an hour. Opened Firefox and restored tabs, no crash.

Other factors: The guest does not change between tests. Same drivers, software, 
etc. I have reliably switched between 5.2.x and 5.1.x multiple times in the past 
month and repeatably see issues with 5.2.x. At this point I'm unsure if it's PCI 
passthrough causing the problem.

I know I should probably start from fresh host and guest, but time isn't really 
permitting.
> Also, does the failure reproduce on on 5.2.1 - 5.2.6?  The memslot debacle
> exists on all flavors of 5.2.x, if the errors showed up in 5.2.7 then they
> are being caused by something else.
After experiencing the issue in the absence of PCI passthrough, I believe the
problem is unrelated to the memslot debacle. I'm stuck on 5.1.x for now; maybe
I'll give up and get a dedicated Windows machine /s


* Re: PROBLEM: Regression of MMU causing guest VM application errors
  2019-10-17 23:57     ` Derek Yerger
@ 2019-10-22 20:28       ` Sean Christopherson
  2019-10-24 15:18         ` Derek Yerger
  0 siblings, 1 reply; 19+ messages in thread
From: Sean Christopherson @ 2019-10-22 20:28 UTC (permalink / raw)
  To: Derek Yerger; +Cc: Alex Williamson, kvm, Bonzini, Paolo

On Thu, Oct 17, 2019 at 07:57:35PM -0400, Derek Yerger wrote:
> On 10/16/19 1:49 PM, Sean Christopherson wrote:
> >On Wed, Oct 16, 2019 at 11:28:57AM -0600, Alex Williamson wrote:
> >>On Wed, 16 Oct 2019 00:49:51 -0400
> >>Derek Yerger<derek@djy.llc>  wrote:
> >>
> >>>In at least Linux 5.2.7 via Fedora, up to 5.2.18, guest OS applications
> >>>repeatedly crash with segfaults. The problem does not occur on 5.1.16.
> >>>
> >>>System is running Fedora 29 with kernel 5.2.18. Guest OS is Windows 10 with an
> >>>AMD Radeon 540 GPU passthrough. When on 5.2.7 or 5.2.18, specific windows
> >>>applications frequently and repeatedly crash, throwing exceptions in random
> >>>libraries. Going back to 5.1.16, the issue does not occur.
> >>>
> >>>The host system is unaffected by the regression.
> >>>
> >>>Keywords: kvm mmu pci passthrough vfio vfio-pci amdgpu
> >>>
> >>>Possibly related: Unmerged [PATCH] KVM: x86/MMU: Zap all when removing memslot
> >>>if VM has assigned device
> >>That was never merged because it was superseded by:
> >>
> >>d012a06ab1d2 Revert "KVM: x86/mmu: Zap only the relevant pages when removing a memslot"
> >>
> >>That revert also induced this commit:
> >>
> >>002c5f73c508 KVM: x86/mmu: Reintroduce fast invalidate/zap for flushing memslot
> >>
> >>Both of these were merged to stable, showing up in 5.2.11 and 5.2.16
> >>respectively, so seeing these sorts of issues might be considered a
> >>known issue on 5.2.7, but not 5.2.18 afaik.  Do you have a specific
> >>test that reliably reproduces the issue?  Thanks,
> Test case 1: Kernel 5.2.18, PCI passthrough, Windows 10 guest, error condition.
> Error 1: Application error in Firefox, restarting firefox and restoring tabs
> reliably causes application crash with stack overflow error.
> Error 2: Guest BSOD by the morning if left idle
> Error 3: Guest BSOD within 1 minute of using SolidWorks CAD software
> 
> Test case 2: Kernel 5.2.18, no PCI passthrough, same environment. Guest BSOD
> encountered.
> 
> Test case 3: Kernel 5.1.16, no PCI passthrough, same environment. Worked in
> Solidworks for 10 minutes without BSOD. Opened firefox and restored tabs, no
> crash.
> 
> Test case 4: Kernel 5.1.16, with PCI passthrough, same environment. Worked
> in Solidworks for a half hour. Opened firefox and restored tabs, no crash.
> 
> Other factors: The guest does not change between tests. Same drivers,
> software, etc. I have reliably switched between 5.2.x and 5.1.x multiple
> times in the past month and repeatably see issues with 5.2.x. At this point
> I'm unsure if it's PCI passthrough causing the problem.
> 
> I know I should probably start from fresh host and guest, but time isn't
> really permitting.
> >Also, does the failure reproduce on on 5.2.1 - 5.2.6?  The memslot debacle
> >exists on all flavors of 5.2.x, if the errors showed up in 5.2.7 then they
> >are being caused by something else.
> After experiencing the issue in absence of PCI passthrough, I believe the
> problem is unrelated to the memslot debacle.

Heh, should've checked from the get go...  It's definitely not the memslot
issue, because the memslot bug is in 5.1.16 as well.  :-)

> I'm stuck on 5.1.x for now, maybe I'll give up and get a dedicated windows
> machine /s

What hardware are you running on?  I was thinking this was AMD specific,
but then realized you said "AMD Radeon 540 GPU" and not "AMD CPU".


* Re: PROBLEM: Regression of MMU causing guest VM application errors
  2019-10-22 20:28       ` Sean Christopherson
@ 2019-10-24 15:18         ` Derek Yerger
  2019-10-24 17:32           ` Sean Christopherson
  0 siblings, 1 reply; 19+ messages in thread
From: Derek Yerger @ 2019-10-24 15:18 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: Alex Williamson, kvm, Bonzini, Paolo

On 10/22/19 4:28 PM, Sean Christopherson wrote:
> On Thu, Oct 17, 2019 at 07:57:35PM -0400, Derek Yerger wrote:
>> On 10/16/19 1:49 PM, Sean Christopherson wrote:
>>> On Wed, Oct 16, 2019 at 11:28:57AM -0600, Alex Williamson wrote:
>>>> On Wed, 16 Oct 2019 00:49:51 -0400
>>>> Derek Yerger<derek@djy.llc>  wrote:
>>>>
>>>>> In at least Linux 5.2.7 via Fedora, up to 5.2.18, guest OS applications
>>>>> repeatedly crash with segfaults. The problem does not occur on 5.1.16.
>>>>>
>>>>> System is running Fedora 29 with kernel 5.2.18. Guest OS is Windows 10 with an
>>>>> AMD Radeon 540 GPU passthrough. When on 5.2.7 or 5.2.18, specific windows
>>>>> applications frequently and repeatedly crash, throwing exceptions in random
>>>>> libraries. Going back to 5.1.16, the issue does not occur.
>>>>>
>>>>> The host system is unaffected by the regression.
>>>>>
>>>>> Keywords: kvm mmu pci passthrough vfio vfio-pci amdgpu
>>>>>
>>>>> Possibly related: Unmerged [PATCH] KVM: x86/MMU: Zap all when removing memslot
>>>>> if VM has assigned device
>>>> That was never merged because it was superseded by:
>>>>
>>>> d012a06ab1d2 Revert "KVM: x86/mmu: Zap only the relevant pages when removing a memslot"
>>>>
>>>> That revert also induced this commit:
>>>>
>>>> 002c5f73c508 KVM: x86/mmu: Reintroduce fast invalidate/zap for flushing memslot
>>>>
>>>> Both of these were merged to stable, showing up in 5.2.11 and 5.2.16
>>>> respectively, so seeing these sorts of issues might be considered a
>>>> known issue on 5.2.7, but not 5.2.18 afaik.  Do you have a specific
>>>> test that reliably reproduces the issue?  Thanks,
>> Test case 1: Kernel 5.2.18, PCI passthrough, Windows 10 guest, error condition.
>> Error 1: Application error in Firefox, restarting firefox and restoring tabs
>> reliably causes application crash with stack overflow error.
>> Error 2: Guest BSOD by the morning if left idle
>> Error 3: Guest BSOD within 1 minute of using SolidWorks CAD software
>>
>> Test case 2: Kernel 5.2.18, no PCI passthrough, same environment. Guest BSOD
>> encountered.
>>
>> Test case 3: Kernel 5.1.16, no PCI passthrough, same environment. Worked in
>> Solidworks for 10 minutes without BSOD. Opened firefox and restored tabs, no
>> crash.
>>
>> Test case 4: Kernel 5.1.16, with PCI passthrough, same environment. Worked
>> in Solidworks for a half hour. Opened firefox and restored tabs, no crash.
>>
>> Other factors: The guest does not change between tests. Same drivers,
>> software, etc. I have reliably switched between 5.2.x and 5.1.x multiple
>> times in the past month and repeatably see issues with 5.2.x. At this point
>> I'm unsure if it's PCI passthrough causing the problem.
>>
>> I know I should probably start from fresh host and guest, but time isn't
>> really permitting.
>>> Also, does the failure reproduce on on 5.2.1 - 5.2.6?  The memslot debacle
>>> exists on all flavors of 5.2.x, if the errors showed up in 5.2.7 then they
>>> are being caused by something else.
>> After experiencing the issue in absence of PCI passthrough, I believe the
>> problem is unrelated to the memslot debacle.
> Heh, should've checked from the get go...  It's definitely not the memslot
> issue, because the memslot bug is in 5.1.16 as well.  :-)
I didn't pick up on that, nice catch. The memslot thread was the closest thing I 
could find to an educated guess.
>> I'm stuck on 5.1.x for now, maybe I'll give up and get a dedicated windows
>> machine /s
> What hardware are you running on?  I was thinking this was AMD specific,
> but then realized you said "AMD Radeon 540 GPU" and not "AMD CPU".
Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz

07:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Lexa PRO [Radeon 540/540X/550/550X / RX 540X/550/550X] (rev c7)
         Subsystem: Gigabyte Technology Co., Ltd Device 22fe
         Kernel driver in use: vfio-pci
         Kernel modules: amdgpu
(plus related audio device)

I can't think of any other data points that would be helpful to solving system 
instability in a guest OS. But given my troubleshooting before, it looks like 
presence/absence of a PCI passthrough device is inconsequential to whether the 
problem is occurring.

I may have to try out other VMs or a fresh Windows guest.


* Re: PROBLEM: Regression of MMU causing guest VM application errors
  2019-10-24 15:18         ` Derek Yerger
@ 2019-10-24 17:32           ` Sean Christopherson
  2019-10-31  3:44             ` Derek Yerger
  0 siblings, 1 reply; 19+ messages in thread
From: Sean Christopherson @ 2019-10-24 17:32 UTC (permalink / raw)
  To: Derek Yerger; +Cc: Alex Williamson, kvm, Bonzini, Paolo

On Thu, Oct 24, 2019 at 11:18:59AM -0400, Derek Yerger wrote:
> On 10/22/19 4:28 PM, Sean Christopherson wrote:
> >On Thu, Oct 17, 2019 at 07:57:35PM -0400, Derek Yerger wrote:
> >Heh, should've checked from the get go...  It's definitely not the memslot
> >issue, because the memslot bug is in 5.1.16 as well.  :-)
> I didn't pick up on that, nice catch. The memslot thread was the closest
> thing I could find to an educated guess.
> >>I'm stuck on 5.1.x for now, maybe I'll give up and get a dedicated windows
> >>machine /s
> >What hardware are you running on?  I was thinking this was AMD specific,
> >but then realized you said "AMD Radeon 540 GPU" and not "AMD CPU".
> Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz
> 
> 07:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI]
> Lexa PRO [Radeon 540/540X/550/550X / RX 540X/550/550X] (rev c7)
>         Subsystem: Gigabyte Technology Co., Ltd Device 22fe
>         Kernel driver in use: vfio-pci
>         Kernel modules: amdgpu
> (plus related audio device)
> 
> I can't think of any other data points that would be helpful to solving
> system instability in a guest OS.

Can you bisect starting from v5.2?  Identifying which commit in the kernel
introduced the regression would help immensely.

> But given my troubleshooting before, it
> looks like presence/absence of a PCI passthrough device is inconsequential
> to whether the problem is occurring.
> 
> I may have to try out other VMs or a fresh windows guest.


* Re: PROBLEM: Regression of MMU causing guest VM application errors
  2019-10-24 17:32           ` Sean Christopherson
@ 2019-10-31  3:44             ` Derek Yerger
  2019-11-19 20:01               ` Sean Christopherson
  0 siblings, 1 reply; 19+ messages in thread
From: Derek Yerger @ 2019-10-31  3:44 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: Alex Williamson, kvm, Bonzini, Paolo


On 10/24/19 1:32 PM, Sean Christopherson wrote:
> On Thu, Oct 24, 2019 at 11:18:59AM -0400, Derek Yerger wrote:
>> On 10/22/19 4:28 PM, Sean Christopherson wrote:
>>> On Thu, Oct 17, 2019 at 07:57:35PM -0400, Derek Yerger wrote:
>>> Heh, should've checked from the get go...  It's definitely not the memslot
>>> issue, because the memslot bug is in 5.1.16 as well.  :-)
>> I didn't pick up on that, nice catch. The memslot thread was the closest
>> thing I could find to an educated guess.
>>>> I'm stuck on 5.1.x for now, maybe I'll give up and get a dedicated windows
>>>> machine /s
>>> What hardware are you running on?  I was thinking this was AMD specific,
>>> but then realized you said "AMD Radeon 540 GPU" and not "AMD CPU".
>> Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz
>>
>> 07:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI]
>> Lexa PRO [Radeon 540/540X/550/550X / RX 540X/550/550X] (rev c7)
>>          Subsystem: Gigabyte Technology Co., Ltd Device 22fe
>>          Kernel driver in use: vfio-pci
>>          Kernel modules: amdgpu
>> (plus related audio device)
>>
>> I can't think of any other data points that would be helpful to solving
>> system instability in a guest OS.
> Can you bisect starting from v5.2?  Identifying which commit in the kernel
> introduced the regression would help immensely.
On the host, I have to install NVIDIA GPU drivers with each new kernel build.
During the process I discovered that I can't reproduce the issue on any kernel
if I skip the *host* GPU drivers and start libvirtd in single-user mode.

I noticed the following in the host kernel log around the time the guest 
encountered BSOD on 5.2.7:

[  337.841491] WARNING: CPU: 6 PID: 7548 at arch/x86/kvm/x86.c:7963 kvm_arch_vcpu_ioctl_run+0x19b1/0x1b00 [kvm]

I have the rest of the log available if it's needed.

Otherwise the bisection process is: build/install/run the kernel, install host GPU
drivers, exit single-user mode, start virt-manager, and do a few things in the
guest until a crash occurs.

I swapped between Fedora distribution kernel 5.2.7 and 5.1.16 to be sure my test 
was reliably working between good/bad. I then built from tag v5.2.7 and 
confirmed the issue was present. The test failure is indicated by one of BSOD, 
Firefox crash, or tab crash, and reliably happens on the problem kernel but not 
on the good one.

After about 10 steps into bisecting, my tests became less reliable to the point 
that I'm not sure whether to mark my current point @381dc73f as good or bad. I 
had one crash but have been using the guest otherwise reliably for a few days. 
Considering the time it takes to build, install, and test, I didn't want to go 
too far down the wrong path if my tests are unreliable (even though 5.2.7 is a 
guaranteed and timely failure). I'll probably pick it back up over the weekend.

In any event, here is the bisect log up to now:

git bisect start
# bad: [5697a9d3d55fad99ffc3c1ba5654426ab64df333] Linux 5.2.7
git bisect bad 5697a9d3d55fad99ffc3c1ba5654426ab64df333
# good: [8584aaf1c3262ca17d1e4a614ede9179ef462bb0] Linux 5.1.16
git bisect good 8584aaf1c3262ca17d1e4a614ede9179ef462bb0
# good: [e93c9c99a629c61837d5a7fc2120cd2b6c70dbdd] Linux 5.1
git bisect good e93c9c99a629c61837d5a7fc2120cd2b6c70dbdd
# skip: [a2d635decbfa9c1e4ae15cb05b68b2559f7f827c] Merge tag 'drm-next-2019-05-09' of git://anongit.freedesktop.org/drm/drm
git bisect skip a2d635decbfa9c1e4ae15cb05b68b2559f7f827c
# good: [ee8146aad87cd8eeb5963856ac0b9a9176392e3a] coresight: dynamic-replicator: Clean up error handling
git bisect good ee8146aad87cd8eeb5963856ac0b9a9176392e3a
# good: [2e1f164861e500f4e068a9d909bbd3fcc7841483] net: hns: Fix loopback test failed at copper ports
git bisect good 2e1f164861e500f4e068a9d909bbd3fcc7841483
# good: [c884d8ac7ffccc094e9674a3eb3be90d3b296c0a] Merge tag 'spdx-5.2-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/spdx
git bisect good c884d8ac7ffccc094e9674a3eb3be90d3b296c0a
# bad: [1ba0d730c0ca6825225171b74721bc75f3d12da8] bcache: fix potential deadlock in cached_def_free()
git bisect bad 1ba0d730c0ca6825225171b74721bc75f3d12da8
# good: [a5fff14a0c7989fbc8316a43f52aed1804f02ddd] Merge branch 'akpm' (patches from Andrew)
git bisect good a5fff14a0c7989fbc8316a43f52aed1804f02ddd
# good: [42db12d5cd081964e1844dad1f5f4088921fd303] ice: Gracefully handle reset failure in ice_alloc_vfs()
git bisect good 42db12d5cd081964e1844dad1f5f4088921fd303
# good: [161c926ba6f0bb779c0fb860d3cf390eb314d345] perf/x86/intel: Add more Icelake CPUIDs
git bisect good 161c926ba6f0bb779c0fb860d3cf390eb314d345
# good: [9a9ff8f128445688f43b9afc1b837a3de4548586] media: coda: increment sequence offset for the last returned frame
git bisect good 9a9ff8f128445688f43b9afc1b837a3de4548586
# good: [381dc73f8216252904d6578d7229282029aa430d] netfilter: ctnetlink: Fix regression in conntrack entry deletion
git bisect good 381dc73f8216252904d6578d7229282029aa430d


* Re: PROBLEM: Regression of MMU causing guest VM application errors
  2019-10-31  3:44             ` Derek Yerger
@ 2019-11-19 20:01               ` Sean Christopherson
  2019-11-20  9:19                 ` Wanpeng Li
  2019-11-20 18:19                 ` Sean Christopherson
  0 siblings, 2 replies; 19+ messages in thread
From: Sean Christopherson @ 2019-11-19 20:01 UTC (permalink / raw)
  To: Derek Yerger; +Cc: Alex Williamson, kvm, Bonzini, Paolo

On Wed, Oct 30, 2019 at 11:44:09PM -0400, Derek Yerger wrote:
> 
> On 10/24/19 1:32 PM, Sean Christopherson wrote:
> >On Thu, Oct 24, 2019 at 11:18:59AM -0400, Derek Yerger wrote:
> >>On 10/22/19 4:28 PM, Sean Christopherson wrote:
> >>>On Thu, Oct 17, 2019 at 07:57:35PM -0400, Derek Yerger wrote:
> >>>Heh, should've checked from the get go...  It's definitely not the memslot
> >>>issue, because the memslot bug is in 5.1.16 as well.  :-)
> >>I didn't pick up on that, nice catch. The memslot thread was the closest
> >>thing I could find to an educated guess.
> >>>>I'm stuck on 5.1.x for now, maybe I'll give up and get a dedicated windows
> >>>>machine /s
> >>>What hardware are you running on?  I was thinking this was AMD specific,
> >>>but then realized you said "AMD Radeon 540 GPU" and not "AMD CPU".
> >>Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz
> >>
> >>07:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI]
> >>Lexa PRO [Radeon 540/540X/550/550X / RX 540X/550/550X] (rev c7)
> >>         Subsystem: Gigabyte Technology Co., Ltd Device 22fe
> >>         Kernel driver in use: vfio-pci
> >>         Kernel modules: amdgpu
> >>(plus related audio device)
> >>
> >>I can't think of any other data points that would be helpful to solving
> >>system instability in a guest OS.
> >Can you bisect starting from v5.2?  Identifying which commit in the kernel
> >introduced the regression would help immensely.
> On the host, I have to install NVIDIA GPU drivers with each new kernel
> build. During the process I discovered that I can't reproduce the issue on
> any kernel if I skip the *host* GPU drivers and start libvirtd in single
> mode.
> 
> I noticed the following in the host kernel log around the time the guest
> encountered BSOD on 5.2.7:
> 
> [  337.841491] WARNING: CPU: 6 PID: 7548 at arch/x86/kvm/x86.c:7963
> kvm_arch_vcpu_ioctl_run+0x19b1/0x1b00 [kvm]

Rats, I overlooked this the first time round.  In the future, if you get a
WARN splat, try to make it very obvious in the bug report; they're almost
always a smoking gun.

The WARN that fired is:

        /* The preempt notifier should have taken care of the FPU already.  */
        WARN_ON_ONCE(test_thread_flag(TIF_NEED_FPU_LOAD));

which was added as part of a bug fix by commit:

	240c35a3783a ("kvm: x86: Use task structs fpu field for user")

the buggy commit that was fixed is

	5f409e20b794 ("x86/fpu: Defer FPU state load until return to userspace")

which was part of an FPU rewrite that went into 5.2[*].  So yep, big
smoking gun :-)

My understanding of the WARN is that it means the kernel's FPU state is
unexpectedly loaded when entry to the KVM guest is imminent.  As for *how*
the kernel's FPU state is getting loaded, no clue.  But, I think it'd be
pretty easy to find the culprit by adding a debug flag to struct
thread_info that is set in vcpu_load() and cleared in vcpu_put(), and
then WARNing in set_ti_thread_flag() if the debug flag is true when
TIF_NEED_FPU_LOAD is being set.  I'll put together a debugging patch later
today and send it your way.
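
To make the idea concrete, here is a minimal userspace sketch of that
debugging approach.  It is only a model under stated assumptions, not the
debugging patch itself: struct thread_info_model, the in_vcpu_section
field, the model_* helpers and the bit number used for TIF_NEED_FPU_LOAD
are all hypothetical stand-ins for illustration.  The real change would
touch struct thread_info, vcpu_load(), vcpu_put() and
set_ti_thread_flag() in the kernel.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Illustrative bit number only; check the kernel headers for the real one. */
#define TIF_NEED_FPU_LOAD 14

/* Hypothetical userspace stand-in for struct thread_info. */
struct thread_info_model {
	unsigned long flags;
	bool in_vcpu_section;        /* hypothetical debug flag */
	unsigned int debug_warnings; /* counts how often the check fired */
};

/* Models vcpu_load(): mark the start of the vCPU-loaded section. */
void model_vcpu_load(struct thread_info_model *ti)
{
	ti->in_vcpu_section = true;
}

/* Models vcpu_put(): mark the end of the vCPU-loaded section. */
void model_vcpu_put(struct thread_info_model *ti)
{
	ti->in_vcpu_section = false;
}

/*
 * Models set_ti_thread_flag() with the extra check: complain when
 * TIF_NEED_FPU_LOAD is set while the debug flag is true.  In the kernel
 * this would be a WARN so that the stack trace names the caller.
 */
void model_set_ti_thread_flag(struct thread_info_model *ti, int flag)
{
	if (flag == TIF_NEED_FPU_LOAD && ti->in_vcpu_section) {
		ti->debug_warnings++;
		fprintf(stderr,
			"WARN: TIF_NEED_FPU_LOAD set between vcpu_load() and vcpu_put()\n");
	}
	ti->flags |= 1UL << flag;
}

/* Sets the flag once outside and once inside the section. */
unsigned int model_demo(void)
{
	struct thread_info_model ti = { 0 };

	model_set_ti_thread_flag(&ti, TIF_NEED_FPU_LOAD); /* outside: silent */
	ti.flags = 0;

	model_vcpu_load(&ti);
	model_set_ti_thread_flag(&ti, TIF_NEED_FPU_LOAD); /* inside: warns */
	model_vcpu_put(&ti);

	return ti.debug_warnings;
}
```

In this model only the flag set between model_vcpu_load() and
model_vcpu_put() is counted, so model_demo() reports a single warning.
The point of placing the check in the flag setter is that the warning
fires at the moment the flag is set, rather than later when KVM notices
it on the way into the guest.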

[*] https://lkml.kernel.org/r/20190403164156.19645-1-bigeasy@linutronix.de


* Re: PROBLEM: Regression of MMU causing guest VM application errors
  2019-11-19 20:01               ` Sean Christopherson
@ 2019-11-20  9:19                 ` Wanpeng Li
  2019-11-20  9:57                   ` Paolo Bonzini
  2019-11-20 18:19                 ` Sean Christopherson
  1 sibling, 1 reply; 19+ messages in thread
From: Wanpeng Li @ 2019-11-20  9:19 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: Derek Yerger, Alex Williamson, kvm, Bonzini, Paolo

On Wed, 20 Nov 2019 at 04:03, Sean Christopherson
<sean.j.christopherson@intel.com> wrote:
>
> On Wed, Oct 30, 2019 at 11:44:09PM -0400, Derek Yerger wrote:
> >
> > On 10/24/19 1:32 PM, Sean Christopherson wrote:
> > >On Thu, Oct 24, 2019 at 11:18:59AM -0400, Derek Yerger wrote:
> > >>On 10/22/19 4:28 PM, Sean Christopherson wrote:
> > >>>On Thu, Oct 17, 2019 at 07:57:35PM -0400, Derek Yerger wrote:
> > >>>Heh, should've checked from the get go...  It's definitely not the memslot
> > >>>issue, because the memslot bug is in 5.1.16 as well.  :-)
> > >>I didn't pick up on that, nice catch. The memslot thread was the closest
> > >>thing I could find to an educated guess.
> > >>>>I'm stuck on 5.1.x for now, maybe I'll give up and get a dedicated windows
> > >>>>machine /s
> > >>>What hardware are you running on?  I was thinking this was AMD specific,
> > >>>but then realized you said "AMD Radeon 540 GPU" and not "AMD CPU".
> > >>Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz
> > >>
> > >>07:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI]
> > >>Lexa PRO [Radeon 540/540X/550/550X / RX 540X/550/550X] (rev c7)
> > >>         Subsystem: Gigabyte Technology Co., Ltd Device 22fe
> > >>         Kernel driver in use: vfio-pci
> > >>         Kernel modules: amdgpu
> > >>(plus related audio device)
> > >>
> > >>I can't think of any other data points that would be helpful to solving
> > >>system instability in a guest OS.
> > >Can you bisect starting from v5.2?  Identifying which commit in the kernel
> > >introduced the regression would help immensely.
> > On the host, I have to install NVIDIA GPU drivers with each new kernel
> > build. During the process I discovered that I can't reproduce the issue on
> > any kernel if I skip the *host* GPU drivers and start libvirtd in single
> > mode.
> >
> > I noticed the following in the host kernel log around the time the guest
> > encountered BSOD on 5.2.7:
> >
> > [  337.841491] WARNING: CPU: 6 PID: 7548 at arch/x86/kvm/x86.c:7963
> > kvm_arch_vcpu_ioctl_run+0x19b1/0x1b00 [kvm]
>
> Rats, I overlooked this first time round.  In the future, if you get a
> WARN splat, try to make it very obvious in the bug report, they're almost
> always a smoking gun.
>
> That WARN that fired is:
>
>         /* The preempt notifier should have taken care of the FPU already.  */
>         WARN_ON_ONCE(test_thread_flag(TIF_NEED_FPU_LOAD));
>
> which was added as part of a bug fix by commit:
>
>         240c35a3783a ("kvm: x86: Use task structs fpu field for user")
>
> the buggy commit that was fixed is
>
>         5f409e20b794 ("x86/fpu: Defer FPU state load until return to userspace")
>
> which was part of an FPU rewrite that went into 5.2[*].  So yep, big
> smoking gun :-)

Since 5.3-rc2, we have three commits that fix it.

commit ec269475cba7bc (Revert "kvm: x86: Use task structs fpu field for user")
commit e751732486eb3 (KVM: X86: Fix fpu state crash in kvm guest)
commit d9a710e5fc4941 (KVM: X86: Dynamically allocate user_fpu)

    Wanpeng


* Re: PROBLEM: Regression of MMU causing guest VM application errors
  2019-11-20  9:19                 ` Wanpeng Li
@ 2019-11-20  9:57                   ` Paolo Bonzini
  0 siblings, 0 replies; 19+ messages in thread
From: Paolo Bonzini @ 2019-11-20  9:57 UTC (permalink / raw)
  To: Wanpeng Li, Sean Christopherson; +Cc: Derek Yerger, Alex Williamson, kvm

On 20/11/19 10:19, Wanpeng Li wrote:
> Since 5.3-rc2, we have three commits that fix it.
> 
> commit ec269475cba7bc (Revert "kvm: x86: Use task structs fpu field for user")
> commit e751732486eb3 (KVM: X86: Fix fpu state crash in kvm guest)

These two should have been included in 5.2 though, see
https://bugzilla.kernel.org/show_bug.cgi?id=204209.

So this would be a separate bug in the FPU rewrite.

Paolo

> commit d9a710e5fc4941 (KVM: X86: Dynamically allocate user_fpu)




* Re: PROBLEM: Regression of MMU causing guest VM application errors
  2019-11-19 20:01               ` Sean Christopherson
  2019-11-20  9:19                 ` Wanpeng Li
@ 2019-11-20 18:19                 ` Sean Christopherson
  2019-11-20 19:04                   ` Derek Yerger
  1 sibling, 1 reply; 19+ messages in thread
From: Sean Christopherson @ 2019-11-20 18:19 UTC (permalink / raw)
  To: Derek Yerger; +Cc: Alex Williamson, kvm, Bonzini, Paolo

[-- Attachment #1: Type: text/plain, Size: 1786 bytes --]

On Tue, Nov 19, 2019 at 12:01:33PM -0800, Sean Christopherson wrote:
> On Wed, Oct 30, 2019 at 11:44:09PM -0400, Derek Yerger wrote:
> > I noticed the following in the host kernel log around the time the guest
> > encountered BSOD on 5.2.7:
> > 
> > [  337.841491] WARNING: CPU: 6 PID: 7548 at arch/x86/kvm/x86.c:7963
> > kvm_arch_vcpu_ioctl_run+0x19b1/0x1b00 [kvm]
> 
> Rats, I overlooked this first time round.  In the future, if you get a
> WARN splat, try to make it very obvious in the bug report, they're almost
> always a smoking gun.
> 
> That WARN that fired is:
> 
>         /* The preempt notifier should have taken care of the FPU already.  */
>         WARN_ON_ONCE(test_thread_flag(TIF_NEED_FPU_LOAD));
> 
> which was added as part of a bug fix by commit:
> 
> 	240c35a3783a ("kvm: x86: Use task structs fpu field for user")
> 
> the buggy commit that was fixed is
> 
> 	5f409e20b794 ("x86/fpu: Defer FPU state load until return to userspace")
> 
> which was part of an FPU rewrite that went into 5.2[*].  So yep, big
> smoking gun :-)
> 
> My understanding of the WARN is that it means the kernel's FPU state is
> unexpectedly loaded when entry to the KVM guest is imminent.  As for *how*
> the kernel's FPU state is getting loaded, no clue.  But, I think it'd be
> pretty easy to find the culprit by adding a debug flag into struct
> thread_info that gets set in vcpu_load() and cleared in vcpu_put(),
> and then WARN in set_ti_thread_flag() if the debug flag is true when
> TIF_NEED_FPU_LOAD is being set.  I'll put together a debugging patch later
> today and send it your way.

Debug patch attached.  Hopefully it finds something, it took me an
embarrassing number of attempts to get correct, I kept screwing up checking
a bit number versus checking a bit mask...
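
For anyone following along, the bit number versus bit mask mix-up mentioned
above can be shown with a small userspace sketch. This is not kernel code: the
demo_* helpers and DEMO_TIF_NEED_FPU_LOAD (set to 14 here purely as an
illustrative value) are hypothetical stand-ins. The point is that the TIF_*
constants passed to set_ti_thread_flag() are bit numbers, so comparing the
`flag` argument against the constant is fine, while testing the flags word
itself requires shifting the number into a mask first.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Hypothetical stand-in for a TIF_* flag: a bit NUMBER, not a mask. */
#define DEMO_TIF_NEED_FPU_LOAD 14

/* Set a flag by bit number, loosely modeled on set_ti_thread_flag(). */
static void demo_set_flag(unsigned long *flags, unsigned int bit)
{
	assert(bit < sizeof(*flags) * CHAR_BIT);
	*flags |= 1UL << bit;
}

/* Correct: turn the bit number into a mask before testing the word. */
static bool demo_test_flag(unsigned long flags, unsigned int bit)
{
	return (flags & (1UL << bit)) != 0;
}

/* Buggy: uses the bit number itself as the mask (14 is 0b1110). */
static bool demo_test_flag_buggy(unsigned long flags, unsigned int bit)
{
	return (flags & bit) != 0;
}
```

With only bit 14 set, the flags word is 0x4000. The buggy helper computes
0x4000 & 14, which is zero, so it misses the flag; it would instead fire on
unrelated low bits such as bit 1.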

[-- Attachment #2: 0001-thread_info-Add-a-debug-hook-to-detect-FPU-changes-w.patch --]
[-- Type: text/x-diff, Size: 1942 bytes --]

From 6288031dacbe753b84515d330f62c1f8ed31d932 Mon Sep 17 00:00:00 2001
From: Sean Christopherson <sean.j.christopherson@intel.com>
Date: Wed, 20 Nov 2019 10:12:56 -0800
Subject: [PATCH] thread_info: Add a debug hook to detect FPU changes while a
 vCPU is loaded

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
---
 arch/x86/include/asm/thread_info.h | 2 ++
 arch/x86/kvm/x86.c                 | 4 ++++
 include/linux/thread_info.h        | 1 +
 3 files changed, 7 insertions(+)

diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index f9453536f9bb..7b697005cc51 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -56,6 +56,8 @@ struct task_struct;
 struct thread_info {
 	unsigned long		flags;		/* low level flags */
 	u32			status;		/* thread synchronous flags */
+	bool			vcpu_loaded;
+
 };
 
 #define INIT_THREAD_INFO(tsk)			\
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index a8ad3a4d86b1..3d9c049e749e 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3303,6 +3303,8 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 	}
 
 	kvm_make_request(KVM_REQ_STEAL_UPDATE, vcpu);
+
+	current_thread_info()->vcpu_loaded = 1;
 }
 
 static void kvm_steal_time_set_preempted(struct kvm_vcpu *vcpu)
@@ -3322,6 +3324,8 @@ void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
 {
 	int idx;
 
+	current_thread_info()->vcpu_loaded = 0;
+
 	if (vcpu->preempted)
 		vcpu->arch.preempted_in_kernel = !kvm_x86_ops->get_cpl(vcpu);
 
diff --git a/include/linux/thread_info.h b/include/linux/thread_info.h
index 8d8821b3689a..016c2c887354 100644
--- a/include/linux/thread_info.h
+++ b/include/linux/thread_info.h
@@ -52,6 +52,7 @@ enum {
 
 static inline void set_ti_thread_flag(struct thread_info *ti, int flag)
 {
+	WARN_ON_ONCE(ti->vcpu_loaded && flag == TIF_NEED_FPU_LOAD);
 	set_bit(flag, (unsigned long *)&ti->flags);
 }
 
-- 
2.24.0



* Re: PROBLEM: Regression of MMU causing guest VM application errors
  2019-11-20 18:19                 ` Sean Christopherson
@ 2019-11-20 19:04                   ` Derek Yerger
  2019-11-20 19:28                     ` Sean Christopherson
  0 siblings, 1 reply; 19+ messages in thread
From: Derek Yerger @ 2019-11-20 19:04 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: Alex Williamson, kvm, Bonzini, Paolo


> Debug patch attached.  Hopefully it finds something, it took me an
> embarrassing number of attempts to get correct, I kept screwing up checking
> a bit number versus checking a bit mask...
> <0001-thread_info-Add-a-debug-hook-to-detect-FPU-changes-w.patch>

Should this still be tested despite Wanpeng Li’s comments that the issue may have been fixed in a 5.3 release candidate?


* Re: PROBLEM: Regression of MMU causing guest VM application errors
  2019-11-20 19:04                   ` Derek Yerger
@ 2019-11-20 19:28                     ` Sean Christopherson
  2019-11-27 15:24                       ` Sean Christopherson
  0 siblings, 1 reply; 19+ messages in thread
From: Sean Christopherson @ 2019-11-20 19:28 UTC (permalink / raw)
  To: Derek Yerger; +Cc: Alex Williamson, kvm, Bonzini, Paolo

On Wed, Nov 20, 2019 at 02:04:38PM -0500, Derek Yerger wrote:
> 
> > Debug patch attached.  Hopefully it finds something, it took me an
> > embarrassing number of attempts to get correct, I kept screwing up checking
> > a bit number versus checking a bit mask...
> > <0001-thread_info-Add-a-debug-hook-to-detect-FPU-changes-w.patch>
> 
> Should this still be tested despite Wanpeng Li’s comments that the issue may
> have been fixed in a 5.3 release candidate?

Yes.

The actual bug fix, commit e751732486eb3 (KVM: X86: Fix fpu state crash in
kvm guest), is present in v5.2.7.

Unless there's a subtlety I'm missing, commit d9a710e5fc4941 (KVM: X86:
Dynamically allocate user_fpu) is purely an optimization and should not
have a functional impact.


* Re: PROBLEM: Regression of MMU causing guest VM application errors
  2019-11-20 19:28                     ` Sean Christopherson
@ 2019-11-27 15:24                       ` Sean Christopherson
  2019-12-17 23:11                         ` Sean Christopherson
  0 siblings, 1 reply; 19+ messages in thread
From: Sean Christopherson @ 2019-11-27 15:24 UTC (permalink / raw)
  To: Derek Yerger; +Cc: Alex Williamson, kvm, Bonzini, Paolo

On Wed, Nov 20, 2019 at 11:28:43AM -0800, Sean Christopherson wrote:
> On Wed, Nov 20, 2019 at 02:04:38PM -0500, Derek Yerger wrote:
> > 
> > > Debug patch attached.  Hopefully it finds something, it took me an
> > > embarrassing number of attempts to get correct, I kept screwing up checking
> > > a bit number versus checking a bit mask...
> > > <0001-thread_info-Add-a-debug-hook-to-detect-FPU-changes-w.patch>
> > 
> > Should this still be tested despite Wanpeng Li’s comments that the issue may
> > have been fixed in a 5.3 release candidate?
> 
> Yes.
> 
> The actual bug fix, commit e751732486eb3 (KVM: X86: Fix fpu state crash in
> kvm guest), is present in v5.2.7.
> 
> Unless there's a subtlety I'm missing, commit d9a710e5fc4941 (KVM: X86:
> Dynamically allocate user_fpu) is purely an optimization and should not
> have a functional impact.

---

Any chance the below change fixes your issue?  It's a bug fix for AVX
corruption during signal delivery[*].  It doesn't seem like the same thing
you are seeing, but it's worth trying.

[*] https://lkml.kernel.org/r/20191127124243.u74osvlkhcmsskng@linutronix.de/

 arch/x86/include/asm/fpu/internal.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/fpu/internal.h b/arch/x86/include/asm/fpu/internal.h
index 4c95c365058aa..44c48e34d7994 100644
--- a/arch/x86/include/asm/fpu/internal.h
+++ b/arch/x86/include/asm/fpu/internal.h
@@ -509,7 +509,7 @@ static inline void __fpu_invalidate_fpregs_state(struct fpu *fpu)
 
 static inline int fpregs_state_valid(struct fpu *fpu, unsigned int cpu)
 {
-	return fpu == this_cpu_read_stable(fpu_fpregs_owner_ctx) && cpu == fpu->last_cpu;
+	return fpu == this_cpu_read(fpu_fpregs_owner_ctx) && cpu == fpu->last_cpu;
 }
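
To make the intent of the check a bit more concrete, here is a simplified
userspace model of fpregs_state_valid(). It is a sketch under assumptions, not
the kernel implementation: demo_owner_ctx[] stands in for the per-CPU
fpu_fpregs_owner_ctx variable, and demo_load_fpregs() is a hypothetical helper
that records ownership in a single step. The model only shows why both
conditions are needed. As I understand it, this_cpu_read_stable() permits the
compiler to reuse an earlier load, so the one-line change matters when the
owner can change between reads.

```c
#include <stdbool.h>
#include <stddef.h>

#define DEMO_NR_CPUS 4

/* Hypothetical, stripped-down stand-in for struct fpu. */
struct demo_fpu {
	unsigned int last_cpu;	/* CPU whose registers last held this state */
};

/* Stand-in for the per-CPU fpu_fpregs_owner_ctx pointer. */
static struct demo_fpu *demo_owner_ctx[DEMO_NR_CPUS];

/* Record that @cpu's registers now hold @fpu's state (simplified). */
static void demo_load_fpregs(struct demo_fpu *fpu, unsigned int cpu)
{
	demo_owner_ctx[cpu] = fpu;
	fpu->last_cpu = cpu;
}

/*
 * Registers are valid for @fpu only if this CPU still names @fpu as the
 * owner AND @fpu was last loaded on this CPU.
 */
static bool demo_fpregs_state_valid(const struct demo_fpu *fpu,
				    unsigned int cpu)
{
	return cpu < DEMO_NR_CPUS &&
	       fpu == demo_owner_ctx[cpu] &&
	       cpu == fpu->last_cpu;
}
```

In this model, a stale copy of the owner pointer would make the first
comparison succeed after another context has taken over the registers, which
is the kind of false "valid" answer a fresh read avoids.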


* Re: PROBLEM: Regression of MMU causing guest VM application errors
  2019-11-27 15:24                       ` Sean Christopherson
@ 2019-12-17 23:11                         ` Sean Christopherson
  2019-12-17 23:13                           ` Derek Yerger
  2020-01-02 13:42                           ` Derek Yerger
  0 siblings, 2 replies; 19+ messages in thread
From: Sean Christopherson @ 2019-12-17 23:11 UTC (permalink / raw)
  To: Derek Yerger; +Cc: Alex Williamson, kvm, Bonzini, Paolo

On Wed, Nov 27, 2019 at 07:24:09AM -0800, Sean Christopherson wrote:
> On Wed, Nov 20, 2019 at 11:28:43AM -0800, Sean Christopherson wrote:
> > On Wed, Nov 20, 2019 at 02:04:38PM -0500, Derek Yerger wrote:
> > > 
> > > > Debug patch attached.  Hopefully it finds something, it took me an
> > > > embarrassing number of attempts to get correct, I kept screwing up checking
> > > > a bit number versus checking a bit mask...
> > > > <0001-thread_info-Add-a-debug-hook-to-detect-FPU-changes-w.patch>
> > > 
> > > Should this still be tested despite Wanpeng Li’s comments that the issue may
> > > have been fixed in a 5.3 release candidate?
> > 
> > Yes.
> > 
> > The actual bug fix, commit e751732486eb3 (KVM: X86: Fix fpu state crash in
> > kvm guest), is present in v5.2.7.
> > 
> > Unless there's a subtlety I'm missing, commit d9a710e5fc4941 (KVM: X86:
> > Dynamically allocate user_fpu) is purely an optimization and should not
> > have a functional impact.

Any update on this?  Syzkaller also appears to be hitting this[*], but it
hasn't been able to generate a reproducer.

[*] https://syzkaller.appspot.com/bug?extid=00be5da1d75f1cc95f6b


> ---
> 
> Any chance the below change fixes your issue?  It's a bug fix for AVX
> corruption during signal delivery[*].  It doesn't seem like the same thing
> you are seeing, but it's worth trying.
> 
> [*] https://lkml.kernel.org/r/20191127124243.u74osvlkhcmsskng@linutronix.de/
> 
>  arch/x86/include/asm/fpu/internal.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/arch/x86/include/asm/fpu/internal.h b/arch/x86/include/asm/fpu/internal.h
> index 4c95c365058aa..44c48e34d7994 100644
> --- a/arch/x86/include/asm/fpu/internal.h
> +++ b/arch/x86/include/asm/fpu/internal.h
> @@ -509,7 +509,7 @@ static inline void __fpu_invalidate_fpregs_state(struct fpu *fpu)
>  
>  static inline int fpregs_state_valid(struct fpu *fpu, unsigned int cpu)
>  {
> -	return fpu == this_cpu_read_stable(fpu_fpregs_owner_ctx) && cpu == fpu->last_cpu;
> +	return fpu == this_cpu_read(fpu_fpregs_owner_ctx) && cpu == fpu->last_cpu;
>  }


* Re: PROBLEM: Regression of MMU causing guest VM application errors
  2019-12-17 23:11                         ` Sean Christopherson
@ 2019-12-17 23:13                           ` Derek Yerger
  2020-01-02 13:42                           ` Derek Yerger
  1 sibling, 0 replies; 19+ messages in thread
From: Derek Yerger @ 2019-12-17 23:13 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: Alex Williamson, kvm, Bonzini, Paolo

On 12/17/19 6:11 PM, Sean Christopherson wrote:
> On Wed, Nov 27, 2019 at 07:24:09AM -0800, Sean Christopherson wrote:
>> On Wed, Nov 20, 2019 at 11:28:43AM -0800, Sean Christopherson wrote:
>>> On Wed, Nov 20, 2019 at 02:04:38PM -0500, Derek Yerger wrote:
>>>>> Debug patch attached.  Hopefully it finds something, it took me an
>>>>> embarrassing number of attempts to get correct, I kept screwing up checking
>>>>> a bit number versus checking a bit mask...
>>>>> <0001-thread_info-Add-a-debug-hook-to-detect-FPU-changes-w.patch>
>>>> Should this still be tested despite Wanpeng Li’s comments that the issue may
>>>> have been fixed in a 5.3 release candidate?
>>> Yes.
>>>
>>> The actual bug fix, commit e751732486eb3 (KVM: X86: Fix fpu state crash in
>>> kvm guest), is present in v5.2.7.
>>>
>>> Unless there's a subtlety I'm missing, commit d9a710e5fc4941 (KVM: X86:
>>> Dynamically allocate user_fpu) is purely an optimization and should not
>>> have a functional impact.
> Any update on this?  Syzkaller also appears to be hitting this[*], but it
> hasn't been able to generate a reproducer.
>
> [*] https://syzkaller.appspot.com/bug?extid=00be5da1d75f1cc95f6b
I have the kernel built and ready to test. I need the guest VM in a functioning 
state this week, so I can't test yet. I will post results as soon as they're 
available.
>
>> ---
>>
>> Any chance the below change fixes your issue?  It's a bug fix for AVX
>> corruption during signal delivery[*].  It doesn't seem like the same thing
>> you are seeing, but it's worth trying.
>>
>> [*] https://lkml.kernel.org/r/20191127124243.u74osvlkhcmsskng@linutronix.de/
>>
>>   arch/x86/include/asm/fpu/internal.h | 2 +-
>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/arch/x86/include/asm/fpu/internal.h b/arch/x86/include/asm/fpu/internal.h
>> index 4c95c365058aa..44c48e34d7994 100644
>> --- a/arch/x86/include/asm/fpu/internal.h
>> +++ b/arch/x86/include/asm/fpu/internal.h
>> @@ -509,7 +509,7 @@ static inline void __fpu_invalidate_fpregs_state(struct fpu *fpu)
>>   
>>   static inline int fpregs_state_valid(struct fpu *fpu, unsigned int cpu)
>>   {
>> -	return fpu == this_cpu_read_stable(fpu_fpregs_owner_ctx) && cpu == fpu->last_cpu;
>> +	return fpu == this_cpu_read(fpu_fpregs_owner_ctx) && cpu == fpu->last_cpu;
>>   }



* Re: PROBLEM: Regression of MMU causing guest VM application errors
  2019-12-17 23:11                         ` Sean Christopherson
  2019-12-17 23:13                           ` Derek Yerger
@ 2020-01-02 13:42                           ` Derek Yerger
  1 sibling, 0 replies; 19+ messages in thread
From: Derek Yerger @ 2020-01-02 13:42 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: Alex Williamson, kvm, Bonzini, Paolo

On 12/17/19 6:11 PM, Sean Christopherson wrote:
> On Wed, Nov 27, 2019 at 07:24:09AM -0800, Sean Christopherson wrote:
>> On Wed, Nov 20, 2019 at 11:28:43AM -0800, Sean Christopherson wrote:
>>> On Wed, Nov 20, 2019 at 02:04:38PM -0500, Derek Yerger wrote:
>>>>> Debug patch attached.  Hopefully it finds something, it took me an
>>>>> embarrassing number of attempts to get correct, I kept screwing up checking
>>>>> a bit number versus checking a bit mask...
>>>>> <0001-thread_info-Add-a-debug-hook-to-detect-FPU-changes-w.patch>
>>>> Should this still be tested despite Wanpeng Li’s comments that the issue may
>>>> have been fixed in a 5.3 release candidate?
>>> Yes.
>>>
>>> The actual bug fix, commit e751732486eb3 (KVM: X86: Fix fpu state crash in
>>> kvm guest), is present in v5.2.7.
>>>
>>> Unless there's a subtlety I'm missing, commit d9a710e5fc4941 (KVM: X86:
>>> Dynamically allocate user_fpu) is purely an optimization and should not
>>> have a functional impact.
> Any update on this?  Syzkaller also appears to be hitting this[*], but it
> hasn't been able to generate a reproducer.
>
> [*] https://syzkaller.appspot.com/bug?extid=00be5da1d75f1cc95f6b
>
Still working on it. Not sure why, but my initrd images have quadrupled in 
size with the latest kernel, so I'm at an impasse and stuck at 5.2 until I can 
enlarge my /boot.

Will try to fix this week.


Thread overview: 19+ messages
2019-10-16  4:49 PROBLEM: Regression of MMU causing guest VM application errors Derek Yerger
2019-10-16  7:28 ` Paolo Bonzini
2019-10-16 17:28 ` Alex Williamson
2019-10-16 17:49   ` Sean Christopherson
2019-10-17 23:57     ` Derek Yerger
2019-10-22 20:28       ` Sean Christopherson
2019-10-24 15:18         ` Derek Yerger
2019-10-24 17:32           ` Sean Christopherson
2019-10-31  3:44             ` Derek Yerger
2019-11-19 20:01               ` Sean Christopherson
2019-11-20  9:19                 ` Wanpeng Li
2019-11-20  9:57                   ` Paolo Bonzini
2019-11-20 18:19                 ` Sean Christopherson
2019-11-20 19:04                   ` Derek Yerger
2019-11-20 19:28                     ` Sean Christopherson
2019-11-27 15:24                       ` Sean Christopherson
2019-12-17 23:11                         ` Sean Christopherson
2019-12-17 23:13                           ` Derek Yerger
2020-01-02 13:42                           ` Derek Yerger