* Changes in kernel 5.18-rc1 leads to crashes in VirtualBox Virtual Machines
@ 2022-05-01 17:26 Larry Finger
  2022-05-01 17:47 ` Jason A. Donenfeld
  0 siblings, 1 reply; 8+ messages in thread
From: Larry Finger @ 2022-05-01 17:26 UTC (permalink / raw)
  To: Jason A. Donenfeld; +Cc: LKML

Jason,

I maintain VirtualBox for openSUSE. When kernel 5.18-rc1 was released, I fixed 
the usual set of API changes needed to compile the external kernel modules for 
VB. Despite a clean compile, I am still getting random crashes in the VMs. For 
Linux instances, the desktop disappears, but for Windows guests, the VM crashes 
with unhandled kernel exceptions. As I have no experience tracing such crashes, 
I decided to bisect the kernel to find the commit that started these problems.

Surprisingly, the bisection pointed to commit 6e8ec2552c7d ("random: use 
computational hash for entropy extraction"). I am very sure of the bisection as 
the kernel built from the commit that immediately precedes this one, 
cfb92440ee71 - a tag commit by Linus, runs correctly.
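
For readers unfamiliar with the procedure, the search that a bisection performs can be sketched as a binary search over an ordered list of commits. This is only a simplified illustration in Python: it assumes a linear history and a test that gives the same answer every time, and the names first_bad and is_bad are hypothetical. The real git bisect also handles merges and skipped commits.

```python
def first_bad(commits, is_bad):
    """Return the first commit for which is_bad() is true.

    commits is ordered oldest to newest; commits[0] is assumed to be
    known good and commits[-1] is assumed to be known bad.
    """
    good, bad = 0, len(commits) - 1
    while bad - good > 1:
        mid = (good + bad) // 2
        if is_bad(commits[mid]):
            bad = mid    # the regression is at mid or earlier
        else:
            good = mid   # the regression is after mid
    return commits[bad]


# Hypothetical history: eight commits, regression introduced at "c5".
history = ["c0", "c1", "c2", "c3", "c4", "c5", "c6", "c7"]
culprit = first_bad(history, lambda commit: history.index(commit) >= 5)
print(culprit)
```

If the failure is intermittent, a single passing run does not prove that a commit is good, so each step may need several test runs before the commit is marked.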

Note that I do not believe there is anything wrong with your changes to the 
random number generators. It seems to be a problem with the way the emulator is 
accessing them. The VirtualBox code is quite complicated, and I am no expert 
with C++.

Are there changes that would be required to the X86_64 emulator's access to the 
random number code as a result of your changes? I have found places where the 
emulator accesses /dev/urandom or /dev/random. There are also places that use 
the rdrand and rdseed instructions.

Thanks for reading this,

Larry


* Re: Changes in kernel 5.18-rc1 leads to crashes in VirtualBox Virtual Machines
  2022-05-01 17:26 Changes in kernel 5.18-rc1 leads to crashes in VirtualBox Virtual Machines Larry Finger
@ 2022-05-01 17:47 ` Jason A. Donenfeld
  2022-05-01 21:07   ` Larry Finger
  0 siblings, 1 reply; 8+ messages in thread
From: Jason A. Donenfeld @ 2022-05-01 17:47 UTC (permalink / raw)
  To: Larry Finger; +Cc: LKML

Hi Larry,

Thanks for the report. Several questions:

1) Can you reproduce with 5.18-rc4?

2) Can you send me a stacktrace from the crash or any relevant console
   output?

3) Does the crash happen in the guest or the host?

Question two is very important.

Thanks,
Jason


* Re: Changes in kernel 5.18-rc1 leads to crashes in VirtualBox Virtual Machines
  2022-05-01 17:47 ` Jason A. Donenfeld
@ 2022-05-01 21:07   ` Larry Finger
  2022-05-01 23:32     ` Jason A. Donenfeld
  0 siblings, 1 reply; 8+ messages in thread
From: Larry Finger @ 2022-05-01 21:07 UTC (permalink / raw)
  To: Jason A. Donenfeld; +Cc: LKML

On 5/1/22 12:47, Jason A. Donenfeld wrote:
> Hi Larry,
> 
> Thanks for the report. Several questions:
> 
> 1) Can you reproduce with 5.18-rc4?
> 
> 2) Can you send me a stacktrace from the crash or any relevant console
>     output?
> 
> 3) Does the crash happen in the guest or the host?
> 
> Question two is very important.

Jason,

1. Yes, the problem happens with 5.18-rc4 and -rc5.

3. The crash is in the guest. Nothing unusual is logged in the host.

2. My answer here will be incomplete. There are no stacktraces or console output 
on the host from any of the guest crashes, either in dmesg or under journalctl. 
The desktop just disappears. The VirtualBox log files show nothing for the Linux 
guest, and the following for the Windows instance:

00:00:57.908011 GUI: UIMachineLogicNormal::sltCheckForRequestedVisualStateType: 
Requested-state=0, Machine-state=5
00:01:24.502961 GIM: HyperV: Guest indicates a fatal condition! P0=0x1e 
P1=0xffffffffc0000005 P2=0xfffff8054c61e97c P3=0x0 P4=0x28
00:01:24.503053 GIMHv: BugCheck 1e {ffffffffc0000005, fffff8054c61e97c, 0, 28}
00:01:24.503054 KMODE_EXCEPTION_NOT_HANDLED
00:01:24.503054 P1: ffffffffc0000005 - exception code - STATUS_ACCESS_VIOLATION
00:01:24.503054 P2: fffff8054c61e97c - EIP/RIP
00:01:24.503054 P3: 0000000000000000 - Xcpt param #0
00:01:24.503054 P4: 0000000000000028 - Xcpt param #1

Running a 3rd party dump analyzer shows that the crash happens at 
ntoskrnl.exe+3f7d50. I have installed the Windows debugger, but I think the 
learning curve will be steep. At this point, I have no further info available.
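
The BugCheck line in the log above can be decoded mechanically. What follows is a minimal Python sketch, not part of VirtualBox or of any Windows tooling: the function name and the small lookup tables are hypothetical and contain only the names that appear in the log excerpt. It assumes the usual convention for bugcheck 0x1E with an access violation, in which the third parameter distinguishes a read (0) from a write (1) and the fourth parameter is the address that was accessed.

```python
import re

# Only the names that appear in the log excerpt; not a complete table.
BUGCHECK_NAMES = {0x1E: "KMODE_EXCEPTION_NOT_HANDLED"}
NTSTATUS_NAMES = {0xC0000005: "STATUS_ACCESS_VIOLATION"}
ACCESS_KINDS = {0: "read", 1: "write"}  # assumed meaning of Xcpt param #0

BUGCHECK_RE = re.compile(r"BugCheck\s+([0-9a-fA-F]+)\s*\{([^}]*)\}")


def decode_bugcheck(line):
    """Decode a 'BugCheck <code> {p1, p2, p3, p4}' log line into a dict."""
    match = BUGCHECK_RE.search(line)
    if match is None:
        return None
    code = int(match.group(1), 16)
    params = [int(field.strip(), 16) for field in match.group(2).split(",")]
    if len(params) != 4:
        return None
    p1, p2, p3, p4 = params
    status = p1 & 0xFFFFFFFF  # the log shows the NTSTATUS sign-extended to 64 bits
    return {
        "bugcheck": BUGCHECK_NAMES.get(code, hex(code)),
        "status": NTSTATUS_NAMES.get(status, hex(status)),
        "faulting_rip": p2,
        "access": ACCESS_KINDS.get(p3, hex(p3)),
        "accessed_address": p4,
    }


line = "00:01:24.503053 GIMHv: BugCheck 1e {ffffffffc0000005, fffff8054c61e97c, 0, 28}"
info = decode_bugcheck(line)
print(info["bugcheck"], info["status"], info["access"], hex(info["accessed_address"]))
```

Under those assumptions, the line describes a read of address 0x28 at RIP 0xfffff8054c61e97c. That would be consistent with a NULL base pointer plus a small field offset, although the log alone does not prove it.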

Thanks,

Larry


* Re: Changes in kernel 5.18-rc1 leads to crashes in VirtualBox Virtual Machines
  2022-05-01 21:07   ` Larry Finger
@ 2022-05-01 23:32     ` Jason A. Donenfeld
  2022-05-02  0:11       ` Jason A. Donenfeld
  0 siblings, 1 reply; 8+ messages in thread
From: Jason A. Donenfeld @ 2022-05-01 23:32 UTC (permalink / raw)
  To: Larry Finger; +Cc: LKML

Hi Larry,

On Sun, May 01, 2022 at 04:07:39PM -0500, Larry Finger wrote:
> 1. Yes, the problem happens with 5.18-rc4 and -rc5.

Do you still have your bisection logs handy? Something about this seems
a bit fishy to me, and it might be helpful.

> 2. My answer here will be incomplete. There are no stacktraces or console output 

You're going to have to make it more complete somehow...

> on the host from any of the guest crashes, either in dmesg or under journalctl. 
> The desktop just disappears. The VirtualBox log files show nothing for the Linux 

What do you mean "just disappears"? What is the "desktop"? Do you mean
that the X server segfaults or something? Can you attach a debugger
somewhere and try again? There's got to be something you can do to get
more info.

> guest, and the following for the Windows instance:
> 
> 00:00:57.908011 GUI: UIMachineLogicNormal::sltCheckForRequestedVisualStateType: 
> Requested-state=0, Machine-state=5
> 00:01:24.502961 GIM: HyperV: Guest indicates a fatal condition! P0=0x1e 
> P1=0xffffffffc0000005 P2=0xfffff8054c61e97c P3=0x0 P4=0x28
> 00:01:24.503053 GIMHv: BugCheck 1e {ffffffffc0000005, fffff8054c61e97c, 0, 28}
> 00:01:24.503054 KMODE_EXCEPTION_NOT_HANDLED
> 00:01:24.503054 P1: ffffffffc0000005 - exception code - STATUS_ACCESS_VIOLATION
> 00:01:24.503054 P2: fffff8054c61e97c - EIP/RIP
> 00:01:24.503054 P3: 0000000000000000 - Xcpt param #0
> 00:01:24.503054 P4: 0000000000000028 - Xcpt param #1
> 
> Running a 3rd party dump analyzer shows that the crash happens at 
> ntoskrnl.exe+3f7d50. I have installed the Windows debugger, but I think the 
> learning curve will be steep. At this point, I have no further info available.

Can you email me the minidump files from the crash? In another life
that's not supposed to intersect with lkml, windbg keeps me up at
night...

Also, if you've got some easy steps at repro, that'd be helpful. If I
have to install OpenSUSE in a VM or something and type some commands and
twiddle things here and there, let me know what it takes to get an
environment going. Or, better, if you've got a VM already baked with vbox
installed in it with a VM inside of that that exhibits the issue, that'd
let me take a look.

Jason


* Re: Changes in kernel 5.18-rc1 leads to crashes in VirtualBox Virtual Machines
  2022-05-01 23:32     ` Jason A. Donenfeld
@ 2022-05-02  0:11       ` Jason A. Donenfeld
  2022-05-02  1:05         ` Jason A. Donenfeld
  0 siblings, 1 reply; 8+ messages in thread
From: Jason A. Donenfeld @ 2022-05-02  0:11 UTC (permalink / raw)
  To: Larry Finger; +Cc: LKML

Hey again,

I just installed VirtualBox on top of 5.18-rc4, and then I made a new VM
with a fresh install of OpenSUSE, and everything is fine. No issues at
all.

So you're going to have to provide more information.

Jason


* Re: Changes in kernel 5.18-rc1 leads to crashes in VirtualBox Virtual Machines
  2022-05-02  0:11       ` Jason A. Donenfeld
@ 2022-05-02  1:05         ` Jason A. Donenfeld
  2022-05-02 10:49           ` Jason A. Donenfeld
  0 siblings, 1 reply; 8+ messages in thread
From: Jason A. Donenfeld @ 2022-05-02  1:05 UTC (permalink / raw)
  To: Larry Finger; +Cc: LKML

Hi Larry,

On Mon, May 02, 2022 at 02:11:13AM +0200, Jason A. Donenfeld wrote:
> Hey again,
> 
> I just installed VirtualBox on top of 5.18-rc4, and then I made a new VM
> with a fresh install of OpenSUSE, and everything is fine. No issues at
> all.
> 
> So you're going to have to provide more information.
> 
> Jason

With still no more information provided from you, I've gone scouring and
found your much more informative bug report here:
https://www.virtualbox.org/ticket/20914 along with a larger log here
https://www.virtualbox.org/attachment/ticket/20914/Windows%2010%20Clone-2022-04-24-20-55-56.log

Why would you not have sent me all this information right away? Surely
you know how to report bugs. If you're going to concern me with the
possibility that I've broken something, at least give me enough detail
to be able to do something. Otherwise it's pure frustration.

Anyway, it's still too little information, but I could extract the
Windows build from that log file, pull down ntoskrnl.exe and hope it
roughly matches, and then go to work in IDA Pro trying to figure out
what's going on at ntoskrnl.exe+3f7d50, and if I managed to grab the
right build -- which I more than likely did not -- then that's a `mov
byte ptr gs:853h, 0` in KiInterruptDispatch, which seems entirely
unrelated to the change you mentioned.

So I think it'd be a good moment for you to show your bisect logs so we
can be certain we're after the right thing.

Jason


* Re: Changes in kernel 5.18-rc1 leads to crashes in VirtualBox Virtual Machines
  2022-05-02  1:05         ` Jason A. Donenfeld
@ 2022-05-02 10:49           ` Jason A. Donenfeld
  0 siblings, 0 replies; 8+ messages in thread
From: Jason A. Donenfeld @ 2022-05-02 10:49 UTC (permalink / raw)
  To: Larry Finger; +Cc: LKML

Hi Larry,

On Mon, May 2, 2022 at 4:55 AM Larry Finger <Larry.Finger@lwfinger.net> wrote:
> On 5/1/22 20:05, Jason A. Donenfeld wrote:
> > Hi Larry,
> >
> > On Mon, May 02, 2022 at 02:11:13AM +0200, Jason A. Donenfeld wrote:
> >> Hey again,
> >>
> >> I just installed VirtualBox on top of 5.18-rc4, and then I made a new VM
> >> with a fresh install of OpenSUSE, and everything is fine. No issues at
> >> all.
> >>
> >> So you're going to have to provide more information.
> >>
> >> Jason
> >
> > With still no more information provided from you, I've gone scouring and
> > found your much more informative bug report here:
> > https://www.virtualbox.org/ticket/20914 along with a larger log here
> > https://www.virtualbox.org/attachment/ticket/20914/Windows%2010%20Clone-2022-04-24-20-55-56.log
> >
> > Why would you not have sent me all this information right away? Surely
> > you know how to report bugs. If you're going to concern me with the
> > possibility that I've broken something, at least give me enough detail
> > to be able to do something. Otherwise it's pure frustration.
> >
> > Anyway, it's still too little information, but I could extract the
> > Windows build from that log file, pull down ntoskrnl.exe and hope it
> > roughly matches, and then go to work in IDA Pro trying to figure out
> > what's going on at ntoskrnl.exe+3f7d50, and if I managed to grab the
> > right build -- which I more than likely did not -- then that's a `mov
> > byte ptr gs:853h, 0` in KiInterruptDispatch, which seems entirely
> > unrelated to the change you mentioned.
> >
> > So I think it'd be a good moment for you to show your bisect logs so we
> > can be certain we're after the right thing.
>
> LKML removed from cc due to large files.
>
> Yes, I do know how to report bugs. If you remember my first E-mail, I was just
> looking for some suggestions on how using rdrand and rdseed could conflict with
> your changes. I'm sorry that you think I'm wasting your time.
>
> Where did you get your copy of VirtualBox? Perhaps they have some fixes that I
> do not know about.

I patched
<https://dev.gentoo.org/~polynomial-c/virtualbox/vbox-kernel-module-src-6.1.34.tar.xz>
using <https://xn--4db.cc/AtB1jwli>.

> My bisect logs are gone. I will need to recreate them and I should have them
> tomorrow. I do have my paper log to create the bisect. I will have it for you
> tomorrow.
>
> I ran the VM again and got a slightly different result. The kernel exception was
> at ntoskrnl.exe+458647. The mini dump is attached. The ntosknl.exe is available
> at https:/lwfinger.com/download/ntosknl.exe.gz.

You spelled your URL wrong in two places. Had to guess how to fix it.
Please spend more time with your bug reports. This is already more
painful than it should be.

From looking at the minidump you sent, I don't see how this is related
to the RNG. Maybe something else is wrong with your VirtualBox, and
you're just experiencing a 5.17->5.18 transition. The VirtualBox team
themselves said they haven't released the modules for 5.18 yet.
Then on top of that, maybe you're bisecting wrong.

Anyway, from that minidump...

PROCESS_NAME:  svchost.exe
STACK_TEXT:
ffff8603`177407f8 fffff806`30464647     : 00000000`0000001e ffffffff`c0000005 fffff806`3062797c 00000000`00000000 : nt!KeBugCheckEx
ffff8603`17740800 fffff806`30415dac     : 00000000`00001000 ffff8603`177410a0 ffff8000`00000000 00000000`00000000 : nt!KiDispatchException+0x17c287
ffff8603`17740ec0 fffff806`30411f43     : 00000000`00000001 ffffa20d`a3e00340 00000000`00000060 00000000`00000000 : nt!KiExceptionDispatch+0x12c
ffff8603`177410a0 fffff806`3062797c     : 00000000`000000c8 fffff806`30248da4 00000000`00000000 00000000`00000001 : nt!KiPageFault+0x443
ffff8603`17741230 fffff806`3064606e     : 00000000`00000000 ffffdd8e`e4fe9970 00000000`00000000 00000000`00000000 : nt!MiPfPrepareReadList+0x4c
ffff8603`17741320 fffff806`30645de4     : ffffa20d`ac52dcc0 00000000`00000000 00000000`00000000 ffffdd8e`e4fe9970 : nt!MmPrefetchPagesEx+0x96
ffff8603`17741390 fffff806`3064b349     : 00000000`00000000 ffff8603`00000000 ffffa20d`00000000 00000000`00000006 : nt!PfpPrefetchFilesTrickle+0x2a8
ffff8603`17741480 fffff806`3064bb6e     : ffffa20d`abf59000 ffffa20d`abf59000 ffff8603`177416a0 00000000`00000000 : nt!PfpPrefetchRequestPerform+0x299
ffff8603`177415f0 fffff806`30651679     : 00000000`00000001 fffff806`302c0c01 ffffdd8e`e9e81760 ffffa20d`abf59000 : nt!PfpPrefetchRequest+0x132
ffff8603`17741670 fffff806`3065050d     : ffffdd8e`00000000 00000000`00000000 00000000`1d16c86a 00000000`1d16c801 : nt!PfSetSuperfetchInformation+0x155
ffff8603`17741770 fffff806`304156b5     : 00000000`00000000 00000000`00000000 ffff8603`17741b80 00000000`00000000 : nt!NtSetSystemInformation+0x9bd
ffff8603`17741b00 00007fff`5b9b0274     : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : nt!KiSystemServiceCopyEnd+0x25
00000075`ba37f9c8 00000000`00000000     : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : 0x00007fff`5b9b0274
SYMBOL_NAME:  nt!MiPfPrepareReadList+4c
MODULE_NAME: nt
IMAGE_VERSION:  10.0.19041.1682

Loading up the kernel image, we see:

PAGE:000000014061B946                 mov     r13, rcx
[...]
PAGE:000000014061B96F                 mov     rax, [r13+0]
[...]
PAGE:000000014061B97C                 mov     rdx, [rax+28h]

So it dereferences the first argument of MiPfPrepareReadList(), and then
dereferences offset 0x28 of that, and crashes there. Looks like the same
thing happens in your other traces too, based on the bugcheck code
showing offset 0x28 in those too.
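
The addresses in the stack trace and in the disassembly can be cross-checked with a little arithmetic. The Python sketch below assumes that the listing uses an image base of 0x140000000; that value is read off the PAGE:00000001406... addresses and is not stated anywhere in the thread. The helper names are illustrative only.

```python
# Assumption: the disassembly listing is based at this address.
LISTING_IMAGE_BASE = 0x140000000


def rva_from_listing(listing_address, image_base=LISTING_IMAGE_BASE):
    """Offset of a listing address from the start of the image."""
    return listing_address - image_base


def runtime_base(runtime_address, rva):
    """Load address of the image, given one runtime address and its offset."""
    return runtime_address - rva


# Faulting instruction: `mov rdx, [rax+28h]` at PAGE:000000014061B97C,
# reported at runtime as fffff806`3062797c (nt!MiPfPrepareReadList+0x4c).
fault_rva = rva_from_listing(0x14061B97C)
nt_base = runtime_base(0xFFFFF8063062797C, fault_rva)

# Return address recorded in the nt!KeBugCheckEx frame of the stack trace.
return_rva = 0xFFFFF80630464647 - nt_base

print(hex(fault_rva), hex(nt_base), hex(return_rva))
```

Under that assumption the kernel is loaded at a page-aligned base, and the return address in the KeBugCheckEx frame works out to ntoskrnl.exe+458647, the offset Larry quoted. That offset would then point into KiDispatchException, while the faulting instruction itself is at offset 0x61b97c.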

Anyway, until I can see that bisect log, this is beginning to smell like
a big waste of time.

Jason


* Re: Changes in kernel 5.18-rc1 leads to crashes in VirtualBox Virtual Machines
       [not found] <SJ0PR10MB5742C02D9F7F146A1313BD5DE9CE9@SJ0PR10MB5742.namprd10.prod.outlook.com>
@ 2022-05-17 18:33 ` Larry Finger
  0 siblings, 0 replies; 8+ messages in thread
From: Larry Finger @ 2022-05-17 18:33 UTC (permalink / raw)
  To: Vadim Galitsin, larry.finger, Jason; +Cc: LKML

On 5/17/22 12:27, Vadim Galitsin wrote:
> Hi Larry and Jason,
> 
> I am from VirtualBox team. I noticed your conversation here:
> 
> https://lore.kernel.org/lkml/Ym8uPcuQpq1xBS6d@zx2c4.com/T/#mea7aa731b5524a05ac3b3e8588c0c42235bb33d6
> 
> Please let me add my 5c. I agree with Larry: the issue starts to happen after
> 6e8ec2552c7d. I did not do a complete bisection, but rather tried this revision
> and the one before it (with dcd03ba15947cbad1a34cfed370c4feb41058469 I do not
> see the issue).
> 
> For me this issue is quite reproducible with an Ubuntu 20.04 Linux guest (other
> guests are also affected). It happens even if no VBox Guest Additions are
> installed in the guest. The guest kernel version does not play much of a role.
> Running kernel 5.18-rc1+ on the host side is essential.
> 
> The first way for me to reproduce it is to run the stress-ng(1) tool inside the
> guest and perform random mouse cursor movements (basically, generating mouse or
> keyboard interrupts is somehow essential here). The tool will report the
> following error:
> 
> root@test-VirtualBox:~# stress-ng --vm 4 -t 10
> stress-ng: info:  [5463] dispatching hogs: 4 vm
> stress-ng: fail:  [5464] stress-ng-vm: detected 194065152 bit errors while 
> stressing memory
> stress-ng: error: [5463] process 5464 (stress-ng-vm) terminated with an error, 
> exit status=1 (stress-ng core failure)
> stress-ng: info:  [5463] unsuccessful run completed in 10.06s
> 
> This approach does not work in 100% of cases, but it triggers the issue quite
> frequently.
> 
> The second approach is much more reliable for me. I basically start compiling
> a kernel inside the guest (say, with make -j4) and start moving the mouse (or
> generate keyboard interrupts by pressing keys randomly). In this case, gcc
> processes will randomly receive SEGFAULT.
> 
> Important note: if I do not touch the mouse or keyboard in both cases above,
> everything works as normal.
> 
> My initial guess was that this might have something to do with kstack
> randomization, but booting the host kernel with randomize_kstack_offset=0 does
> not seem to change anything in this regard.
> 
> I am currently running out of ideas as to what exactly might trigger such
> behavior. Hopefully, this additional info might shed some light.
> 
> Best regards,
> Vadim
> 
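
The "bit errors" figure in the stress-ng output quoted above comes from a memory test that writes known data and reads it back. The Python sketch below shows the general idea only: compare what was written with what is read back and count the differing bits. It is not the stress-ng implementation, and the function name and test pattern are hypothetical.

```python
def count_bit_errors(expected, observed):
    """Count the bits that differ between two equal-length byte strings."""
    if len(expected) != len(observed):
        raise ValueError("buffers must have the same length")
    # XOR leaves a 1 in every bit position where the two bytes disagree.
    return sum(bin(a ^ b).count("1") for a, b in zip(expected, observed))


# Hypothetical example: a 16-byte test pattern with three flipped bits.
pattern = bytes([0xAA]) * 16
readback = bytearray(pattern)
readback[3] ^= 0x01   # one flipped bit
readback[9] ^= 0x82   # two flipped bits

print(count_bit_errors(pattern, bytes(readback)))
```

A healthy guest should report zero differences, so any nonzero count means that memory contents changed without the test program writing to them.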

Vadim,

I had an extended E-mail interchange with Jason Donenfeld over this issue. Sorry 
that most of this was private because some large files needed to be transmitted 
that were not appropriate for LKML. LKML is added back in to this reply.

My test for the fault was to start a VM running Windows 10 and use Edge to load 
the VirtualBox web page. Usually within a few seconds, Edge or Windows would 
crash. In the latter case, the log for the VM might show an unhandled exception 
while in kernel mode. I thought the browser was hitting the random number 
generator hard, but there is mouse activity, of course.

Jason has created a patch entitled "random: do not use input pool from hard 
IRQs" that fixes the problem for me. It can be found at 
https://lore.kernel.org/lkml/20220510140025.81168-1-Jason@zx2c4.com/. I had 
expected this patch to be merged into the mainline kernel by now. Jason should 
be able to shed light on any delays.

The bottom line and good news for Oracle/VirtualBox and those of us that package 
VB for distros is that this is a kernel regression - which is a conclusion I 
hesitated to make earlier. It is not a problem with VirtualBox, VB just exposes 
the kernel problem.

I certainly hope that this problem is fixed before 5.18 is released. If not, I 
will need to campaign to prevent openSUSE Tumbleweed from switching to 5.18. 
That would normally happen with the release of 5.18.1!

Larry


