* [Bug 209079] New: CPU 0/KVM: page allocation failure on 5.8 kernel
From: bugzilla-daemon @ 2020-08-30 15:22 UTC
  To: kvm

https://bugzilla.kernel.org/show_bug.cgi?id=209079

            Bug ID: 209079
           Summary: CPU 0/KVM: page allocation failure on 5.8 kernel
           Product: Virtualization
           Version: unspecified
    Kernel Version: 5.8.5-arch1-1
          Hardware: All
                OS: Linux
              Tree: Mainline
            Status: NEW
          Severity: normal
          Priority: P1
         Component: kvm
          Assignee: virtualization_kvm@kernel-bugs.osdl.org
          Reporter: kernel@martin.schrodt.org
        Regression: No

My KVM VM won't start on the current 5.8 kernel. It fails with:

> internal error: qemu unexpectedly closed the monitor:
> 2020-08-30T15:16:10.389012Z qemu-system-x86_64: kvm_init_vcpu failed: Cannot
> allocate memory

The same VM works fine on a 5.7 kernel. I also tried an earlier 5.8 kernel,
with the same outcome.

dmesg shows the following:

[Sun Aug 30 17:16:09 2020] CPU 0/KVM: page allocation failure: order:0, mode:0x400cc4(GFP_KERNEL_ACCOUNT|GFP_DMA32), nodemask=(null),cpuset=emulator,mems_allowed=1
[Sun Aug 30 17:16:09 2020] CPU: 11 PID: 16473 Comm: CPU 0/KVM Tainted: P           OE     5.8.5-arch1-1 #1
[Sun Aug 30 17:16:09 2020] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./X399 Phantom Gaming 6, BIOS P1.10 11/15/2018
[Sun Aug 30 17:16:09 2020] Call Trace:
[Sun Aug 30 17:16:09 2020]  dump_stack+0x6b/0x88
[Sun Aug 30 17:16:09 2020]  warn_alloc.cold+0x78/0xdc
[Sun Aug 30 17:16:09 2020]  __alloc_pages_slowpath.constprop.0+0xd14/0xd50
[Sun Aug 30 17:16:09 2020]  __alloc_pages_nodemask+0x2e4/0x310
[Sun Aug 30 17:16:09 2020]  alloc_mmu_pages+0x27/0x90 [kvm]
[Sun Aug 30 17:16:09 2020]  kvm_mmu_create+0x100/0x140 [kvm]
[Sun Aug 30 17:16:09 2020]  kvm_arch_vcpu_create+0x48/0x360 [kvm]
[Sun Aug 30 17:16:09 2020]  kvm_vm_ioctl+0xa2d/0xe60 [kvm]
[Sun Aug 30 17:16:09 2020]  ksys_ioctl+0x82/0xc0
[Sun Aug 30 17:16:09 2020]  __x64_sys_ioctl+0x16/0x20
[Sun Aug 30 17:16:09 2020]  do_syscall_64+0x44/0x70
[Sun Aug 30 17:16:09 2020]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[Sun Aug 30 17:16:09 2020] RIP: 0033:0x7f4e8ba7cf6b
[Sun Aug 30 17:16:09 2020] Code: 89 d8 49 8d 3c 1c 48 f7 d8 49 39 c4 72 b5 e8 1c ff ff ff 85 c0 78 ba 4c 89 e0 5b 5d 41 5c c3 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d d5 ae 0c 00 f7 d8 64 89 01 48
[Sun Aug 30 17:16:09 2020] RSP: 002b:00007f4e6bffe6b8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[Sun Aug 30 17:16:09 2020] RAX: ffffffffffffffda RBX: 000000000000ae41 RCX: 00007f4e8ba7cf6b
[Sun Aug 30 17:16:09 2020] RDX: 0000000000000000 RSI: 000000000000ae41 RDI: 0000000000000019
[Sun Aug 30 17:16:09 2020] RBP: 0000563079828020 R08: 0000000000000000 R09: 0000563079844010
[Sun Aug 30 17:16:09 2020] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[Sun Aug 30 17:16:09 2020] R13: 00007fff52f4494f R14: 0000000000000000 R15: 00007f4e6bfff640
[Sun Aug 30 17:16:09 2020] Mem-Info:
[Sun Aug 30 17:16:09 2020] active_anon:414866 inactive_anon:28099 isolated_anon:0
                            active_file:31776 inactive_file:88136 isolated_file:0
                            unevictable:32 dirty:521 writeback:0
                            slab_reclaimable:19827 slab_unreclaimable:137048
                            mapped:142120 shmem:28302 pagetables:6905 bounce:0
                            free:6992691 free_pcp:4628 free_cma:0
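
The first line of the trace carries the facts that matter later in the thread: the allocation order, the GFP flags, and the cpuset constraint. The following is a minimal Python sketch for pulling those fields out of such a line. The helper name `parse_alloc_failure` is hypothetical, the regular expression is fitted to the line shown in this report only, and the size calculation assumes 4 KiB base pages.

```python
import re

# Fitted to the "page allocation failure" line quoted in this report;
# other kernel versions may format the line differently.
FAILURE_RE = re.compile(
    r"page allocation failure: order:(?P<order>\d+),\s*"
    r"mode:(?P<mode>0x[0-9a-fA-F]+)\((?P<flags>[^)]*)\),\s*"
    r"nodemask=(?P<nodemask>[^,]+),cpuset=(?P<cpuset>[^,]+),"
    r"mems_allowed=(?P<mems_allowed>\S+)"
)

PAGE_SIZE = 4096  # assumption: 4 KiB base pages


def parse_alloc_failure(line):
    """Return the fields of a page allocation failure line, or None."""
    m = FAILURE_RE.search(line)
    if m is None:
        return None
    order = int(m.group("order"))
    return {
        "order": order,
        # An order-n allocation asks for 2**n contiguous pages.
        "size_bytes": PAGE_SIZE << order,
        "mode": int(m.group("mode"), 16),
        "flags": m.group("flags").split("|"),
        "nodemask": m.group("nodemask"),
        "cpuset": m.group("cpuset"),
        "mems_allowed": m.group("mems_allowed"),
    }


# The line from the report above, with the mail line wrapping removed.
LINE = (
    "[Sun Aug 30 17:16:09 2020] CPU 0/KVM: page allocation failure: order:0, "
    "mode:0x400cc4(GFP_KERNEL_ACCOUNT|GFP_DMA32), "
    "nodemask=(null),cpuset=emulator,mems_allowed=1"
)

info = parse_alloc_failure(LINE)
print(info)
```

For this report the sketch yields a single-page request carrying GFP_DMA32, issued from the cpuset `emulator`, which is allowed to use memory node 1 only.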

The system is a Threadripper 1920X on an ASRock X399 Phantom Gaming 6 board
with 32 GB RAM. It is a NUMA architecture with 2 nodes:

➜  numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 12 13 14 15 16 17
node 0 size: 15966 MB
node 0 free: 12860 MB
node 1 cpus: 6 7 8 9 10 11 18 19 20 21 22 23
node 1 size: 16112 MB
node 1 free: 14559 MB
node distances:
node   0   1 
  0:  10  16 
  1:  16  10 

The VM is configured to only allocate memory on node 1.

Happy to provide more information!

-- 
You are receiving this mail because:
You are watching the assignee of the bug.


* [Bug 209079] CPU 0/KVM: page allocation failure on 5.8 kernel
From: bugzilla-daemon @ 2020-09-09  6:00 UTC
  To: kvm

https://bugzilla.kernel.org/show_bug.cgi?id=209079

Wanpeng Li (wanpeng.li@hotmail.com) changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |wanpeng.li@hotmail.com

--- Comment #1 from Wanpeng Li (wanpeng.li@hotmail.com) ---
It would be appreciated if you could bisect.


* [Bug 209079] CPU 0/KVM: page allocation failure on 5.8 kernel
From: bugzilla-daemon @ 2020-09-09  6:41 UTC
  To: kvm

https://bugzilla.kernel.org/show_bug.cgi?id=209079

Sean Christopherson (sean.j.christopherson@intel.com) changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |sean.j.christopherson@intel
                   |                            |.com

--- Comment #2 from Sean Christopherson (sean.j.christopherson@intel.com) ---
Are you disabling NPT (via KVM module param)?  You're obviously running a
64-bit kernel, and presumably that CPU supports NPT, so the only way KVM should
reach the failing allocation is if NPT is being explicitly disabled.  There's
nothing wrong with using shadow paging, it's just uncommon these days.

NPT aside, the interesting part of the failing allocation is that it uses
GFP_DMA32.  I did a quick test to force that allocation on my system and
nothing exploded.  Odds are good the bug is outside of KVM, which means a
bisection is probably necessary.


* [Bug 209079] CPU 0/KVM: page allocation failure on 5.8 kernel
From: bugzilla-daemon @ 2020-09-10 15:33 UTC
  To: kvm

https://bugzilla.kernel.org/show_bug.cgi?id=209079

Martin Schrodt (kernel@martin.schrodt.org) changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|---                         |OBSOLETE

--- Comment #3 from Martin Schrodt (kernel@martin.schrodt.org) ---
Damn. 

I made some changes to the VM in the last few days to make it support AVIC,
and that made me change the kvm module parameters without remembering what
they were before. They are now:

> options kvm ignore_msrs=1 report_ignored_msrs=0
> options kvm_amd nested=0 avic=1 npt=1

After Sean's post mentioning that NPT has to be disabled for the bug to occur,
I updated the kernel again (to 5.8.7), and voilà, the VM works.
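
The quoted lines use the modprobe.d `options` syntax, so the configured value of `npt` can be checked mechanically. The following is a minimal Python sketch, not something taken from the thread: the helper names `parse_modprobe_options` and `npt_disabled` are hypothetical, and it only looks at the configured text, not at what the loaded module is actually using.

```python
def parse_modprobe_options(text):
    """Parse modprobe.d 'options <module> key=value ...' lines into a dict."""
    params = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        parts = line.split()
        if len(parts) < 2 or parts[0] != "options":
            continue
        # modprobe treats '-' and '_' in module names as equivalent.
        module = parts[1].replace("-", "_")
        module_params = params.setdefault(module, {})
        for token in parts[2:]:
            key, sep, value = token.partition("=")
            module_params[key] = value if sep else ""
    return params


def npt_disabled(params):
    """True if kvm_amd is explicitly configured with npt=0."""
    return params.get("kvm_amd", {}).get("npt") == "0"


# The two lines quoted in the comment above.
CONFIG = """\
options kvm ignore_msrs=1 report_ignored_msrs=0
options kvm_amd nested=0 avic=1 npt=1
"""

params = parse_modprobe_options(CONFIG)
print(params["kvm_amd"], npt_disabled(params))
```

On a running system, the value in effect can usually be read from `/sys/module/kvm_amd/parameters/npt`, assuming the kvm_amd module is loaded and exposes that parameter.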

So I have to conclude that it really was disabled before, but I can't remember
why I did so. Maybe it was because of some bug that only existed when I set up
the VM sometime in 2018.

Regarding GFP_DMA32, I don't know what it really means. It might be related to
me passing through a GPU, an NVMe drive and a USB controller to the VM.

So I guess I'll leave learning how to bisect to my next incident...

Thank you guys for all the work you do - Linux forever!


* [Bug 209079] CPU 0/KVM: page allocation failure on 5.8 kernel
From: bugzilla-daemon @ 2020-09-10 16:21 UTC
  To: kvm

https://bugzilla.kernel.org/show_bug.cgi?id=209079

--- Comment #4 from Sean Christopherson (sean.j.christopherson@intel.com) ---
GFP_DMA32 is a flag that forces a memory allocation to use physical memory that
is 32-bit addressable, i.e. below the 4 GiB boundary.  Using GFP_DMA32 is
relatively uncommon, e.g. KVM uses that flag if and only if KVM is using or
shadowing 32-bit PAE paging.  The latter case (shadowing) is what is triggered
if NPT is disabled.

Can you try running with "kvm_amd nested=0 avic=1 npt=0" and/or "kvm_amd
nested=0 npt=0" on v5.8.7?  I'd like to at least confirm that whatever was
breaking your setup was fixed between v5.8.0 and v5.8.7, even if we don't
bisect to identify exactly what patch fixed the bug.


* [Bug 209079] CPU 0/KVM: page allocation failure on 5.8 kernel
From: bugzilla-daemon @ 2020-09-10 21:03 UTC
  To: kvm

https://bugzilla.kernel.org/show_bug.cgi?id=209079

--- Comment #5 from Martin Schrodt (kernel@martin.schrodt.org) ---
Strange things happen sometimes...

What I did (I only unloaded and reloaded the module after config changes,
hoping this would suffice):

- running with "kvm_amd nested=0 avic=1 npt=0" and "kvm_amd nested=0 npt=0" on
5.8.7, all working fine.

- rolling back to the 5.8.5 kernel I had the bug with, and trying the above
combinations -> working fine

- rolling the VM back to a state before changing it to AVIC (reasonably sure
it's the same) -> working fine, on both 5.8.7 and 5.8.5.

Heisenbugs, here they come.

Trying to come up with things that I changed since then but did not roll back
yet:

I have a qemu hook, which did the following: 

1) drop caches, 
2) compact memory 
3) create a cpuset for the host and move all tasks there to free the cores
assigned to the VM (which included a flag for memory migration, so that the
processes would have their memory moved to the non VM node) 
4) then let qemu allocate memory

Since then I changed this to run the compaction step after the moving step. My
thought was that *after* moving the memory from node 1 to node 0 there is more
free space on node 1, so compaction should yield better results.
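
The two hook orderings described above can be sketched as follows. This is a minimal Python illustration, not the reporter's actual hook: the function names are hypothetical, the cpuset step is left as a caller-supplied callable because the thread does not show how it was implemented, and `/proc/sys/vm/drop_caches` and `/proc/sys/vm/compact_memory` are the standard sysctl files for the first two steps (writing to them requires root).

```python
from typing import Callable, List, Tuple

Step = Tuple[str, Callable[[], None]]


def write_value(path: str, value: str) -> None:
    """Write a value to a procfs/sysfs control file."""
    with open(path, "w") as f:
        f.write(value)


def build_steps(move_host_tasks: Callable[[], None],
                compact_after_move: bool = True) -> List[Step]:
    """Return the pre-start hook steps in the requested order."""
    drop = ("drop caches",
            lambda: write_value("/proc/sys/vm/drop_caches", "3"))
    compact = ("compact memory",
               lambda: write_value("/proc/sys/vm/compact_memory", "1"))
    # Placeholder for the cpuset setup and task migration step.
    move = ("move host tasks to host cpuset", move_host_tasks)
    if compact_after_move:
        return [drop, move, compact]  # the changed order
    return [drop, compact, move]      # the original order


def run(steps: List[Step], dry_run: bool = True) -> List[str]:
    """Run the steps in order; with dry_run only report their names."""
    names = []
    for name, action in steps:
        names.append(name)
        if not dry_run:
            action()
    return names


new_order = run(build_steps(lambda: None, compact_after_move=True))
old_order = run(build_steps(lambda: None, compact_after_move=False))
print(old_order)
print(new_order)
```

The dry run only lists the step names, which makes it easy to compare the two orderings without touching the system.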

Does the error I initially got say anything about *why* the allocation failed?


* [Bug 209079] CPU 0/KVM: page allocation failure on 5.8 kernel
From: bugzilla-daemon @ 2020-09-11 16:19 UTC
  To: kvm

https://bugzilla.kernel.org/show_bug.cgi?id=209079

--- Comment #6 from Sean Christopherson (sean.j.christopherson@intel.com) ---
Nope, the failure path is common so we can't even glean anything from the
offsets in the stack trace.

In your data dump, both nodes show 10 GB+ of free memory so there's plenty of
space for the measly 4 KiB that KVM is trying to allocate.  My best guess is that
the combination of nodemask/cpuset stuff resulted in a set of constraints that
were impossible to satisfy.
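
One way to probe that guess is to check which NUMA nodes have populated low-memory zones at all. The following is a minimal Python sketch that parses text in the format of `/proc/zoneinfo`. It rests on several assumptions: the helper names are hypothetical, the sample text is illustrative and not taken from the reporter's machine, the exact zoneinfo layout varies between kernel versions, and the thread never confirms that this was the cause. The idea is that a GFP_DMA32 request can only be served from the DMA32 or DMA zone, so if the cpuset allows only a node without such memory, the request could not succeed no matter how much memory is free.

```python
import re

ZONE_HEADER = re.compile(r"^Node (\d+), zone\s+(\S+)")
PRESENT = re.compile(r"^\s*present\s+(\d+)")
LOW_ZONES = {"DMA", "DMA32"}  # zones that can serve a GFP_DMA32 request


def populated_zones(zoneinfo_text):
    """Return the set of (node, zone) pairs that have present pages."""
    present = {}
    current = None
    for line in zoneinfo_text.splitlines():
        m = ZONE_HEADER.match(line)
        if m:
            current = (int(m.group(1)), m.group(2))
            present.setdefault(current, 0)
            continue
        m = PRESENT.match(line)
        if m and current is not None:
            present[current] = int(m.group(1))
    return {key for key, pages in present.items() if pages > 0}


def nodes_with_low_memory(zoneinfo_text):
    """Return the sorted node ids that have a populated DMA or DMA32 zone."""
    return sorted({node for node, zone in populated_zones(zoneinfo_text)
                   if zone in LOW_ZONES})


# Illustrative sample only; the numbers are made up for this sketch.
SAMPLE = """\
Node 0, zone      DMA
        present  3998
Node 0, zone    DMA32
        present  780000
Node 0, zone   Normal
        present  3300000
Node 1, zone    DMA32
        present  0
Node 1, zone   Normal
        present  4100000
"""

low_nodes = nodes_with_low_memory(SAMPLE)
mems_allowed = {1}  # as in the failure line: mems_allowed=1
print(low_nodes, bool(mems_allowed & set(low_nodes)))
```

On a live system the same function can be fed the contents of `/proc/zoneinfo`. An empty intersection between the allowed nodes and the nodes with low memory would be consistent with the guess above, but would not prove it.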

At this point, I'd say just chalk it up to a bad configuration unless you want
to pursue this further.  If there's a kernel bug lurking then odds are someone
will run into it again.


* [Bug 209079] CPU 0/KVM: page allocation failure on 5.8 kernel
From: bugzilla-daemon @ 2020-09-20  9:17 UTC
  To: kvm

https://bugzilla.kernel.org/show_bug.cgi?id=209079

--- Comment #7 from Martin Schrodt (kernel@martin.schrodt.org) ---
Fully agree. Thanks for your assistance!
