* kvm splat in mmu_spte_clear_track_bits @ 2017-08-20 23:13 Adam Borowski 2017-08-21 1:26 ` Wanpeng Li 0 siblings, 1 reply; 29+ messages in thread From: Adam Borowski @ 2017-08-20 23:13 UTC (permalink / raw) To: Paolo Bonzini, Radim Krčmář, kvm, linux-kernel Hi! I'm afraid I keep getting a quite reliable, but random, splat when running KVM: ------------[ cut here ]------------ WARNING: CPU: 5 PID: 5826 at arch/x86/kvm/mmu.c:717 mmu_spte_clear_track_bits+0x123/0x170 Modules linked in: tun nbd arc4 rtl8xxxu mac80211 cfg80211 rfkill nouveau video ttm CPU: 5 PID: 5826 Comm: qemu-system-x86 Not tainted 4.13.0-rc5-vanilla-ubsan-00211-g7f680d7ec315 #1 Hardware name: System manufacturer System Product Name/M4A77T, BIOS 2401 05/18/2011 task: ffff880207ef0400 task.stack: ffffc900035e4000 RIP: 0010:mmu_spte_clear_track_bits+0x123/0x170 RSP: 0018:ffffc900035e7ab0 EFLAGS: 00010246 RAX: 0000000000000000 RBX: 000000010501cc67 RCX: 0000000000000001 RDX: dead0000000000ff RSI: ffff88020e501df8 RDI: 0000000004140700 RBP: ffffc900035e7ad8 R08: 0000000000000100 R09: 0000000000000003 R10: 0000000000000003 R11: 0000000000000005 R12: 000000000010501c R13: ffffea0004140700 R14: ffff88020e1d0000 R15: 0000000000000000 FS: 00007f0213fbd700(0000) GS:ffff88022fd40000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000000 CR3: 000000022187f000 CR4: 00000000000006e0 Call Trace: drop_spte+0x26/0x130 mmu_page_zap_pte+0xc4/0x160 kvm_mmu_prepare_zap_page+0x65/0x660 kvm_mmu_invalidate_zap_all_pages+0xc5/0x1f0 kvm_mmu_invalidate_zap_pages_in_memslot+0x9/0x10 kvm_page_track_flush_slot+0x86/0xd0 kvm_arch_flush_shadow_memslot+0x9/0x10 __kvm_set_memory_region+0x8fb/0x14f0 kvm_set_memory_region+0x2f/0x50 kvm_vm_ioctl+0x559/0xcc0 ? kvm_vcpu_ioctl+0x171/0x620 ? __switch_to+0x30b/0x740 do_vfs_ioctl+0xbb/0x8d0 ? find_vma+0x23/0x100 ? __fget_light+0x94/0x110 SyS_ioctl+0x86/0xa0 entry_SYSCALL_64_fastpath+0x17/0x98 RIP: 0033:0x7f021c80ddc7 RSP: 002b:00007f0213fbc518 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f021c80ddc7 RDX: 00007f0213fbc5b0 RSI: 000000004020ae46 RDI: 000000000000000a RBP: 0000000000000000 R08: 00007f020c1698a0 R09: 0000000000000000 R10: 00007f020c1698a0 R11: 0000000000000246 R12: 0000000000000006 R13: 00007f022201c000 R14: 0000000000000002 R15: 0000558c3899e550 Code: ae fc 01 48 85 c0 75 1c 4c 89 e7 e8 98 de fd ff 48 8b 05 81 ae fc 01 48 85 c0 74 ba 48 85 c3 0f 95 c3 eb b8 48 85 c3 74 e7 eb dd <0f> ff eb 97 4c 89 e7 66 0f 1f 44 00 00 e8 6b de fd ff eb 97 31 ---[ end trace 16c196134f0dd0a9 ]--- After this, there are hundreds of repeats and lots of secondary damage which kills the host quickly. Usually this happens within a few minutes, but sometimes it takes ~half an hour to reproduce. Because of this, it'd be unpleasant to bisect -- is this problem already known? Meow! -- ⢀⣴⠾⠻⢶⣦⠀ ⣾⠁⢰⠒⠀⣿⡁ Vat kind uf sufficiently advanced technology iz dis!? ⢿⡄⠘⠷⠚⠋⠀ -- Genghis Ht'rok'din ⠈⠳⣄⠀⠀⠀⠀ ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: kvm splat in mmu_spte_clear_track_bits 2017-08-20 23:13 kvm splat in mmu_spte_clear_track_bits Adam Borowski @ 2017-08-21 1:26 ` Wanpeng Li 2017-08-21 19:12 ` Adam Borowski 0 siblings, 1 reply; 29+ messages in thread From: Wanpeng Li @ 2017-08-21 1:26 UTC (permalink / raw) To: Adam Borowski Cc: Paolo Bonzini, Radim Krčmář, kvm, linux-kernel 2017-08-21 7:13 GMT+08:00 Adam Borowski <kilobyte@angband.pl>: > Hi! > I'm afraid I keep getting a quite reliable, but random, splat when running > KVM: I reported something similar before. https://lkml.org/lkml/2017/6/29/64 Regards, Wanpeng Li > > ------------[ cut here ]------------ > WARNING: CPU: 5 PID: 5826 at arch/x86/kvm/mmu.c:717 mmu_spte_clear_track_bits+0x123/0x170 > Modules linked in: tun nbd arc4 rtl8xxxu mac80211 cfg80211 rfkill nouveau video ttm > CPU: 5 PID: 5826 Comm: qemu-system-x86 Not tainted 4.13.0-rc5-vanilla-ubsan-00211-g7f680d7ec315 #1 > Hardware name: System manufacturer System Product Name/M4A77T, BIOS 2401 05/18/2011 > task: ffff880207ef0400 task.stack: ffffc900035e4000 > RIP: 0010:mmu_spte_clear_track_bits+0x123/0x170 > RSP: 0018:ffffc900035e7ab0 EFLAGS: 00010246 > RAX: 0000000000000000 RBX: 000000010501cc67 RCX: 0000000000000001 > RDX: dead0000000000ff RSI: ffff88020e501df8 RDI: 0000000004140700 > RBP: ffffc900035e7ad8 R08: 0000000000000100 R09: 0000000000000003 > R10: 0000000000000003 R11: 0000000000000005 R12: 000000000010501c > R13: ffffea0004140700 R14: ffff88020e1d0000 R15: 0000000000000000 > FS: 00007f0213fbd700(0000) GS:ffff88022fd40000(0000) knlGS:0000000000000000 > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > CR2: 0000000000000000 CR3: 000000022187f000 CR4: 00000000000006e0 > Call Trace: > drop_spte+0x26/0x130 > mmu_page_zap_pte+0xc4/0x160 > kvm_mmu_prepare_zap_page+0x65/0x660 > kvm_mmu_invalidate_zap_all_pages+0xc5/0x1f0 > kvm_mmu_invalidate_zap_pages_in_memslot+0x9/0x10 > kvm_page_track_flush_slot+0x86/0xd0 > kvm_arch_flush_shadow_memslot+0x9/0x10 > __kvm_set_memory_region+0x8fb/0x14f0 > kvm_set_memory_region+0x2f/0x50 > kvm_vm_ioctl+0x559/0xcc0 > ? kvm_vcpu_ioctl+0x171/0x620 > ? __switch_to+0x30b/0x740 > do_vfs_ioctl+0xbb/0x8d0 > ? find_vma+0x23/0x100 > ? __fget_light+0x94/0x110 > SyS_ioctl+0x86/0xa0 > entry_SYSCALL_64_fastpath+0x17/0x98 > RIP: 0033:0x7f021c80ddc7 > RSP: 002b:00007f0213fbc518 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 > RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f021c80ddc7 > RDX: 00007f0213fbc5b0 RSI: 000000004020ae46 RDI: 000000000000000a > RBP: 0000000000000000 R08: 00007f020c1698a0 R09: 0000000000000000 > R10: 00007f020c1698a0 R11: 0000000000000246 R12: 0000000000000006 > R13: 00007f022201c000 R14: 0000000000000002 R15: 0000558c3899e550 > Code: ae fc 01 48 85 c0 75 1c 4c 89 e7 e8 98 de fd ff 48 8b 05 81 ae fc 01 48 85 c0 74 ba 48 85 c3 0f 95 c3 eb b8 48 85 c3 74 e7 eb dd <0f> ff eb 97 4c 89 e7 66 0f 1f 44 00 00 e8 6b de fd ff eb 97 31 > ---[ end trace 16c196134f0dd0a9 ]--- > > After this, there are hundreds of repeats and lots of secondary damage which > kills the host quickly. > > Usually this happens within a few minutes, but sometimes it takes ~half an > hour to reproduce. Because of this, it'd be unpleasant to bisect -- is this > problem already known? > > > Meow! > -- > ⢀⣴⠾⠻⢶⣦⠀ > ⣾⠁⢰⠒⠀⣿⡁ Vat kind uf sufficiently advanced technology iz dis!? > ⢿⡄⠘⠷⠚⠋⠀ -- Genghis Ht'rok'din > ⠈⠳⣄⠀⠀⠀⠀ ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: kvm splat in mmu_spte_clear_track_bits 2017-08-21 1:26 ` Wanpeng Li @ 2017-08-21 19:12 ` Adam Borowski 2017-08-21 19:58 ` Radim Krčmář 0 siblings, 1 reply; 29+ messages in thread From: Adam Borowski @ 2017-08-21 19:12 UTC (permalink / raw) To: Wanpeng Li; +Cc: Paolo Bonzini, Radim Krčmář, kvm, linux-kernel On Mon, Aug 21, 2017 at 09:26:57AM +0800, Wanpeng Li wrote: > 2017-08-21 7:13 GMT+08:00 Adam Borowski <kilobyte@angband.pl>: > > Hi! > > I'm afraid I keep getting a quite reliable, but random, splat when running > > KVM: > > I reported something similar before. https://lkml.org/lkml/2017/6/29/64 Your problem seems to require OOM; I don't have any memory pressure at all: running a single 2GB guest while there's nothing big on the host (bloatfox, xfce, xorg, terminals + some minor junk); 8GB + (untouched) swap. There's no memory pressure inside the guest either -- none was Linux (I wanted to test something on hurd, kfreebsd) and I doubt they even got to use all of their frames. Also, it doesn't reproduce for me on 4.12. > > ------------[ cut here ]------------ > > WARNING: CPU: 5 PID: 5826 at arch/x86/kvm/mmu.c:717 mmu_spte_clear_track_bits+0x123/0x170 > > Modules linked in: tun nbd arc4 rtl8xxxu mac80211 cfg80211 rfkill nouveau video ttm > > CPU: 5 PID: 5826 Comm: qemu-system-x86 Not tainted 4.13.0-rc5-vanilla-ubsan-00211-g7f680d7ec315 #1 > > Hardware name: System manufacturer System Product Name/M4A77T, BIOS 2401 05/18/2011 > > task: ffff880207ef0400 task.stack: ffffc900035e4000 > > RIP: 0010:mmu_spte_clear_track_bits+0x123/0x170 > > RSP: 0018:ffffc900035e7ab0 EFLAGS: 00010246 > > RAX: 0000000000000000 RBX: 000000010501cc67 RCX: 0000000000000001 > > RDX: dead0000000000ff RSI: ffff88020e501df8 RDI: 0000000004140700 > > RBP: ffffc900035e7ad8 R08: 0000000000000100 R09: 0000000000000003 > > R10: 0000000000000003 R11: 0000000000000005 R12: 000000000010501c > > R13: ffffea0004140700 R14: ffff88020e1d0000 R15: 0000000000000000 > > FS: 00007f0213fbd700(0000) GS:ffff88022fd40000(0000) knlGS:0000000000000000 > > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > > CR2: 0000000000000000 CR3: 000000022187f000 CR4: 00000000000006e0 > > Call Trace: > > drop_spte+0x26/0x130 > > mmu_page_zap_pte+0xc4/0x160 > > kvm_mmu_prepare_zap_page+0x65/0x660 > > kvm_mmu_invalidate_zap_all_pages+0xc5/0x1f0 > > kvm_mmu_invalidate_zap_pages_in_memslot+0x9/0x10 > > kvm_page_track_flush_slot+0x86/0xd0 > > kvm_arch_flush_shadow_memslot+0x9/0x10 > > __kvm_set_memory_region+0x8fb/0x14f0 > > kvm_set_memory_region+0x2f/0x50 > > kvm_vm_ioctl+0x559/0xcc0 > > ? kvm_vcpu_ioctl+0x171/0x620 > > ? __switch_to+0x30b/0x740 > > do_vfs_ioctl+0xbb/0x8d0 > > ? find_vma+0x23/0x100 > > ? __fget_light+0x94/0x110 > > SyS_ioctl+0x86/0xa0 > > entry_SYSCALL_64_fastpath+0x17/0x98 > > RIP: 0033:0x7f021c80ddc7 > > RSP: 002b:00007f0213fbc518 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 > > RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f021c80ddc7 > > RDX: 00007f0213fbc5b0 RSI: 000000004020ae46 RDI: 000000000000000a > > RBP: 0000000000000000 R08: 00007f020c1698a0 R09: 0000000000000000 > > R10: 00007f020c1698a0 R11: 0000000000000246 R12: 0000000000000006 > > R13: 00007f022201c000 R14: 0000000000000002 R15: 0000558c3899e550 > > Code: ae fc 01 48 85 c0 75 1c 4c 89 e7 e8 98 de fd ff 48 8b 05 81 ae fc 01 48 85 c0 74 ba 48 85 c3 0f 95 c3 eb b8 48 85 c3 74 e7 eb dd <0f> ff eb 97 4c 89 e7 66 0f 1f 44 00 00 e8 6b de fd ff eb 97 31 > > ---[ end trace 16c196134f0dd0a9 ]--- > > > > After this, there are hundreds of repeats and lots of secondary damage which > > kills the host quickly. > > > > Usually this happens within a few minutes, but sometimes it takes ~half an > > hour to reproduce. Because of this, it'd be unpleasant to bisect -- is this > > problem already known? -- ⢀⣴⠾⠻⢶⣦⠀ ⣾⠁⢰⠒⠀⣿⡁ Vat kind uf sufficiently advanced technology iz dis!? ⢿⡄⠘⠷⠚⠋⠀ -- Genghis Ht'rok'din ⠈⠳⣄⠀⠀⠀⠀ ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: kvm splat in mmu_spte_clear_track_bits 2017-08-21 19:12 ` Adam Borowski @ 2017-08-21 19:58 ` Radim Krčmář 2017-08-21 22:32 ` Adam Borowski 0 siblings, 1 reply; 29+ messages in thread From: Radim Krčmář @ 2017-08-21 19:58 UTC (permalink / raw) To: Adam Borowski; +Cc: Wanpeng Li, Paolo Bonzini, kvm, linux-kernel 2017-08-21 21:12+0200, Adam Borowski: > On Mon, Aug 21, 2017 at 09:26:57AM +0800, Wanpeng Li wrote: > > 2017-08-21 7:13 GMT+08:00 Adam Borowski <kilobyte@angband.pl>: > > > Hi! > > > I'm afraid I keep getting a quite reliable, but random, splat when running > > > KVM: > > > > I reported something similar before. https://lkml.org/lkml/2017/6/29/64 > > Your problem seems to require OOM; I don't have any memory pressure at all: > running a single 2GB guest while there's nothing big on the host (bloatfox, > xfce, xorg, terminals + some minor junk); 8GB + (untouched) swap. There's > no memory pressure inside the guest either -- none was Linux (I wanted to > test something on hurd, kfreebsd) and I doubt they even got to use all of > their frames. I even tried hurd, but couldn't reproduce ... what is your qemu command line and the output of host's `grep . /sys/module/kvm*/parameters/*`? > Also, it doesn't reproduce for me on 4.12. Great info ... the most suspicious between v4.12 and v4.13-rc5 is the series with dcdca5fed5f6 ("x86: kvm: mmu: make spte mmio mask more explicit"), does reverting it help? `git revert ce00053b1cfca312c22e2a6465451f1862561eab~1..995f00a619584e65e53eff372d9b73b121a7bad5` Thanks. ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: kvm splat in mmu_spte_clear_track_bits 2017-08-21 19:58 ` Radim Krčmář @ 2017-08-21 22:32 ` Adam Borowski 2017-08-23 12:22 ` Paolo Bonzini 0 siblings, 1 reply; 29+ messages in thread From: Adam Borowski @ 2017-08-21 22:32 UTC (permalink / raw) To: Radim Krčmář; +Cc: Wanpeng Li, Paolo Bonzini, kvm, linux-kernel On Mon, Aug 21, 2017 at 09:58:34PM +0200, Radim Krčmář wrote: > 2017-08-21 21:12+0200, Adam Borowski: > > On Mon, Aug 21, 2017 at 09:26:57AM +0800, Wanpeng Li wrote: > > > 2017-08-21 7:13 GMT+08:00 Adam Borowski <kilobyte@angband.pl>: > > > > I'm afraid I keep getting a quite reliable, but random, splat when running > > > > KVM: > > > > > > I reported something similar before. https://lkml.org/lkml/2017/6/29/64 > > > > Your problem seems to require OOM; I don't have any memory pressure at all: > > running a single 2GB guest while there's nothing big on the host (bloatfox, > > xfce, xorg, terminals + some minor junk); 8GB + (untouched) swap. There's > > no memory pressure inside the guest either -- none was Linux (I wanted to > > test something on hurd, kfreebsd) and I doubt they even got to use all of > > their frames. > > I even tried hurd, but couldn't reproduce ... Also happens with a win10 guest, and with multiple Linuxes. > what is your qemu command > line and the output of host's `grep . /sys/module/kvm*/parameters/*`? qemu-system-x86_64 -enable-kvm -m 2048 -vga qxl -usbdevice tablet \ -net bridge -net nic \ -drive file="$DISK",cache=writeback,index=0,media=disk,discard=on qemu-system-x86_64 -enable-kvm -m 2048 -vga qxl -usbdevice tablet \ -net bridge -net nic \ -drive file="$DISK",cache=unsafe,index=0,media=disk,discard=on,if=virtio,format=raw /sys/module/kvm/parameters/halt_poll_ns:200000 /sys/module/kvm/parameters/halt_poll_ns_grow:2 /sys/module/kvm/parameters/halt_poll_ns_shrink:0 /sys/module/kvm/parameters/ignore_msrs:N /sys/module/kvm/parameters/kvmclock_periodic_sync:Y /sys/module/kvm/parameters/lapic_timer_advance_ns:0 /sys/module/kvm/parameters/min_timer_period_us:500 /sys/module/kvm/parameters/tsc_tolerance_ppm:250 /sys/module/kvm/parameters/vector_hashing:Y /sys/module/kvm_amd/parameters/avic:0 /sys/module/kvm_amd/parameters/nested:1 /sys/module/kvm_amd/parameters/npt:1 /sys/module/kvm_amd/parameters/vls:0 > > Also, it doesn't reproduce for me on 4.12. > > Great info ... the most suspicious between v4.12 and v4.13-rc5 is the > series with dcdca5fed5f6 ("x86: kvm: mmu: make spte mmio mask more > explicit"), does reverting it help? > > `git revert ce00053b1cfca312c22e2a6465451f1862561eab~1..995f00a619584e65e53eff372d9b73b121a7bad5` Alas, doesn't seem to help. I've first installed a Debian stretch guest, the host survived both the installation and subsequent fooling around. But then I started a win10 guest which splatted as soon as the initial screen. Meow! -- ⢀⣴⠾⠻⢶⣦⠀ ⣾⠁⢰⠒⠀⣿⡁ Vat kind uf sufficiently advanced technology iz dis!? ⢿⡄⠘⠷⠚⠋⠀ -- Genghis Ht'rok'din ⠈⠳⣄⠀⠀⠀⠀ ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: kvm splat in mmu_spte_clear_track_bits 2017-08-21 22:32 ` Adam Borowski @ 2017-08-23 12:22 ` Paolo Bonzini 2017-08-24 7:43 ` Wanpeng Li 0 siblings, 1 reply; 29+ messages in thread From: Paolo Bonzini @ 2017-08-23 12:22 UTC (permalink / raw) To: Adam Borowski, Radim Krčmář; +Cc: Wanpeng Li, kvm, linux-kernel On 22/08/2017 00:32, Adam Borowski wrote: > On Mon, Aug 21, 2017 at 09:58:34PM +0200, Radim Krčmář wrote: >> 2017-08-21 21:12+0200, Adam Borowski: >>> On Mon, Aug 21, 2017 at 09:26:57AM +0800, Wanpeng Li wrote: >>>> 2017-08-21 7:13 GMT+08:00 Adam Borowski <kilobyte@angband.pl>: >>>>> I'm afraid I keep getting a quite reliable, but random, splat when running >>>>> KVM: >>>> >>>> I reported something similar before. https://lkml.org/lkml/2017/6/29/64 >>> >>> Your problem seems to require OOM; I don't have any memory pressure at all: >>> running a single 2GB guest while there's nothing big on the host (bloatfox, >>> xfce, xorg, terminals + some minor junk); 8GB + (untouched) swap. There's >>> no memory pressure inside the guest either -- none was Linux (I wanted to >>> test something on hurd, kfreebsd) and I doubt they even got to use all of >>> their frames. >> >> I even tried hurd, but couldn't reproduce ... > > Also happens with a win10 guest, and with multiple Linuxes. > >> what is your qemu command >> line and the output of host's `grep . /sys/module/kvm*/parameters/*`? > > qemu-system-x86_64 -enable-kvm -m 2048 -vga qxl -usbdevice tablet \ > -net bridge -net nic \ > -drive file="$DISK",cache=writeback,index=0,media=disk,discard=on > > qemu-system-x86_64 -enable-kvm -m 2048 -vga qxl -usbdevice tablet \ > -net bridge -net nic \ > -drive file="$DISK",cache=unsafe,index=0,media=disk,discard=on,if=virtio,format=raw > > /sys/module/kvm/parameters/halt_poll_ns:200000 > /sys/module/kvm/parameters/halt_poll_ns_grow:2 > /sys/module/kvm/parameters/halt_poll_ns_shrink:0 > /sys/module/kvm/parameters/ignore_msrs:N > /sys/module/kvm/parameters/kvmclock_periodic_sync:Y > /sys/module/kvm/parameters/lapic_timer_advance_ns:0 > /sys/module/kvm/parameters/min_timer_period_us:500 > /sys/module/kvm/parameters/tsc_tolerance_ppm:250 > /sys/module/kvm/parameters/vector_hashing:Y > /sys/module/kvm_amd/parameters/avic:0 > /sys/module/kvm_amd/parameters/nested:1 > /sys/module/kvm_amd/parameters/npt:1 > /sys/module/kvm_amd/parameters/vls:0 > >>> Also, it doesn't reproduce for me on 4.12. >> >> Great info ... the most suspicious between v4.12 and v4.13-rc5 is the >> series with dcdca5fed5f6 ("x86: kvm: mmu: make spte mmio mask more >> explicit"), does reverting it help? >> >> `git revert ce00053b1cfca312c22e2a6465451f1862561eab~1..995f00a619584e65e53eff372d9b73b121a7bad5` > > Alas, doesn't seem to help. > > I've first installed a Debian stretch guest, the host survived both the > installation and subsequent fooling around. But then I started a win10 > guest which splatted as soon as the initial screen. Can you check if disabling THP on the host also fixes it for you? I would also try commit 1372324b328cd5dabaef5e345e37ad48c63df2a9 to identify whether it was caused by a KVM change in 4.13 or something else. Paolo ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: kvm splat in mmu_spte_clear_track_bits 2017-08-23 12:22 ` Paolo Bonzini @ 2017-08-24 7:43 ` Wanpeng Li 2017-08-25 13:14 ` Adam Borowski 0 siblings, 1 reply; 29+ messages in thread From: Wanpeng Li @ 2017-08-24 7:43 UTC (permalink / raw) To: Paolo Bonzini Cc: Adam Borowski, Radim Krčmář, kvm, linux-kernel 2017-08-23 20:22 GMT+08:00 Paolo Bonzini <pbonzini@redhat.com>: > On 22/08/2017 00:32, Adam Borowski wrote: >> On Mon, Aug 21, 2017 at 09:58:34PM +0200, Radim Krčmář wrote: >>> 2017-08-21 21:12+0200, Adam Borowski: >>>> On Mon, Aug 21, 2017 at 09:26:57AM +0800, Wanpeng Li wrote: >>>>> 2017-08-21 7:13 GMT+08:00 Adam Borowski <kilobyte@angband.pl>: >>>>>> I'm afraid I keep getting a quite reliable, but random, splat when running >>>>>> KVM: >>>>> >>>>> I reported something similar before. https://lkml.org/lkml/2017/6/29/64 >>>> >>>> Your problem seems to require OOM; I don't have any memory pressure at all: >>>> running a single 2GB guest while there's nothing big on the host (bloatfox, >>>> xfce, xorg, terminals + some minor junk); 8GB + (untouched) swap. There's >>>> no memory pressure inside the guest either -- none was Linux (I wanted to >>>> test something on hurd, kfreebsd) and I doubt they even got to use all of >>>> their frames. >>> >>> I even tried hurd, but couldn't reproduce ... >> >> Also happens with a win10 guest, and with multiple Linuxes. >> >>> what is your qemu command >>> line and the output of host's `grep . /sys/module/kvm*/parameters/*`? >> >> qemu-system-x86_64 -enable-kvm -m 2048 -vga qxl -usbdevice tablet \ >> -net bridge -net nic \ >> -drive file="$DISK",cache=writeback,index=0,media=disk,discard=on >> >> qemu-system-x86_64 -enable-kvm -m 2048 -vga qxl -usbdevice tablet \ >> -net bridge -net nic \ >> -drive file="$DISK",cache=unsafe,index=0,media=disk,discard=on,if=virtio,format=raw >> >> /sys/module/kvm/parameters/halt_poll_ns:200000 >> /sys/module/kvm/parameters/halt_poll_ns_grow:2 >> /sys/module/kvm/parameters/halt_poll_ns_shrink:0 >> /sys/module/kvm/parameters/ignore_msrs:N >> /sys/module/kvm/parameters/kvmclock_periodic_sync:Y >> /sys/module/kvm/parameters/lapic_timer_advance_ns:0 >> /sys/module/kvm/parameters/min_timer_period_us:500 >> /sys/module/kvm/parameters/tsc_tolerance_ppm:250 >> /sys/module/kvm/parameters/vector_hashing:Y >> /sys/module/kvm_amd/parameters/avic:0 >> /sys/module/kvm_amd/parameters/nested:1 >> /sys/module/kvm_amd/parameters/npt:1 >> /sys/module/kvm_amd/parameters/vls:0 >> >>>> Also, it doesn't reproduce for me on 4.12. >>> >>> Great info ... the most suspicious between v4.12 and v4.13-rc5 is the >>> series with dcdca5fed5f6 ("x86: kvm: mmu: make spte mmio mask more >>> explicit"), does reverting it help? >>> >>> `git revert ce00053b1cfca312c22e2a6465451f1862561eab~1..995f00a619584e65e53eff372d9b73b121a7bad5` >> >> Alas, doesn't seem to help. >> >> I've first installed a Debian stretch guest, the host survived both the >> installation and subsequent fooling around. But then I started a win10 >> guest which splatted as soon as the initial screen. > > Can you check if disabling THP on the host also fixes it for you? I > would also try commit 1372324b328cd5dabaef5e345e37ad48c63df2a9 to > identify whether it was caused by a KVM change in 4.13 or something else. For the OOM testcase, the splat will disappear if disabling THP. Regards, Wanpeng Li ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: kvm splat in mmu_spte_clear_track_bits 2017-08-24 7:43 ` Wanpeng Li @ 2017-08-25 13:14 ` Adam Borowski 2017-08-25 13:40 ` Paolo Bonzini 0 siblings, 1 reply; 29+ messages in thread From: Adam Borowski @ 2017-08-25 13:14 UTC (permalink / raw) To: Wanpeng Li; +Cc: Paolo Bonzini, Radim Krčmář, kvm, linux-kernel On Thu, Aug 24, 2017 at 03:43:55PM +0800, Wanpeng Li wrote: > 2017-08-23 20:22 GMT+08:00 Paolo Bonzini <pbonzini@redhat.com>: > > On 22/08/2017 00:32, Adam Borowski wrote: > >> On Mon, Aug 21, 2017 at 09:58:34PM +0200, Radim Krčmář wrote: > >>> 2017-08-21 21:12+0200, Adam Borowski: > >>>> Also, it doesn't reproduce for me on 4.12. > >>> > >>> Great info ... the most suspicious between v4.12 and v4.13-rc5 is the > >>> series with dcdca5fed5f6 ("x86: kvm: mmu: make spte mmio mask more > >>> explicit"), does reverting it help? > >>> > >>> `git revert ce00053b1cfca312c22e2a6465451f1862561eab~1..995f00a619584e65e53eff372d9b73b121a7bad5` > >> > >> Alas, doesn't seem to help. > >> > >> I've first installed a Debian stretch guest, the host survived both the > >> installation and subsequent fooling around. But then I started a win10 > >> guest which splatted as soon as the initial screen. > > > > Can you check if disabling THP on the host also fixes it for you? As in: ? echo never >/sys/kernel/mm/transparent_hugepage/enabled echo never >/sys/kernel/mm/transparent_hugepage/defrag Still reproduces, with or without reverting ce00053b1cfca312c22e2a6465451f1862561eab~1..995f00a619584e65e53eff372d9b73b121a7bad5 > > I would also try commit 1372324b328cd5dabaef5e345e37ad48c63df2a9 to > > identify whether it was caused by a KVM change in 4.13 or something > > else. I've ran different guests for a couple of hours, no explosions. Thus it looks like updating Cornelia's email address isn't the cause. Too bad, there's 15k commits between 1372324b and 7f680d7ec315. > For the OOM testcase, the splat will disappear if disabling THP. Meow! -- ⢀⣴⠾⠻⢶⣦⠀ ⣾⠁⢰⠒⠀⣿⡁ Vat kind uf sufficiently advanced technology iz dis!? ⢿⡄⠘⠷⠚⠋⠀ -- Genghis Ht'rok'din ⠈⠳⣄⠀⠀⠀⠀ ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: kvm splat in mmu_spte_clear_track_bits 2017-08-25 13:14 ` Adam Borowski @ 2017-08-25 13:40 ` Paolo Bonzini 2017-08-27 12:35 ` Adam Borowski 0 siblings, 1 reply; 29+ messages in thread From: Paolo Bonzini @ 2017-08-25 13:40 UTC (permalink / raw) To: Adam Borowski, Wanpeng Li; +Cc: Radim Krčmář, kvm, linux-kernel On 25/08/2017 15:14, Adam Borowski wrote: >>> I would also try commit 1372324b328cd5dabaef5e345e37ad48c63df2a9 to >>> identify whether it was caused by a KVM change in 4.13 or something >>> else. > I've ran different guests for a couple of hours, no explosions. Thus it > looks like updating Cornelia's email address isn't the cause. > > Too bad, there's 15k commits between 1372324b and 7f680d7ec315. So: - 4.12 works - 1372324b328cd5dabaef5e345e37ad48c63df2a9 works - 4.13-rc6 fails The next ones to test are going to be c136b84393d4e340e1b53fc7f737dd5827b19ee5 and 4.13-rc1. This can be a starting point for bisection or pointing the finger at someone else. ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: kvm splat in mmu_spte_clear_track_bits 2017-08-25 13:40 ` Paolo Bonzini @ 2017-08-27 12:35 ` Adam Borowski 2017-08-28 15:26 ` Bernhard Held 0 siblings, 1 reply; 29+ messages in thread From: Adam Borowski @ 2017-08-27 12:35 UTC (permalink / raw) To: Paolo Bonzini; +Cc: Wanpeng Li, Radim Krčmář, kvm, linux-kernel On Fri, Aug 25, 2017 at 03:40:50PM +0200, Paolo Bonzini wrote: > On 25/08/2017 15:14, Adam Borowski wrote: > >>> I would also try commit 1372324b328cd5dabaef5e345e37ad48c63df2a9 to > >>> identify whether it was caused by a KVM change in 4.13 or something > >>> else. > > I've ran different guests for a couple of hours, no explosions. Thus it > > looks like updating Cornelia's email address isn't the cause. > > > > Too bad, there's 15k commits between 1372324b and 7f680d7ec315. > > So: > > - 4.12 works > > - 1372324b328cd5dabaef5e345e37ad48c63df2a9 works > > - 4.13-rc6 fails > > The next ones to test are going to be > c136b84393d4e340e1b53fc7f737dd5827b19ee5 and 4.13-rc1. c136b84393d4e340e1b53fc7f737dd5827b19ee5 works 4.13-rc1 works Which actually isn't so surprising, as I pull from linus/master every Sunday night/Monday morning, so I'd would have noticed KVM breakage sooner. 4.13-rc4 works 4.13-rc5 works (but ↓) v4.13-rc5-173-g58d4e450a490 works... uh oh. It shouldn't -- it's merge-base between mainline and what I was running on the initial crash, and I'm sure anything non-mainline I had isn't the culprit. After a couple of hours of running various loads inside and outside KVM, I finally got it to crash. 4.13-rc5 retested fails Crashed only after two hours or so of testing. 4.13-rc4 apparently works It survived several hours of varied tests (like 5 debian-installer runs, a win10 point release upgrade, some hurd package building, openbsd, etc), all while the host was likewise busy. Thus: to the best of my knowledge, the problem is between 4.13-rc4 and 4.13-rc5 but I wouldn't bet my life on it. Meow! -- ⢀⣴⠾⠻⢶⣦⠀ ⣾⠁⢰⠒⠀⣿⡁ Vat kind uf sufficiently advanced technology iz dis!? ⢿⡄⠘⠷⠚⠋⠀ -- Genghis Ht'rok'din ⠈⠳⣄⠀⠀⠀⠀ ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: kvm splat in mmu_spte_clear_track_bits 2017-08-27 12:35 ` Adam Borowski @ 2017-08-28 15:26 ` Bernhard Held 2017-08-28 16:01 ` Takashi Iwai 2017-08-28 16:56 ` Nadav Amit 0 siblings, 2 replies; 29+ messages in thread From: Bernhard Held @ 2017-08-28 15:26 UTC (permalink / raw) To: Adam Borowski, Paolo Bonzini Cc: Wanpeng Li, Radim Krčmář, kvm, linux-kernel On 08/27/2017 at 02:35 PM, Adam Borowski wrote: > 4.13-rc5 retested fails > Crashed only after two hours or so of testing. > > 4.13-rc4 apparently works > It survived several hours of varied tests (like 5 debian-installer runs, a > win10 point release upgrade, some hurd package building, openbsd, etc), > all while the host was likewise busy. > > Thus: to the best of my knowledge, the problem is between 4.13-rc4 and 4.13-rc5 > but I wouldn't bet my life on it. I get crashes with Win10 in kvm with 4.13-rc5. 4.13-rc4 works for me. THP seems to accelerate the crash, but that's not 100% sure. There's still no crash after reverting merge 27df70 on 4.13-rc7. There are 21 commits in this merge, 10 are mm-related: $ git log 4e082e9ba7cd..e86b298bebf7 --pretty=oneline --abbrev-commit e86b298bebf7 userfaultfd: replace ENOSPC with ESRCH in case mm has gone during copy/zeropage f357e345eef7 zram: rework copy of compressor name in comp_algorithm_store() aac2fea94f7a rmap: do not call mmu_notifier_invalidate_page() under ptl d041353dc98a mm: fix list corruptions on shmem shrinklist af54aed94bf3 mm/balloon_compaction.c: don't zero ballooned pages c0a6a5ae6b5d MAINTAINERS: copy virtio on balloon_compaction.c b3a81d0841a9 mm: fix KSM data corruption 99baac21e458 mm: fix MADV_[FREE|DONTNEED] TLB flush miss problem 0a2dd266dd6b mm: make tlb_flush_pending global 56236a59556c mm: refactor TLB gathering API a9b802500ebb Revert "mm: numa: defer TLB flush for THP migration as long as possible" 0a2c40487f3e mm: migrate: fix barriers around tlb_flush_pending 16af97dc5a89 mm: migrate: prevent racy access to tlb_flush_pending 9eeb52ae712e fault-inject: fix wrong should_fail() decision in task context 4e98ebe5f435 test_kmod: fix small memory leak on filesystem tests 9c56771316ef test_kmod: fix the lock in register_test_dev_kmod() 434b06ae23ba test_kmod: fix bug which allows negative values on two config options a4afe8cdec16 test_kmod: fix spelling mistake: "EMTPY" -> "EMPTY" 5af10dfd0afc userfaultfd: hugetlbfs: remove superfluous page unlock in VM_SHARED case 75dddef32514 mm: ratelimit PFNs busy info message d507e2ebd2c7 mm: fix global NR_SLAB_.*CLAIMABLE counter reads Any hint on what to test first is welcome! Bernhard ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: kvm splat in mmu_spte_clear_track_bits 2017-08-28 15:26 ` Bernhard Held @ 2017-08-28 16:01 ` Takashi Iwai 2017-08-28 16:07 ` Bernhard Held 2017-08-28 16:56 ` Nadav Amit 1 sibling, 1 reply; 29+ messages in thread From: Takashi Iwai @ 2017-08-28 16:01 UTC (permalink / raw) To: Bernhard Held Cc: Adam Borowski, Paolo Bonzini, Wanpeng Li, Radim Krčmář, kvm, linux-kernel On Mon, 28 Aug 2017 17:26:05 +0200, Bernhard Held wrote: > > On 08/27/2017 at 02:35 PM, Adam Borowski wrote: > > 4.13-rc5 retested fails > > Crashed only after two hours or so of testing. > > > > 4.13-rc4 apparently works > > It survived several hours of varied tests (like 5 debian-installer runs, a > > win10 point release upgrade, some hurd package building, openbsd, etc), > > all while the host was likewise busy. > > > > Thus: to the best of my knowledge, the problem is between 4.13-rc4 and 4.13-rc5 > > but I wouldn't bet my life on it. > > I get crashes with Win10 in kvm with 4.13-rc5. 4.13-rc4 works for me. THP seems to accelerate the crash, but that's not 100% sure. > > There's still no crash after reverting merge 27df70 on 4.13-rc7. There are 21 commits in this merge, 10 are mm-related: > > $ git log 4e082e9ba7cd..e86b298bebf7 --pretty=oneline --abbrev-commit > e86b298bebf7 userfaultfd: replace ENOSPC with ESRCH in case mm has gone during copy/zeropage > f357e345eef7 zram: rework copy of compressor name in comp_algorithm_store() > aac2fea94f7a rmap: do not call mmu_notifier_invalidate_page() under ptl > d041353dc98a mm: fix list corruptions on shmem shrinklist > af54aed94bf3 mm/balloon_compaction.c: don't zero ballooned pages > c0a6a5ae6b5d MAINTAINERS: copy virtio on balloon_compaction.c > b3a81d0841a9 mm: fix KSM data corruption > 99baac21e458 mm: fix MADV_[FREE|DONTNEED] TLB flush miss problem > 0a2dd266dd6b mm: make tlb_flush_pending global > 56236a59556c mm: refactor TLB gathering API > a9b802500ebb Revert "mm: numa: defer TLB flush for THP migration as long as possible" > 0a2c40487f3e mm: migrate: fix barriers around tlb_flush_pending > 16af97dc5a89 mm: migrate: prevent racy access to tlb_flush_pending > 9eeb52ae712e fault-inject: fix wrong should_fail() decision in task context > 4e98ebe5f435 test_kmod: fix small memory leak on filesystem tests > 9c56771316ef test_kmod: fix the lock in register_test_dev_kmod() > 434b06ae23ba test_kmod: fix bug which allows negative values on two config options > a4afe8cdec16 test_kmod: fix spelling mistake: "EMTPY" -> "EMPTY" > 5af10dfd0afc userfaultfd: hugetlbfs: remove superfluous page unlock in VM_SHARED case > 75dddef32514 mm: ratelimit PFNs busy info message > d507e2ebd2c7 mm: fix global NR_SLAB_.*CLAIMABLE counter reads > > Any hint on what to test first is welcome! Did you get the crash reliably? I've been struggling how to trigger it efficiently, but currently in vain. The memory pressure isn't a single key to trigger it, as it seems... thanks, Takashi ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: kvm splat in mmu_spte_clear_track_bits 2017-08-28 16:01 ` Takashi Iwai @ 2017-08-28 16:07 ` Bernhard Held 2017-08-28 16:17 ` Takashi Iwai 0 siblings, 1 reply; 29+ messages in thread From: Bernhard Held @ 2017-08-28 16:07 UTC (permalink / raw) To: Takashi Iwai Cc: Adam Borowski, Paolo Bonzini, Wanpeng Li, Radim Krčmář, kvm, linux-kernel On 08/28/2017 at 06:01 PM, Takashi Iwai wrote: > On Mon, 28 Aug 2017 17:26:05 +0200, > Bernhard Held wrote: >> I get crashes with Win10 in kvm >> > Did you get the crash reliably? > I've been struggling how to trigger it efficiently, but currently in > vain. The memory pressure isn't a single key to trigger it, as it > seems... Yes, I get the crash pretty reliable with Win10 in kvm after 10 to 30 minutes. Some workload in Windows seems to be necessary to trigger the bug. ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: kvm splat in mmu_spte_clear_track_bits 2017-08-28 16:07 ` Bernhard Held @ 2017-08-28 16:17 ` Takashi Iwai 0 siblings, 0 replies; 29+ messages in thread From: Takashi Iwai @ 2017-08-28 16:17 UTC (permalink / raw) To: Bernhard Held Cc: Adam Borowski, Paolo Bonzini, Wanpeng Li, Radim Krčmář, kvm, linux-kernel On Mon, 28 Aug 2017 18:07:37 +0200, Bernhard Held wrote: > > On 08/28/2017 at 06:01 PM, Takashi Iwai wrote: > > On Mon, 28 Aug 2017 17:26:05 +0200, > > Bernhard Held wrote: > >> I get crashes with Win10 in kvm > >> > > Did you get the crash reliably? > > I've been struggling how to trigger it efficiently, but currently in > > vain. The memory pressure isn't a single key to trigger it, as it > > seems... > > Yes, I get the crash pretty reliable with Win10 in kvm after 10 to 30 minutes. Some workload in Windows seems to be necessary to trigger the bug. OK, thanks, it's good to know. Unfortunately I have no Windows on my machine, so it doesn't work for me ;) But it implies that rather the workload in VM is likely more essential than the workload on host. Takashi ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: kvm splat in mmu_spte_clear_track_bits 2017-08-28 15:26 ` Bernhard Held 2017-08-28 16:01 ` Takashi Iwai @ 2017-08-28 16:56 ` Nadav Amit 2017-08-29 9:19 ` Bernhard Held 2017-08-29 12:57 ` Mike Galbraith 1 sibling, 2 replies; 29+ messages in thread From: Nadav Amit @ 2017-08-28 16:56 UTC (permalink / raw) To: Bernhard Held Cc: Adam Borowski, Paolo Bonzini, Wanpeng Li, Radim Krčmář, kvm, linux-kernel Bernhard Held <berny156@gmx.de> wrote: > On 08/27/2017 at 02:35 PM, Adam Borowski wrote: >> 4.13-rc5 retested fails >> Crashed only after two hours or so of testing. >> 4.13-rc4 apparently works >> It survived several hours of varied tests (like 5 debian-installer runs, a >> win10 point release upgrade, some hurd package building, openbsd, etc), >> all while the host was likewise busy. >> Thus: to the best of my knowledge, the problem is between 4.13-rc4 and 4.13-rc5 >> but I wouldn't bet my life on it. > > I get crashes with Win10 in kvm with 4.13-rc5. 4.13-rc4 works for me. THP seems to accelerate the crash, but that's not 100% sure. > > There's still no crash after reverting merge 27df70 on 4.13-rc7. There are 21 commits in this merge, 10 are mm-related: > > $ git log 4e082e9ba7cd..e86b298bebf7 --pretty=oneline --abbrev-commit > e86b298bebf7 userfaultfd: replace ENOSPC with ESRCH in case mm has gone during copy/zeropage > f357e345eef7 zram: rework copy of compressor name in comp_algorithm_store() > aac2fea94f7a rmap: do not call mmu_notifier_invalidate_page() under ptl > d041353dc98a mm: fix list corruptions on shmem shrinklist > af54aed94bf3 mm/balloon_compaction.c: don't zero ballooned pages > c0a6a5ae6b5d MAINTAINERS: copy virtio on balloon_compaction.c > b3a81d0841a9 mm: fix KSM data corruption > 99baac21e458 mm: fix MADV_[FREE|DONTNEED] TLB flush miss problem > 0a2dd266dd6b mm: make tlb_flush_pending global > 56236a59556c mm: refactor TLB gathering API > a9b802500ebb Revert "mm: numa: defer TLB flush for THP migration as long as possible" > 0a2c40487f3e mm: migrate: fix barriers around tlb_flush_pending > 16af97dc5a89 mm: migrate: prevent racy access to tlb_flush_pending > 9eeb52ae712e fault-inject: fix wrong should_fail() decision in task context > 4e98ebe5f435 test_kmod: fix small memory leak on filesystem tests > 9c56771316ef test_kmod: fix the lock in register_test_dev_kmod() > 434b06ae23ba test_kmod: fix bug which allows negative values on two config options > a4afe8cdec16 test_kmod: fix spelling mistake: "EMTPY" -> "EMPTY" > 5af10dfd0afc userfaultfd: hugetlbfs: remove superfluous page unlock in VM_SHARED case > 75dddef32514 mm: ratelimit PFNs busy info message > d507e2ebd2c7 mm: fix global NR_SLAB_.*CLAIMABLE counter reads Don’t blame me for the TLB stuff... My money is on aac2fea94f7a . ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: kvm splat in mmu_spte_clear_track_bits 2017-08-28 16:56 ` Nadav Amit @ 2017-08-29 9:19 ` Bernhard Held [not found] ` <s5hh8wq8ruy.wl-tiwai@suse.de> 2017-08-29 15:53 ` Nadav Amit 2017-08-29 12:57 ` Mike Galbraith 1 sibling, 2 replies; 29+ messages in thread From: Bernhard Held @ 2017-08-29 9:19 UTC (permalink / raw) To: Nadav Amit Cc: Adam Borowski, Paolo Bonzini, Wanpeng Li, Radim Krčmář, kvm, linux-kernel On 08/28/2017 at 06:56 PM, Nadav Amit wrote: > Bernhard Held <berny156@gmx.de> wrote: > >> On 08/27/2017 at 02:35 PM, Adam Borowski wrote: >>> 4.13-rc5 retested fails >>> Crashed only after two hours or so of testing. >>> 4.13-rc4 apparently works >>> It survived several hours of varied tests (like 5 debian-installer runs, a >>> win10 point release upgrade, some hurd package building, openbsd, etc), >>> all while the host was likewise busy. >>> Thus: to the best of my knowledge, the problem is between 4.13-rc4 and 4.13-rc5 >>> but I wouldn't bet my life on it. >> >> I get crashes with Win10 in kvm with 4.13-rc5. 4.13-rc4 works for me. THP seems to accelerate the crash, but that's not 100% sure. >> >> There's still no crash after reverting merge 27df70 on 4.13-rc7. There are 21 commits in this merge, 10 are mm-related: >> >> $ git log 4e082e9ba7cd..e86b298bebf7 --pretty=oneline --abbrev-commit >> e86b298bebf7 userfaultfd: replace ENOSPC with ESRCH in case mm has gone during copy/zeropage >> f357e345eef7 zram: rework copy of compressor name in comp_algorithm_store() >> aac2fea94f7a rmap: do not call mmu_notifier_invalidate_page() under ptl >> d041353dc98a mm: fix list corruptions on shmem shrinklist >> af54aed94bf3 mm/balloon_compaction.c: don't zero ballooned pages >> c0a6a5ae6b5d MAINTAINERS: copy virtio on balloon_compaction.c >> b3a81d0841a9 mm: fix KSM data corruption >> 99baac21e458 mm: fix MADV_[FREE|DONTNEED] TLB flush miss problem >> 0a2dd266dd6b mm: make tlb_flush_pending global >> 56236a59556c mm: refactor TLB gathering API >> a9b802500ebb Revert "mm: numa: defer TLB flush for THP migration as long as possible" >> 0a2c40487f3e mm: migrate: fix barriers around tlb_flush_pending >> 16af97dc5a89 mm: migrate: prevent racy access to tlb_flush_pending >> 9eeb52ae712e fault-inject: fix wrong should_fail() decision in task context >> 4e98ebe5f435 test_kmod: fix small memory leak on filesystem tests >> 9c56771316ef test_kmod: fix the lock in register_test_dev_kmod() >> 434b06ae23ba test_kmod: fix bug which allows negative values on two config options >> a4afe8cdec16 test_kmod: fix spelling mistake: "EMTPY" -> "EMPTY" >> 5af10dfd0afc userfaultfd: hugetlbfs: remove superfluous page unlock in VM_SHARED case >> 75dddef32514 mm: ratelimit PFNs busy info message >> d507e2ebd2c7 mm: fix global NR_SLAB_.*CLAIMABLE counter reads > > Don’t blame me for the TLB stuff... My money is on aac2fea94f7a . Amit, thanks for your courage to expose your patch! I'm more and more confident that aac2fea94f7a is the culprit. Maybe it just accelerates the triggering of the splash. To be more sure the kernel needs to be tested for a couple of days. It would be great if others could assist in testing aac2fea94f7a. Have fun, Bernhard ^ permalink raw reply [flat|nested] 29+ messages in thread
[parent not found: <s5hh8wq8ruy.wl-tiwai@suse.de>]
* Re: kvm splat in mmu_spte_clear_track_bits [not found] ` <s5hh8wq8ruy.wl-tiwai@suse.de> @ 2017-08-29 12:59 ` Adam Borowski 2017-08-29 14:09 ` Andrea Arcangeli 0 siblings, 1 reply; 29+ messages in thread From: Adam Borowski @ 2017-08-29 12:59 UTC (permalink / raw) To: Takashi Iwai Cc: Bernhard Held, Nadav Amit, Paolo Bonzini, Wanpeng Li, Radim Krčmář, Joerg Roedel, Kirill A. Shutemov, Andrew Morton, Linus Torvalds, kvm, linux-kernel On Tue, Aug 29, 2017 at 02:45:41PM +0200, Takashi Iwai wrote: > [Put more people to Cc, sorry for growing too much...] We're all interested in 4.13.0 not crashing on us, so that's ok. > On Tue, 29 Aug 2017 11:19:13 +0200, > Bernhard Held wrote: > > > > On 08/28/2017 at 06:56 PM, Nadav Amit wrote: > > > Don’t blame me for the TLB stuff... My money is on aac2fea94f7a . > > > > Amit, thanks for your courage to expose your patch! > > > > I'm more and more confident that aac2fea94f7a is the culprit. Maybe it > > just accelerates the triggering of the splash. To be more sure the > > kernel needs to be tested for a couple of days. It would be great if > > others could assist in testing aac2fea94f7a. > > I'm testing with the revert for a while and it seems working. With nothing but aac2fea94f7a reverted, no explosions for me either. -- ⢀⣴⠾⠻⢶⣦⠀ ⣾⠁⢰⠒⠀⣿⡁ Vat kind uf sufficiently advanced technology iz dis!? ⢿⡄⠘⠷⠚⠋⠀ -- Genghis Ht'rok'din ⠈⠳⣄⠀⠀⠀⠀ ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: kvm splat in mmu_spte_clear_track_bits 2017-08-29 12:59 ` Adam Borowski @ 2017-08-29 14:09 ` Andrea Arcangeli 2017-08-29 16:10 ` Linus Torvalds ` (2 more replies) 0 siblings, 3 replies; 29+ messages in thread From: Andrea Arcangeli @ 2017-08-29 14:09 UTC (permalink / raw) To: Adam Borowski Cc: Takashi Iwai, Bernhard Held, Nadav Amit, Paolo Bonzini, Wanpeng Li, Radim Krčmář, Joerg Roedel, Kirill A. Shutemov, Andrew Morton, Linus Torvalds, kvm, linux-kernel, Michal Hocko Hello, On Tue, Aug 29, 2017 at 02:59:23PM +0200, Adam Borowski wrote: > On Tue, Aug 29, 2017 at 02:45:41PM +0200, Takashi Iwai wrote: > > [Put more people to Cc, sorry for growing too much...] > > We're all interested in 4.13.0 not crashing on us, so that's ok. > > > On Tue, 29 Aug 2017 11:19:13 +0200, > > Bernhard Held wrote: > > > > > > On 08/28/2017 at 06:56 PM, Nadav Amit wrote: > > > > Don’t blame me for the TLB stuff... My money is on aac2fea94f7a . > > > > > > Amit, thanks for your courage to expose your patch! > > > > > > I'm more and more confident that aac2fea94f7a is the culprit. Maybe it > > > just accelerates the triggering of the splash. To be more sure the > > > kernel needs to be tested for a couple of days. It would be great if > > > others could assist in testing aac2fea94f7a. > > > > I'm testing with the revert for a while and it seems working. > > With nothing but aac2fea94f7a reverted, no explosions for me either. The aforementioned commit has 3 bugs. 1) mmu_notifier_invalidate_range cannot be used in replacement of mmu_notifier_invalidate_range_start/end. For KVM mmu_notifier_invalidate_range is a noop and rightfully so. A MMU notifier implementation has to implement either ->invalidate_range method or the invalidate_range_start/end methods, not both. And if you implement invalidate_range_start/end like KVM is forced to do, calling mmu_notifier_invalidate_range in common code is a noop for KVM. For those MMU notifiers that can get away only implementing ->invalidate_range, the ->invalidate_range is implicitly called by mmu_notifier_invalidate_range_end(). And only those secondary MMUs that share the same pagetable with the primary MMU (like AMD iommuv2) can get away only implementing ->invalidate_range. So all cases (THP on/off) are broken right now. To fix this is enough to replace mmu_notifier_invalidate_range with mmu_notifier_invalidate_range_start;mmu_notifier_invalidate_range_end. Either that or call multiple mmu_notifier_invalidate_page like before. 2) address + (1UL << compound_order(page) is buggy, it should be PAGE_SIZE << compound_order(page), it's bytes not pages, 2M not 512. 3) The whole invalidate_range thing was an attempt to call a single invalidate while walking multiple 4k ptes that maps the same THP (after a pmd virtual split without physical compound page THP split). It's unclear if the rmap_walk will always provide an address that is 2M aligned as parameter to try_to_unmap_one, in presence of THP. I think it needs also an address &= (PAGE_SIZE << compound_order(page)) - 1 to be safe. The other bug where you can reproduce the same corruption with OOM is unrelated and caused by the OOM reaper. OOM reaper was even corrupting data if a task was writing to disk and stuck in OOM in write() syscall or async io write. To fix the KVM corruption in the OOM reaper, it needs to call mmu_notifier_invalidate_start/end around oom_kill.c:unmap_page_range. This additional mmu_notifier_invalidate_start will not be good for the OOM reaper because it's yet another case (like the mmap_sem for writing) that will prevent the OOM reaper to run, so hindering its ability to hide XFS OOM deadlocks, and making those resurface. Not in KVM case because we use a spinlock to serialize against the secondary MMU activity and the KVM critical section under spinlock isn't going to allocate memory, but range_start can schedule or block on slow hardware where the secondary MMU is accessed through PCI (not KVM case). My preference is still to make the OOM reaper a config option and let it grow into the VM at zero cost if disabled, while at the same time having the option to keep the VM simpler and spend the time fixing the filesystem bugs instead (while still being able to reproduce them more easily with OOM reaper disabled). Thanks, Andrea ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: kvm splat in mmu_spte_clear_track_bits 2017-08-29 14:09 ` Andrea Arcangeli @ 2017-08-29 16:10 ` Linus Torvalds 2017-08-29 18:28 ` Jerome Glisse 2017-08-29 18:34 ` Jerome Glisse 2017-08-30 8:19 ` Michal Hocko 2 siblings, 1 reply; 29+ messages in thread From: Linus Torvalds @ 2017-08-29 16:10 UTC (permalink / raw) To: Andrea Arcangeli Cc: Adam Borowski, Takashi Iwai, Bernhard Held, Nadav Amit, Paolo Bonzini, Wanpeng Li, Radim Krčmář, Joerg Roedel, Kirill A. Shutemov, Andrew Morton, kvm, linux-kernel, Michal Hocko On Tue, Aug 29, 2017 at 7:09 AM, Andrea Arcangeli <aarcange@redhat.com> wrote: > Hello, > > On Tue, Aug 29, 2017 at 02:59:23PM +0200, Adam Borowski wrote: >> On Tue, Aug 29, 2017 at 02:45:41PM +0200, Takashi Iwai wrote: >> > [Put more people to Cc, sorry for growing too much...] >> >> We're all interested in 4.13.0 not crashing on us, so that's ok. >> >> > On Tue, 29 Aug 2017 11:19:13 +0200, >> > Bernhard Held wrote: >> > > >> > > On 08/28/2017 at 06:56 PM, Nadav Amit wrote: >> > > > Don’t blame me for the TLB stuff... My money is on aac2fea94f7a . >> > > >> > > Amit, thanks for your courage to expose your patch! >> > > >> > > I'm more and more confident that aac2fea94f7a is the culprit. Maybe it >> > > just accelerates the triggering of the splash. To be more sure the >> > > kernel needs to be tested for a couple of days. It would be great if >> > > others could assist in testing aac2fea94f7a. >> > >> > I'm testing with the revert for a while and it seems working. >> >> With nothing but aac2fea94f7a reverted, no explosions for me either. > > The aforementioned commit has 3 bugs. Yes. I'm reverting it from my tree. We should really *really* just tell the stupid MMU notifier users that they can't sleep. The MMU notifiers are not going to destroy our VM layer. I hate the damn crap, and this kind of garbage is an example of why. So Andrew - please stop taking "MMU notifier can sleep" patches. Start taking patches that fix the users, not break the VM. Linus ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: kvm splat in mmu_spte_clear_track_bits 2017-08-29 16:10 ` Linus Torvalds @ 2017-08-29 18:28 ` Jerome Glisse 0 siblings, 0 replies; 29+ messages in thread From: Jerome Glisse @ 2017-08-29 18:28 UTC (permalink / raw) To: Linus Torvalds Cc: Andrea Arcangeli, Adam Borowski, Takashi Iwai, Bernhard Held, Nadav Amit, Paolo Bonzini, Wanpeng Li, Radim Krčmář, Joerg Roedel, Kirill A. Shutemov, Andrew Morton, kvm, linux-kernel, Michal Hocko On Tue, Aug 29, 2017 at 09:10:59AM -0700, Linus Torvalds wrote: > On Tue, Aug 29, 2017 at 7:09 AM, Andrea Arcangeli <aarcange@redhat.com> wrote: > > Hello, > > > > On Tue, Aug 29, 2017 at 02:59:23PM +0200, Adam Borowski wrote: > >> On Tue, Aug 29, 2017 at 02:45:41PM +0200, Takashi Iwai wrote: > >> > [Put more people to Cc, sorry for growing too much...] > >> > >> We're all interested in 4.13.0 not crashing on us, so that's ok. > >> > >> > On Tue, 29 Aug 2017 11:19:13 +0200, > >> > Bernhard Held wrote: > >> > > > >> > > On 08/28/2017 at 06:56 PM, Nadav Amit wrote: > >> > > > Don’t blame me for the TLB stuff... My money is on aac2fea94f7a . > >> > > > >> > > Amit, thanks for your courage to expose your patch! > >> > > > >> > > I'm more and more confident that aac2fea94f7a is the culprit. Maybe it > >> > > just accelerates the triggering of the splash. To be more sure the > >> > > kernel needs to be tested for a couple of days. It would be great if > >> > > others could assist in testing aac2fea94f7a. > >> > > >> > I'm testing with the revert for a while and it seems working. > >> > >> With nothing but aac2fea94f7a reverted, no explosions for me either. > > > > The aforementioned commit has 3 bugs. > > Yes. I'm reverting it from my tree. > > We should really *really* just tell the stupid MMU notifier users that > they can't sleep. There is no way around sleeping if we ever want to support thing like GPU. To invalidate page table on GPU you need to schedule commands to do so on GPU command queue and wait for the GPU to signal that it has invalidated its page table/tlb and caches. We had this discussion before. Either we want to support all the new fancy GPGPU, AI and all the API they rely on or we should tell them sorry guys not on linux. > > The MMU notifiers are not going to destroy our VM layer. I hate the > damn crap, and this kind of garbage is an example of why. Issue here is that nobody calls mmu_notifier_invalidate_range_start/end() hence why people relied on invalidate_range() to not sleep like start/end Now we can make the decission that start/end can sleep while the range can't but then we also need to make sure that range_start/end is always called. Cheers, Jérôme ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: kvm splat in mmu_spte_clear_track_bits 2017-08-29 14:09 ` Andrea Arcangeli 2017-08-29 16:10 ` Linus Torvalds @ 2017-08-29 18:34 ` Jerome Glisse 2017-08-29 19:06 ` Linus Torvalds 2017-08-30 8:19 ` Michal Hocko 2 siblings, 1 reply; 29+ messages in thread From: Jerome Glisse @ 2017-08-29 18:34 UTC (permalink / raw) To: Andrea Arcangeli Cc: Adam Borowski, Takashi Iwai, Bernhard Held, Nadav Amit, Paolo Bonzini, Wanpeng Li, Radim Krčmář, Joerg Roedel, Kirill A. Shutemov, Andrew Morton, Linus Torvalds, kvm, linux-kernel, Michal Hocko On Tue, Aug 29, 2017 at 04:09:24PM +0200, Andrea Arcangeli wrote: > Hello, > > On Tue, Aug 29, 2017 at 02:59:23PM +0200, Adam Borowski wrote: > > On Tue, Aug 29, 2017 at 02:45:41PM +0200, Takashi Iwai wrote: > > > [Put more people to Cc, sorry for growing too much...] > > > > We're all interested in 4.13.0 not crashing on us, so that's ok. > > > > > On Tue, 29 Aug 2017 11:19:13 +0200, > > > Bernhard Held wrote: > > > > > > > > On 08/28/2017 at 06:56 PM, Nadav Amit wrote: > > > > > Don’t blame me for the TLB stuff... My money is on aac2fea94f7a . > > > > > > > > Amit, thanks for your courage to expose your patch! > > > > > > > > I'm more and more confident that aac2fea94f7a is the culprit. Maybe it > > > > just accelerates the triggering of the splash. To be more sure the > > > > kernel needs to be tested for a couple of days. It would be great if > > > > others could assist in testing aac2fea94f7a. > > > > > > I'm testing with the revert for a while and it seems working. > > > > With nothing but aac2fea94f7a reverted, no explosions for me either. > > The aforementioned commit has 3 bugs. > > 1) mmu_notifier_invalidate_range cannot be used in replacement of > mmu_notifier_invalidate_range_start/end. For KVM > mmu_notifier_invalidate_range is a noop and rightfully so. A MMU > notifier implementation has to implement either > ->invalidate_range method or the invalidate_range_start/end > methods, not both. And if you implement invalidate_range_start/end > like KVM is forced to do, calling mmu_notifier_invalidate_range in > common code is a noop for KVM. > > For those MMU notifiers that can get away only implementing > ->invalidate_range, the ->invalidate_range is implicitly called by > mmu_notifier_invalidate_range_end(). And only those secondary MMUs > that share the same pagetable with the primary MMU (like AMD > iommuv2) can get away only implementing ->invalidate_range. > > So all cases (THP on/off) are broken right now. > > To fix this is enough to replace mmu_notifier_invalidate_range with > mmu_notifier_invalidate_range_start;mmu_notifier_invalidate_range_end. Either > that or call multiple mmu_notifier_invalidate_page like before. Kirill did regress invalidate_page as it use to be call outside the spinlock and now it is call inside the spinlock thus reverting will introduce back a regression. You can refer to the thread about it: https://lkml.org/lkml/2017/8/9/418 Jérôme ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: kvm splat in mmu_spte_clear_track_bits 2017-08-29 18:34 ` Jerome Glisse @ 2017-08-29 19:06 ` Linus Torvalds 2017-08-29 19:13 ` Jerome Glisse 0 siblings, 1 reply; 29+ messages in thread From: Linus Torvalds @ 2017-08-29 19:06 UTC (permalink / raw) To: Jerome Glisse Cc: Andrea Arcangeli, Adam Borowski, Takashi Iwai, Bernhard Held, Nadav Amit, Paolo Bonzini, Wanpeng Li, Radim Krčmář, Joerg Roedel, Kirill A. Shutemov, Andrew Morton, kvm, linux-kernel, Michal Hocko On Tue, Aug 29, 2017 at 11:34 AM, Jerome Glisse <jglisse@redhat.com> wrote: > > Kirill did regress invalidate_page as it use to be call outside the > spinlock and now it is call inside the spinlock thus reverting will > introduce back a regression. Honestly, this MMU notifier thing has been nothing but a badly designed mistake from beginning to end, and bad rules for what can sleep and what can not are one fundamental problem. There are fundamentally two levels of VM locking, and those two levels are not going to go away, and we're not budging on them: - there's the "virtual address" level, which can block. We have a nice mmap_semaphore, and we guarantee that it's held for writing for all changes to the virtual memory layout This is the "mmap/munmap" kind of granularity. The mmu callbacks at *this* level are fine to block. - then there is the "page level" VM handling, and honestly, that *fundamentally* uses a spinlock. If we look at a particular page, that page is meaningless without the lock. Really. I honestly believe that any MMU callback at this level needs to be atomic. Some of the absolutely *have* to be (that "change_pte", for example). In that second case, we might have a "begin/end" surrounding the actual page table walk. And that might sleep, but then it *fundamentally* cannot actually be able some particular single page or stable range. Because without the page table spinlock, no such stability exists. It's purely a "we are not going to start looking at this range" kind of thing. I really don't understand why the nVidia crap cannot follow those simple rules. Because either (a) you're working with virtual addresses, and you should be able to work on that virtual layer (b) you're actually working with physical pages, and you can just hold on to those physical pages yourself. I really detest our MMU callbacks. I shouldn't have allowed them to be merged. And I definitely shoul.dn't allow them to screw up our VM layer. But we have them, and we should work at making sure people do sane things. And yes, those sane things may include (a) guaranteeing that the start/end range calls are always done around the actual locked region. (b) adding a ton of validation so that people *see* then they break the rules. Even when they don't use some random notifier crud. That (b) may involve adding a number of "might_sleep()" calls (not deep in the notifiers themselves, but in the actual wrapper functions even when notifiers are compiled out entirely!), but also adding calls to validate those kinds of "you can't call mmu_notifier_invalidate_page() without having first called mmu_notifier_invalidate_range_start() in a sleepable context". But (b) definitely should also be a very real onus on the mmu notifiers themselves. No way can we sleep when we're traversing page tables. We hold a page table lock. We can sleep before and after, but not during actual page traversal. Linus ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: kvm splat in mmu_spte_clear_track_bits 2017-08-29 19:06 ` Linus Torvalds @ 2017-08-29 19:13 ` Jerome Glisse 2017-08-29 19:38 ` Linus Torvalds 0 siblings, 1 reply; 29+ messages in thread From: Jerome Glisse @ 2017-08-29 19:13 UTC (permalink / raw) To: Linus Torvalds Cc: Andrea Arcangeli, Adam Borowski, Takashi Iwai, Bernhard Held, Nadav Amit, Paolo Bonzini, Wanpeng Li, Radim Krčmář, Joerg Roedel, Kirill A. Shutemov, Andrew Morton, kvm, linux-kernel, Michal Hocko On Tue, Aug 29, 2017 at 12:06:42PM -0700, Linus Torvalds wrote: > On Tue, Aug 29, 2017 at 11:34 AM, Jerome Glisse <jglisse@redhat.com> wrote: > > > > Kirill did regress invalidate_page as it use to be call outside the > > spinlock and now it is call inside the spinlock thus reverting will > > introduce back a regression. > > Honestly, this MMU notifier thing has been nothing but a badly > designed mistake from beginning to end, and bad rules for what can > sleep and what can not are one fundamental problem. > > There are fundamentally two levels of VM locking, and those two levels > are not going to go away, and we're not budging on them: > > - there's the "virtual address" level, which can block. We have a > nice mmap_semaphore, and we guarantee that it's held for writing for > all changes to the virtual memory layout > > This is the "mmap/munmap" kind of granularity. The mmu callbacks at > *this* level are fine to block. > > - then there is the "page level" VM handling, and honestly, that > *fundamentally* uses a spinlock. If we look at a particular page, that > page is meaningless without the lock. Really. > > I honestly believe that any MMU callback at this level needs to be > atomic. Some of the absolutely *have* to be (that "change_pte", for > example). > > In that second case, we might have a "begin/end" surrounding the > actual page table walk. And that might sleep, but then it > *fundamentally* cannot actually be able some particular single page > or stable range. Because without the page table spinlock, no such > stability exists. It's purely a "we are not going to start looking at > this range" kind of thing. > > I really don't understand why the nVidia crap cannot follow those > simple rules. Because either > > (a) you're working with virtual addresses, and you should be able to > work on that virtual layer > > (b) you're actually working with physical pages, and you can just > hold on to those physical pages yourself. > > I really detest our MMU callbacks. I shouldn't have allowed them to be > merged. And I definitely shoul.dn't allow them to screw up our VM > layer. > > But we have them, and we should work at making sure people do sane things. > > And yes, those sane things may include > > (a) guaranteeing that the start/end range calls are always done > around the actual locked region. > > (b) adding a ton of validation so that people *see* then they break > the rules. Even when they don't use some random notifier crud. > > That (b) may involve adding a number of "might_sleep()" calls (not > deep in the notifiers themselves, but in the actual wrapper functions > even when notifiers are compiled out entirely!), but also adding calls > to validate those kinds of "you can't call > mmu_notifier_invalidate_page() without having first called > mmu_notifier_invalidate_range_start() in a sleepable context". > > But (b) definitely should also be a very real onus on the mmu > notifiers themselves. No way can we sleep when we're traversing page > tables. We hold a page table lock. We can sleep before and after, but > not during actual page traversal. Yes and i am fine with page traversal being under spinlock and not being able to sleep during that. I agree doing otherwise would be insane. It is just that the existing behavior of try_to_unmap_one() and page_mkclean_one() have been broken and that no mmu_notifier calls were added around the lock section. I sent a patch that properly compute the range to invalidate and move to invalidate_range() but is lacking the invalidate_range_start()/ end() so i am gonna respin that with range_start/end bracketing and assume the worse for the range of address. Cheers, Jérôme ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: kvm splat in mmu_spte_clear_track_bits 2017-08-29 19:13 ` Jerome Glisse @ 2017-08-29 19:38 ` Linus Torvalds 2017-08-29 20:49 ` Andrea Arcangeli 0 siblings, 1 reply; 29+ messages in thread From: Linus Torvalds @ 2017-08-29 19:38 UTC (permalink / raw) To: Jerome Glisse Cc: Andrea Arcangeli, Adam Borowski, Takashi Iwai, Bernhard Held, Nadav Amit, Paolo Bonzini, Wanpeng Li, Radim Krčmář, Joerg Roedel, Kirill A. Shutemov, Andrew Morton, kvm, linux-kernel, Michal Hocko On Tue, Aug 29, 2017 at 12:13 PM, Jerome Glisse <jglisse@redhat.com> wrote: > > Yes and i am fine with page traversal being under spinlock and not > being able to sleep during that. I agree doing otherwise would be > insane. It is just that the existing behavior of try_to_unmap_one() > and page_mkclean_one() have been broken and that no mmu_notifier > calls were added around the lock section. Yeah, I'm actually surprised that ever worked. I'm surprised that try_to_unmap_one didn't hold any locks earlier. In fact, I think at least some of them *did* already hold the page table locks: ptep_clear_flush_young_notify() and friends very much should have always held them. So it's literally just that mmu_notifier_invalidate_page() call that used to be outside all the locks, but honestly, I think that was always a bug. It means that you got notified of the page removal *after* the page was already gone and all locks had been released, so a completely *different* page could already have been mapped to that address. So I think the old code was always broken exactly because the callback wasn't serialized with the actual action. > I sent a patch that properly compute the range to invalidate and move > to invalidate_range() but is lacking the invalidate_range_start()/ > end() so i am gonna respin that with range_start/end bracketing and > assume the worse for the range of address. So surrounding it with start/end _should_ make KVM happy. KVM people, can you confirm? But I do note that there's a number of other users of that "invalidate_page" callback. I think ib_umem_notifier_invalidate_page() the exact same blocking issue, but changing to range_start/end should be good there too. amdgpu_mn_invalidate_page() and the xen/gntdev also seem to be happy being replaced with start/end. In fact, I'm wondering if this actually means that we could get rid of mmu_notifier_invalidate_page() entirely. There's only a couple of callers, and the other one seems to be fs/dax.c, and it actually seems to have the exact same issue that the try_to_unmap_one() code had: it tried to invalidate an address too late - by the time it was called, the page gad already been cleaned and locks had been released. So the more I look at that "turn mmu_notifier_invalidate_page() into invalidate_range_start/end()" the more I think that's fundamentally the right thing to do. Linus ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: kvm splat in mmu_spte_clear_track_bits 2017-08-29 19:38 ` Linus Torvalds @ 2017-08-29 20:49 ` Andrea Arcangeli 2017-08-29 20:59 ` Linus Torvalds 0 siblings, 1 reply; 29+ messages in thread From: Andrea Arcangeli @ 2017-08-29 20:49 UTC (permalink / raw) To: Linus Torvalds Cc: Jerome Glisse, Adam Borowski, Takashi Iwai, Bernhard Held, Nadav Amit, Paolo Bonzini, Wanpeng Li, Radim Krčmář, Joerg Roedel, Kirill A. Shutemov, Andrew Morton, kvm, linux-kernel, Michal Hocko Hello Linus, On Tue, Aug 29, 2017 at 12:38:43PM -0700, Linus Torvalds wrote: > On Tue, Aug 29, 2017 at 12:13 PM, Jerome Glisse <jglisse@redhat.com> wrote: > > > > Yes and i am fine with page traversal being under spinlock and not > > being able to sleep during that. I agree doing otherwise would be > > insane. It is just that the existing behavior of try_to_unmap_one() > > and page_mkclean_one() have been broken and that no mmu_notifier > > calls were added around the lock section. > > Yeah, I'm actually surprised that ever worked. I'm surprised that > try_to_unmap_one didn't hold any locks earlier. > > In fact, I think at least some of them *did* already hold the page > table locks: ptep_clear_flush_young_notify() and friends very much > should have always held them. > > So it's literally just that mmu_notifier_invalidate_page() call that > used to be outside all the locks, but honestly, I think that was > always a bug. It means that you got notified of the page removal > *after* the page was already gone and all locks had been released, so > a completely *different* page could already have been mapped to that > address. > > So I think the old code was always broken exactly because the callback > wasn't serialized with the actual action. That's a very interesting point indeed. I thought the old code was safe because even if the secondary MMU view keeps crunching on the old page for a little while after the primary MMU got a new page mapped, you cannot measure it. If you cannot measure it, it doesn't exist? I.e. undefined. The same runtime you'd get by letting the guest mode crunch on the old page for a little while after the primary MMU is already computing on a brand new page, could materialize if the guest mode CPU just got a turbo boost and computed faster before the ->invalidate_page was run inside the PT lock. If the page is swapped-in and is the same page that was unmapped by try_to_unmap_one, then the primary MMU will return using the same page that the secondary MMU was still using and it'll even be coherent (swapin is probably prevented by the page lock, but still even if it happens it doesn't look an issue). If the page is suddenly replaced while vcpu is in guest mode, no coherency could be provided anyway if the guest mode wasn't serialize by something other than the PT lock. I'll keep thinking about it, this is a quick answer. On a side note, the other major requirement for the code not to have been always broken before is that the caller must hold a refcount: the put_page executed by try_to_unmap_one before calling mmu_notifier_invalidate_page cannot free the page or it's unsafe. This requirement also goes away if using range_start/end. > > I sent a patch that properly compute the range to invalidate and move > > to invalidate_range() but is lacking the invalidate_range_start()/ > > end() so i am gonna respin that with range_start/end bracketing and > > assume the worse for the range of address. > > So surrounding it with start/end _should_ make KVM happy. > > KVM people, can you confirm? Yes that always works as far as I can tell. For our current problem using start/end like you suggested would definitely make KVM happy and it is obviously safe. It would also work to teach page_vma_mapped_walk(&pvmw) to restart after dropping the lock with page_vma_mapped_walk_done() to call mmu_notifier_invalidate_page outside the lock. If such an alternative would be safe however, entirely depends if it was always broken before which again is a very good point to think about. The common case for such function is a single pte being unmapped. > > But I do note that there's a number of other users of that > "invalidate_page" callback. > > I think ib_umem_notifier_invalidate_page() the exact same blocking > issue, but changing to range_start/end should be good there too. > > amdgpu_mn_invalidate_page() and the xen/gntdev also seem to be happy > being replaced with start/end. > > In fact, I'm wondering if this actually means that we could get rid of > mmu_notifier_invalidate_page() entirely. There's only a couple of > callers, and the other one seems to be fs/dax.c, and it actually seems > to have the exact same issue that the try_to_unmap_one() code had: it > tried to invalidate an address too late - by the time it was called, > the page gad already been cleaned and locks had been released. > > So the more I look at that "turn mmu_notifier_invalidate_page() into > invalidate_range_start/end()" the more I think that's fundamentally > the right thing to do. mmu_notifier_invalidate_page exists purely as an optimized version of mmu_notifier_invalidate_range_start/end. Furthermore when it was always run inside the PT lock it was quite handy to replace a ptep_clear_flush into a ptep_clear_flush_notify and be done with it. It also requires a single branch when the MMU notifier is unarmed. mmu_notifier_invalidate_range_start has to first block all secondary MMU page faults, and invalidate the secondary MMU _before_ the pages are freed. mmu_notifier_invalidate_range_end unblocks the secondary MMU page faults after the primary MMU has been invalidated too (i.e. after zapping the ptes). mmu_notifier_invalidate_page has the advantage that it takes the secondary MMU KVM srcu and spinlock a single time. Thanks! Andrea ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: kvm splat in mmu_spte_clear_track_bits 2017-08-29 20:49 ` Andrea Arcangeli @ 2017-08-29 20:59 ` Linus Torvalds 0 siblings, 0 replies; 29+ messages in thread From: Linus Torvalds @ 2017-08-29 20:59 UTC (permalink / raw) To: Andrea Arcangeli Cc: Jerome Glisse, Adam Borowski, Takashi Iwai, Bernhard Held, Nadav Amit, Paolo Bonzini, Wanpeng Li, Radim Krčmář, Joerg Roedel, Kirill A. Shutemov, Andrew Morton, kvm, linux-kernel, Michal Hocko On Tue, Aug 29, 2017 at 1:49 PM, Andrea Arcangeli <aarcange@redhat.com> wrote: > > mmu_notifier_invalidate_page has the advantage that it takes the > secondary MMU KVM srcu and spinlock a single time. Note that that isn't actually all that much of an advantage - it turns out that a lot of users have "range_start", but not "range_end". So in a lot of cases, the "range_start/end()" seems to be exactly as expensive as just the single "page()" call, simply because the code didn't really need the whole range, it only wanted to make sure it was invalidating its data before the range got modified. KVM ends up doing a partial case of that optimization too: it doesn't do the srcu lock in the end case, for example. It does want to keep the sequence numbers for the end case, but that's fairly cheap. So I'd much rather have the simpler rules than have duplicated interfaces for some very dubious performance advantage. Linus ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: kvm splat in mmu_spte_clear_track_bits 2017-08-29 14:09 ` Andrea Arcangeli 2017-08-29 16:10 ` Linus Torvalds 2017-08-29 18:34 ` Jerome Glisse @ 2017-08-30 8:19 ` Michal Hocko 2 siblings, 0 replies; 29+ messages in thread From: Michal Hocko @ 2017-08-30 8:19 UTC (permalink / raw) To: Andrea Arcangeli Cc: Adam Borowski, Takashi Iwai, Bernhard Held, Nadav Amit, Paolo Bonzini, Wanpeng Li, Radim Krčmář, Joerg Roedel, Kirill A. Shutemov, Andrew Morton, Linus Torvalds, kvm, linux-kernel On Tue 29-08-17 16:09:24, Andrea Arcangeli wrote: [...] > The other bug where you can reproduce the same corruption with OOM is > unrelated and caused by the OOM reaper. OOM reaper was even corrupting > data if a task was writing to disk and stuck in OOM in write() syscall > or async io write. > > To fix the KVM corruption in the OOM reaper, it needs to call > mmu_notifier_invalidate_start/end around > oom_kill.c:unmap_page_range. This additional > mmu_notifier_invalidate_start will not be good for the OOM reaper > because it's yet another case (like the mmap_sem for writing) that > will prevent the OOM reaper to run, so hindering its ability to hide > XFS OOM deadlocks, and making those resurface. Not in KVM case because > we use a spinlock to serialize against the secondary MMU activity and > the KVM critical section under spinlock isn't going to allocate > memory, but range_start can schedule or block on slow hardware where > the secondary MMU is accessed through PCI (not KVM case). I am not really familiar with mmu notifiers and what they can actually do. But from what you wrote above it is indeed not very safe to call them from the oom reaper. So I will prepare and post a patch to disable the reaper when mm_has_notifiers(). Thanks for pointing this out. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: kvm splat in mmu_spte_clear_track_bits 2017-08-29 9:19 ` Bernhard Held [not found] ` <s5hh8wq8ruy.wl-tiwai@suse.de> @ 2017-08-29 15:53 ` Nadav Amit 1 sibling, 0 replies; 29+ messages in thread From: Nadav Amit @ 2017-08-29 15:53 UTC (permalink / raw) To: Bernhard Held Cc: Adam Borowski, Paolo Bonzini, Wanpeng Li, Radim Krčmář, kvm, linux-kernel Bernhard Held <berny156@gmx.de> wrote: > On 08/28/2017 at 06:56 PM, Nadav Amit wrote: >> Bernhard Held <berny156@gmx.de> wrote: >>> On 08/27/2017 at 02:35 PM, Adam Borowski wrote: >>>> 4.13-rc5 retested fails >>>> Crashed only after two hours or so of testing. >>>> 4.13-rc4 apparently works >>>> It survived several hours of varied tests (like 5 debian-installer runs, a >>>> win10 point release upgrade, some hurd package building, openbsd, etc), >>>> all while the host was likewise busy. >>>> Thus: to the best of my knowledge, the problem is between 4.13-rc4 and 4.13-rc5 >>>> but I wouldn't bet my life on it. >>> >>> I get crashes with Win10 in kvm with 4.13-rc5. 4.13-rc4 works for me. THP seems to accelerate the crash, but that's not 100% sure. >>> >>> There's still no crash after reverting merge 27df70 on 4.13-rc7. There are 21 commits in this merge, 10 are mm-related: >>> >>> $ git log 4e082e9ba7cd..e86b298bebf7 --pretty=oneline --abbrev-commit >>> e86b298bebf7 userfaultfd: replace ENOSPC with ESRCH in case mm has gone during copy/zeropage >>> f357e345eef7 zram: rework copy of compressor name in comp_algorithm_store() >>> aac2fea94f7a rmap: do not call mmu_notifier_invalidate_page() under ptl >>> d041353dc98a mm: fix list corruptions on shmem shrinklist >>> af54aed94bf3 mm/balloon_compaction.c: don't zero ballooned pages >>> c0a6a5ae6b5d MAINTAINERS: copy virtio on balloon_compaction.c >>> b3a81d0841a9 mm: fix KSM data corruption >>> 99baac21e458 mm: fix MADV_[FREE|DONTNEED] TLB flush miss problem >>> 0a2dd266dd6b mm: make tlb_flush_pending global >>> 56236a59556c mm: refactor TLB gathering API >>> a9b802500ebb Revert "mm: numa: defer TLB flush for THP migration as long as possible" >>> 0a2c40487f3e mm: migrate: fix barriers around tlb_flush_pending >>> 16af97dc5a89 mm: migrate: prevent racy access to tlb_flush_pending >>> 9eeb52ae712e fault-inject: fix wrong should_fail() decision in task context >>> 4e98ebe5f435 test_kmod: fix small memory leak on filesystem tests >>> 9c56771316ef test_kmod: fix the lock in register_test_dev_kmod() >>> 434b06ae23ba test_kmod: fix bug which allows negative values on two config options >>> a4afe8cdec16 test_kmod: fix spelling mistake: "EMTPY" -> "EMPTY" >>> 5af10dfd0afc userfaultfd: hugetlbfs: remove superfluous page unlock in VM_SHARED case >>> 75dddef32514 mm: ratelimit PFNs busy info message >>> d507e2ebd2c7 mm: fix global NR_SLAB_.*CLAIMABLE counter reads >> Don’t blame me for the TLB stuff... My money is on aac2fea94f7a . > > Amit, thanks for your courage to expose your patch! Just for the record, aac2fea94f7a is not mine (some others are). > > I'm more and more confident that aac2fea94f7a is the culprit. Maybe it just accelerates the triggering of the splash. To be more sure the kernel needs to be tested for a couple of days. It would be great if others could assist in testing aac2fea94f7a. ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: kvm splat in mmu_spte_clear_track_bits 2017-08-28 16:56 ` Nadav Amit 2017-08-29 9:19 ` Bernhard Held @ 2017-08-29 12:57 ` Mike Galbraith 1 sibling, 0 replies; 29+ messages in thread From: Mike Galbraith @ 2017-08-29 12:57 UTC (permalink / raw) To: Nadav Amit, Bernhard Held Cc: Adam Borowski, Paolo Bonzini, Wanpeng Li, Radim Krčmář, kvm, linux-kernel, Kirill A. Shutemov On Mon, 2017-08-28 at 09:56 -0700, Nadav Amit wrote: > Bernhard Held <berny156@gmx.de> wrote: > > > On 08/27/2017 at 02:35 PM, Adam Borowski wrote: > >> 4.13-rc5 retested fails > >> Crashed only after two hours or so of testing. > >> 4.13-rc4 apparently works > >> It survived several hours of varied tests (like 5 debian-installer runs, a > >> win10 point release upgrade, some hurd package building, openbsd, etc), > >> all while the host was likewise busy. > >> Thus: to the best of my knowledge, the problem is between 4.13-rc4 and 4.13-rc5 > >> but I wouldn't bet my life on it. > > > > I get crashes with Win10 in kvm with 4.13-rc5. 4.13-rc4 works for me. THP seems to accelerate the crash, but that's not 100% sure. > > > > There's still no crash after reverting merge 27df70 on 4.13-rc7. There are 21 commits in this merge, 10 are mm-related: > > > > $ git log 4e082e9ba7cd..e86b298bebf7 --pretty=oneline --abbrev-commit > > e86b298bebf7 userfaultfd: replace ENOSPC with ESRCH in case mm has gone during copy/zeropage > > f357e345eef7 zram: rework copy of compressor name in comp_algorithm_store() > > aac2fea94f7a rmap: do not call mmu_notifier_invalidate_page() under ptl > > d041353dc98a mm: fix list corruptions on shmem shrinklist > > af54aed94bf3 mm/balloon_compaction.c: don't zero ballooned pages > > c0a6a5ae6b5d MAINTAINERS: copy virtio on balloon_compaction.c > > b3a81d0841a9 mm: fix KSM data corruption > > 99baac21e458 mm: fix MADV_[FREE|DONTNEED] TLB flush miss problem > > 0a2dd266dd6b mm: make tlb_flush_pending global > > 56236a59556c mm: refactor TLB gathering API > > a9b802500ebb Revert "mm: numa: defer TLB flush for THP migration as long as possible" > > 0a2c40487f3e mm: migrate: fix barriers around tlb_flush_pending > > 16af97dc5a89 mm: migrate: prevent racy access to tlb_flush_pending > > 9eeb52ae712e fault-inject: fix wrong should_fail() decision in task context > > 4e98ebe5f435 test_kmod: fix small memory leak on filesystem tests > > 9c56771316ef test_kmod: fix the lock in register_test_dev_kmod() > > 434b06ae23ba test_kmod: fix bug which allows negative values on two config options > > a4afe8cdec16 test_kmod: fix spelling mistake: "EMTPY" -> "EMPTY" > > 5af10dfd0afc userfaultfd: hugetlbfs: remove superfluous page unlock in VM_SHARED case > > 75dddef32514 mm: ratelimit PFNs busy info message > > d507e2ebd2c7 mm: fix global NR_SLAB_.*CLAIMABLE counter reads > > Don’t blame me for the TLB stuff... My money is on aac2fea94f7a . You may be onto something. FWIW, with an RT host/guest, I reproduced the problem yesterday in fairly short order, but today, with that commit reverted, and pushing markedly harder, nada. (hohum, intermittent bugs tend to do that, they're particularly fond of showing up about 10 seconds after you report them dead...9..8..7;) A colleague suggested going back to mmu_notifier_invalidate_page(), which I'm going to try shortly (hopefully noticing absolutely nothing the least bit 'interesting'), but first, I'm gonna CC the author of that _maybe_ culprit patch. -Mike ^ permalink raw reply [flat|nested] 29+ messages in thread
end of thread, other threads:[~2017-08-30 8:19 UTC | newest] Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2017-08-20 23:13 kvm splat in mmu_spte_clear_track_bits Adam Borowski 2017-08-21 1:26 ` Wanpeng Li 2017-08-21 19:12 ` Adam Borowski 2017-08-21 19:58 ` Radim Krčmář 2017-08-21 22:32 ` Adam Borowski 2017-08-23 12:22 ` Paolo Bonzini 2017-08-24 7:43 ` Wanpeng Li 2017-08-25 13:14 ` Adam Borowski 2017-08-25 13:40 ` Paolo Bonzini 2017-08-27 12:35 ` Adam Borowski 2017-08-28 15:26 ` Bernhard Held 2017-08-28 16:01 ` Takashi Iwai 2017-08-28 16:07 ` Bernhard Held 2017-08-28 16:17 ` Takashi Iwai 2017-08-28 16:56 ` Nadav Amit 2017-08-29 9:19 ` Bernhard Held [not found] ` <s5hh8wq8ruy.wl-tiwai@suse.de> 2017-08-29 12:59 ` Adam Borowski 2017-08-29 14:09 ` Andrea Arcangeli 2017-08-29 16:10 ` Linus Torvalds 2017-08-29 18:28 ` Jerome Glisse 2017-08-29 18:34 ` Jerome Glisse 2017-08-29 19:06 ` Linus Torvalds 2017-08-29 19:13 ` Jerome Glisse 2017-08-29 19:38 ` Linus Torvalds 2017-08-29 20:49 ` Andrea Arcangeli 2017-08-29 20:59 ` Linus Torvalds 2017-08-30 8:19 ` Michal Hocko 2017-08-29 15:53 ` Nadav Amit 2017-08-29 12:57 ` Mike Galbraith
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).