* Potential bug in TDP MMU
@ 2021-11-29 21:44 Ignat Korchagin
  2021-11-30  9:29 ` Paolo Bonzini
  0 siblings, 1 reply; 20+ messages in thread
From: Ignat Korchagin @ 2021-11-29 21:44 UTC
  To: pbonzini, kvm; +Cc: stevensd, kernel-team

Hello,

We have recently started evaluating the 5.15.y kernel series, and here is
what we occasionally get on kernel 5.15.5:

[177243.621744][T1995658] WARNING: CPU: 7 PID: 1995658 at
arch/x86/kvm/../../../virt/kvm/kvm_main.c:171
kvm_is_zone_device_pfn.part.0+0x9e/0xd0 [kvm]
[177243.647435][T1995658] Modules linked in: xt_hashlimit xt_connlimit
nf_conncount ip_set_hash_netport xt_length esp4 sit ipip tunnel4
nft_numgen nft_ct ip_gre gre xfrm_user xfrm_algo tcp_diag udp_diag
inet_diag fou6 fou ip6_tunnel tunnel6 ip_tunnel ip6_udp_tunnel
udp_tunnel cls_bpf tls xt_NFLOG xt_statistic nft_compat veth tun
overlay macvlan sch_ingress raid0 md_mod essiv dm_crypt trusted
asn1_encoder tee dm_mod dax nfnetlink_log nft_log nft_limit
nft_counter nf_tables ip6table_nat ip6table_mangle ip6table_security
ip6table_raw ip6table_filter ip6_tables xt_nat iptable_nat nf_nat
xt_TCPMSS xt_u32 xt_connmark iptable_mangle xt_owner xt_CT iptable_raw
xt_state xt_bpf xt_mark xt_conntrack xt_multiport xt_comment xt_tcpudp
xt_set xt_tcpmss iptable_filter ip_set_hash_net ip_set_hash_ip ip_set
nfnetlink sch_fq tcp_bbr nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4
8021q garp stp mrp llc skx_edac x86_pkg_temp_thermal kvm_intel kvm
irqbypass crc32_pclmul crc32c_intel aesni_intel rapl
[177243.647594][T1995658]  intel_cstate ipmi_ssif sfc intel_uncore
i2c_i801 xhci_pci i2c_smbus acpi_ipmi i40e mdio ioatdma i2c_core
xhci_hcd tpm_crb dca ipmi_si ipmi_devintf ipmi_msghandler tpm_tis
tpm_tis_core tpm tiny_power_button button fuse efivarfs ip_tables
x_tables bcmcrypt(O) crypto_simd cryptd [last unloaded: kheaders]
[177243.831600][T1995658] CPU: 7 PID: 1995658 Comm: exe Tainted: G
      O      5.15.5-cloudflare-kasan-2021.11.11 #1
[177243.831609][T1995658] Hardware name: Quanta Cloud Technology Inc.
QuantaPlex T42S-2U/T42S-2U MB (Lewisburg-4), BIOS 3B16.Q102 02/19/2020
[177243.831612][T1995658] RIP:
0010:kvm_is_zone_device_pfn.part.0+0x9e/0xd0 [kvm]
[177243.886990][T1995658] Code: 00 00 00 00 fc ff df 48 c1 ea 03 0f b6
14 02 48 89 e8 83 e0 07 83 c0 03 38 d0 7c 04 84 d2 75 0f 8b 43 34 85
c0 74 03 5b 5d c3 <0f> 0b 5b 5d c3 48 89 ef e8 c5 68 72 e1 eb e7 e8 ce
68 72 e1 eb 9b
[177243.919270][T1995658] RSP: 0018:ffff8881ec51f300 EFLAGS: 00010246
[177243.919276][T1995658] RAX: 0000000000000000 RBX: ffffea003d52e900
RCX: ffffffffc143242e
[177243.919279][T1995658] RDX: 0000000000000000 RSI: 0000000000000004
RDI: ffffea003d52e934
[177243.960080][T1995658] RBP: ffffea003d52e934 R08: 0000000000000000
R09: ffffea003d52e937
[177243.960083][T1995658] R10: fffff94007aa5d26 R11: 0000000000000000
R12: ffff88b03ffd9008
[177243.960085][T1995658] R13: 0600000f54ba4b77 R14: 0000000000000001
R15: 0600000f54ba4b01
[177243.960088][T1995658] FS:  0000000000000000(0000)
GS:ffff8897a9b80000(0000) knlGS:0000000000000000
[177244.017687][T1995658] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[177244.017691][T1995658] CR2: 000000c001503000 CR3: 0000001a1fb38006
CR4: 00000000007726e0
[177244.017693][T1995658] DR0: 0000000000000000 DR1: 0000000000000000
DR2: 0000000000000000
[177244.058771][T1995658] DR3: 0000000000000000 DR6: 00000000fffe0ff0
DR7: 0000000000000400
[177244.058774][T1995658] PKRU: 00000000
[177244.058776][T1995658] Call Trace:
[177244.058780][T1995658]  <TASK>
[177244.058782][T1995658]  kvm_set_pfn_dirty+0x120/0x1d0 [kvm]
[177244.111155][T1995658]  __handle_changed_spte+0x92e/0xca0 [kvm]
[177244.111274][T1995658]  ? alloc_tdp_mmu_page+0x370/0x370 [kvm]
[177244.133938][T1995658]  ? sched_clock_cpu+0x15/0x190
[177244.144289][T1995658]  ? _raw_spin_lock+0xc8/0xd0
[177244.144299][T1995658]  __handle_changed_spte+0x63c/0xca0 [kvm]
[177244.165796][T1995658]  ? alloc_tdp_mmu_page+0x370/0x370 [kvm]
[177244.165906][T1995658]  ? kvm_vcpu_release+0x4d/0x70 [kvm]
[177244.187769][T1995658]  ? deref_stack_reg+0xe6/0x160
[177244.187779][T1995658]  __handle_changed_spte+0x63c/0xca0 [kvm]
[177244.209044][T1995658]  ? alloc_tdp_mmu_page+0x370/0x370 [kvm]
[177244.220140][T1995658]  ? update_curr+0x18d/0x5f0
[177244.220148][T1995658]  __handle_changed_spte+0x63c/0xca0 [kvm]
[177244.240784][T1995658]  ? alloc_tdp_mmu_page+0x370/0x370 [kvm]
[177244.251827][T1995658]  zap_gfn_range+0x549/0x620 [kvm]
[177244.251922][T1995658]  ? zap_collapsible_spte_range+0x520/0x520 [kvm]
[177244.273410][T1995658]  ? smp_call_function_single+0x271/0x370
[177244.283892][T1995658]  ? _raw_spin_lock+0x81/0xd0
[177244.283900][T1995658]  ? _raw_spin_lock_bh+0xe0/0xe0
[177244.283904][T1995658]  ? vmx_vcpu_pi_load+0x14c/0x3b0 [kvm_intel]
[177244.313033][T1995658]  kvm_tdp_mmu_put_root+0x1b6/0x270 [kvm]
[177244.323097][T1995658]  mmu_free_root_page+0x219/0x2c0 [kvm]
[177244.332862][T1995658]  ? ept_invlpg+0x780/0x780 [kvm]
[177244.341982][T1995658]  ? _raw_spin_lock+0xd0/0xd0
[177244.350556][T1995658]  ? handle_pause+0x250/0x250 [kvm_intel]
[177244.360080][T1995658]  kvm_mmu_free_roots+0x1b4/0x4e0 [kvm]
[177244.369396][T1995658]  ? mmu_free_root_page+0x2c0/0x2c0 [kvm]
[177244.378777][T1995658]  ?
kvm_clear_async_pf_completion_queue+0x2f/0x510 [kvm]
[177244.389571][T1995658]  kvm_mmu_unload+0x1c/0xa0 [kvm]
[177244.398156][T1995658]  kvm_arch_destroy_vm+0x1f2/0x5c0 [kvm]
[177244.398238][T1995658]  ? mmu_notifier_unregister+0x26f/0x330
[177244.416160][T1995658]  kvm_put_kvm+0x3b1/0x8b0 [kvm]
[177244.424372][T1995658]  kvm_vcpu_release+0x4e/0x70 [kvm]
[177244.432743][T1995658]  __fput+0x1f7/0x8c0
[177244.432749][T1995658]  task_work_run+0xf8/0x1a0
[177244.447257][T1995658]  do_exit+0x97b/0x2230
[177244.447263][T1995658]  ? _raw_write_lock_irqsave+0xe0/0xe0
[177244.462773][T1995658]  ? mm_update_next_owner+0x750/0x750
[177244.471117][T1995658]  ? _raw_spin_lock_irq+0x82/0xd0
[177244.471122][T1995658]  do_group_exit+0xda/0x2a0
[177244.471126][T1995658]  get_signal+0x3be/0x1e50
[177244.471133][T1995658]  arch_do_signal_or_restart+0x244/0x17f0
[177244.502424][T1995658]  ? audit_log_exit+0x2690/0x2690
[177244.502432][T1995658]  ? shmem_evict_inode+0xad0/0xad0
[177244.518403][T1995658]  ? get_sigframe_size+0x10/0x10
[177244.526157][T1995658]  ? __seccomp_filter+0x117/0xd60
[177244.526162][T1995658]  ? audit_alloc_name+0x440/0x440
[177244.526166][T1995658]  ? get_nth_filter.part.0+0x220/0x220
[177244.526170][T1995658]  ? __audit_syscall_exit+0x794/0xa80
[177244.558471][T1995658]  exit_to_user_mode_prepare+0xcb/0x120
[177244.558478][T1995658]  syscall_exit_to_user_mode+0x1d/0x40
[177244.575193][T1995658]  do_syscall_64+0x4d/0x90
[177244.575197][T1995658]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[177244.591189][T1995658] RIP: 0033:0x4890ca
[177244.591199][T1995658] Code: Unable to access opcode bytes at RIP 0x4890a0.
[177244.591201][T1995658] RSP: 002b:000000c000508608 EFLAGS: 00000216
ORIG_RAX: 000000000000011d
[177244.607705][T1995658] RAX: 0000000000000000 RBX: 000000c00003e000
RCX: 00000000004890ca
[177244.630001][T1995658] RDX: 0000000036466000 RSI: 0000000000000003
RDI: 0000000000000010
[177244.630003][T1995658] RBP: 000000c000508660 R08: 0000000000000000
R09: 0000000000000000
[177244.630005][T1995658] R10: 000000000676b000 R11: 0000000000000216
R12: 0000000000000009
[177244.630007][T1995658] R13: 000000c00129d8b8 R14: 0000000000000001
R15: 0000000000000000
[177244.630011][T1995658]  </TASK>
[177244.679411][T1995658] ---[ end trace bf693a2532f213e4 ]---
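
The registers in the WARNING above can be cross-checked with a little arithmetic. The sketch below assumes that R13 holds the SPTE being zapped and RBX the struct page pointer (both are inferences from the dump, not something the log states), and that the machine uses the default x86-64 vmemmap base with a 64-byte struct page. Under those assumptions the SPTE's page frame number maps exactly onto RBX, and the faulting code read a field at offset 0x34 from it, which returned 0 in RAX.

```python
# Values copied from the register dump of the first WARNING.
R13_SPTE = 0x0600000F54BA4B77   # assumption: the SPTE being torn down
RBX_PAGE = 0xFFFFEA003D52E900   # assumption: struct page pointer
RDI_FIELD = 0xFFFFEA003D52E934  # address passed along for the field read

# Assumptions about the memory layout (not stated in the report):
VMEMMAP_BASE = 0xFFFFEA0000000000  # default x86-64 vmemmap base
PAGE_SHIFT = 12                    # 4 KiB pages
STRUCT_PAGE_SIZE = 64              # bytes per struct page
PFN_MASK = (1 << 52) - (1 << PAGE_SHIFT)  # physical address bits 51:12


def spte_to_pfn(spte):
    """Extract the page frame number from the address bits of an SPTE."""
    return (spte & PFN_MASK) >> PAGE_SHIFT


def pfn_to_page(pfn):
    """Compute the address of the struct page for a given pfn."""
    return VMEMMAP_BASE + pfn * STRUCT_PAGE_SIZE


pfn = spte_to_pfn(R13_SPTE)
page = pfn_to_page(pfn)
field_offset = RDI_FIELD - RBX_PAGE

print(f"pfn          = {pfn:#x}")
print(f"struct page  = {page:#x}")
print(f"field offset = {field_offset:#x}")
```

If the assumptions hold, this suggests the warning fired while handling a page that the zapped SPTE still pointed at, which fits the teardown path shown in the call trace.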

Then immediately this KASAN warning:

[177244.798046][T1995658] BUG: KASAN: slab-out-of-bounds in
workingset_activation+0x2b2/0x2f0
[177244.809161][T1995658] Read of size 8 at addr ffff8881749ab3b8 by
task exe/1995658
[177244.819636][T1995658]
[177244.824947][T1995658] CPU: 7 PID: 1995658 Comm: exe Tainted: G
   W  O      5.15.5-cloudflare-kasan-2021.11.11 #1
[177244.838672][T1995658] Hardware name: Quanta Cloud Technology Inc.
QuantaPlex T42S-2U/T42S-2U MB (Lewisburg-4), BIOS 3B16.Q102 02/19/2020
[177244.857210][T1995658] Call Trace:
[177244.863733][T1995658]  <TASK>
[177244.869871][T1995658]  dump_stack_lvl+0x34/0x44
[177244.877583][T1995658]  print_address_description.constprop.0+0x1f/0x140
[177244.887430][T1995658]  ? workingset_activation+0x2b2/0x2f0
[177244.896156][T1995658]  ? workingset_activation+0x2b2/0x2f0
[177244.904809][T1995658]  kasan_report.cold+0x83/0xdf
[177244.912778][T1995658]  ? kvm_is_zone_device_pfn.part.0+0x40/0xd0 [kvm]
[177244.922553][T1995658]  ? workingset_activation+0x2b2/0x2f0
[177244.931236][T1995658]  workingset_activation+0x2b2/0x2f0
[177244.939785][T1995658]  mark_page_accessed+0x44a/0x6a0
[177244.948022][T1995658]  __handle_changed_spte+0x64c/0xca0 [kvm]
[177244.957192][T1995658]  ? alloc_tdp_mmu_page+0x370/0x370 [kvm]
[177244.966249][T1995658]  ? kvm_vcpu_release+0x4d/0x70 [kvm]
[177244.974930][T1995658]  ? deref_stack_reg+0xe6/0x160
[177244.983013][T1995658]  __handle_changed_spte+0x63c/0xca0 [kvm]
[177244.992167][T1995658]  ? alloc_tdp_mmu_page+0x370/0x370 [kvm]
[177245.001249][T1995658]  ? update_curr+0x18d/0x5f0
[177245.009268][T1995658]  __handle_changed_spte+0x63c/0xca0 [kvm]
[177245.018454][T1995658]  ? alloc_tdp_mmu_page+0x370/0x370 [kvm]
[177245.027556][T1995658]  zap_gfn_range+0x549/0x620 [kvm]
[177245.036055][T1995658]  ? zap_collapsible_spte_range+0x520/0x520 [kvm]
[177245.045907][T1995658]  ? smp_call_function_single+0x271/0x370
[177245.054992][T1995658]  ? _raw_spin_lock+0x81/0xd0
[177245.063045][T1995658]  ? _raw_spin_lock_bh+0xe0/0xe0
[177245.071332][T1995658]  ? vmx_vcpu_pi_load+0x14c/0x3b0 [kvm_intel]
[177245.080861][T1995658]  kvm_tdp_mmu_put_root+0x1b6/0x270 [kvm]
[177245.090147][T1995658]  mmu_free_root_page+0x219/0x2c0 [kvm]
[177245.099215][T1995658]  ? ept_invlpg+0x780/0x780 [kvm]
[177245.107719][T1995658]  ? _raw_spin_lock+0xd0/0xd0
[177245.115776][T1995658]  ? handle_pause+0x250/0x250 [kvm_intel]
[177245.124919][T1995658]  kvm_mmu_free_roots+0x1b4/0x4e0 [kvm]
[177245.133974][T1995658]  ? mmu_free_root_page+0x2c0/0x2c0 [kvm]
[177245.143143][T1995658]  ?
kvm_clear_async_pf_completion_queue+0x2f/0x510 [kvm]
[177245.153754][T1995658]  kvm_mmu_unload+0x1c/0xa0 [kvm]
[177245.162289][T1995658]  kvm_arch_destroy_vm+0x1f2/0x5c0 [kvm]
[177245.171426][T1995658]  ? mmu_notifier_unregister+0x26f/0x330
[177245.180637][T1995658]  kvm_put_kvm+0x3b1/0x8b0 [kvm]
[177245.189087][T1995658]  kvm_vcpu_release+0x4e/0x70 [kvm]
[177245.197744][T1995658]  __fput+0x1f7/0x8c0
[177245.205096][T1995658]  task_work_run+0xf8/0x1a0
[177245.212916][T1995658]  do_exit+0x97b/0x2230
[177245.220332][T1995658]  ? _raw_write_lock_irqsave+0xe0/0xe0
[177245.229016][T1995658]  ? mm_update_next_owner+0x750/0x750
[177245.237583][T1995658]  ? _raw_spin_lock_irq+0x82/0xd0
[177245.245731][T1995658]  do_group_exit+0xda/0x2a0
[177245.253265][T1995658]  get_signal+0x3be/0x1e50
[177245.260585][T1995658]  arch_do_signal_or_restart+0x244/0x17f0
[177245.269239][T1995658]  ? audit_log_exit+0x2690/0x2690
[177245.277218][T1995658]  ? shmem_evict_inode+0xad0/0xad0
[177245.285295][T1995658]  ? get_sigframe_size+0x10/0x10
[177245.293162][T1995658]  ? __seccomp_filter+0x117/0xd60
[177245.301157][T1995658]  ? audit_alloc_name+0x440/0x440
[177245.309224][T1995658]  ? get_nth_filter.part.0+0x220/0x220
[177245.317678][T1995658]  ? __audit_syscall_exit+0x794/0xa80
[177245.325968][T1995658]  exit_to_user_mode_prepare+0xcb/0x120
[177245.334463][T1995658]  syscall_exit_to_user_mode+0x1d/0x40
[177245.342837][T1995658]  do_syscall_64+0x4d/0x90
[177245.350176][T1995658]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[177245.358991][T1995658] RIP: 0033:0x4890ca
[177245.365778][T1995658] Code: Unable to access opcode bytes at RIP 0x4890a0.
[177245.375596][T1995658] RSP: 002b:000000c000508608 EFLAGS: 00000216
ORIG_RAX: 000000000000011d
[177245.387085][T1995658] RAX: 0000000000000000 RBX: 000000c00003e000
RCX: 00000000004890ca
[177245.387092][T1995658] RDX: 0000000036466000 RSI: 0000000000000003
RDI: 0000000000000010
[177245.387096][T1995658] RBP: 000000c000508660 R08: 0000000000000000
R09: 0000000000000000
[177245.387100][T1995658] R10: 000000000676b000 R11: 0000000000000216
R12: 0000000000000009
[177245.387103][T1995658] R13: 000000c00129d8b8 R14: 0000000000000001
R15: 0000000000000000
[177245.441910][T1995658]  </TASK>
[177245.447826][T1995658]
[177245.447827][T1995658] Allocated by task 182586:
[177245.447830][T1995658]  kasan_save_stack+0x20/0x50
[177245.467811][T1995658]  __kasan_kmalloc+0xa4/0xd0
[177245.475233][T1995658]  fib6_info_alloc+0xa2/0x1d0
[177245.475240][T1995658]  ip6_route_info_create+0x29f/0x1a30
[177245.475243][T1995658]  ip6_route_add+0x18/0x100
[177245.475246][T1995658]  addrconf_add_mroute+0x157/0x1b0
[177245.506016][T1995658]  addrconf_notify+0x6a3/0x1510
[177245.506021][T1995658]  notifier_call_chain+0x9e/0x180
[177245.521427][T1995658]  __dev_notify_flags+0xda/0x230
[177245.529175][T1995658]  rtnl_configure_link+0x125/0x200
[177245.529181][T1995658]  __rtnl_newlink+0xd3d/0x13f0
[177245.529186][T1995658]  rtnl_newlink+0x5f/0x90
[177245.551464][T1995658]  rtnetlink_rcv_msg+0x378/0xa40
[177245.551469][T1995658]  netlink_rcv_skb+0x125/0x380
[177245.551472][T1995658]  netlink_unicast+0x4d0/0x7a0
[177245.573756][T1995658]  netlink_sendmsg+0x724/0xc00
[177245.573760][T1995658]  sock_sendmsg+0xe2/0x110
[177245.573764][T1995658]  __sys_sendto+0x1a8/0x270
[177245.595271][T1995658]  __x64_sys_sendto+0xdd/0x1b0
[177245.602600][T1995658]  do_syscall_64+0x40/0x90
[177245.609544][T1995658]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[177245.609549][T1995658]
[177245.609550][T1995658] The buggy address belongs to the object at
ffff8881749ab200
[177245.609550][T1995658]  which belongs to the cache kmalloc-256 of size 256
[177245.609554][T1995658] The buggy address is located 184 bytes to the right of
[177245.609554][T1995658]  256-byte region [ffff8881749ab200, ffff8881749ab300)
[177245.609558][T1995658] The buggy address belongs to the page:
[177245.609559][T1995658] page:000000009030f8e1 refcount:1 mapcount:0
mapping:0000000000000000 index:0x0 pfn:0x1749a8
[177245.682919][T1995658] head:000000009030f8e1 order:2
compound_mapcount:0 compound_pincount:0
[177245.682924][T1995658] flags:
0x2ffff800010200(slab|head|node=0|zone=2|lastcpupid=0x1ffff)
[177245.705180][T1995658] raw: 002ffff800010200 dead000000000100
dead000000000122 ffff88810004cb40
[177245.716738][T1995658] raw: 0000000000000000 0000000000200020
00000001ffffffff 0000000000000000
[177245.716741][T1995658] page dumped because: kasan: bad access detected
[177245.716743][T1995658]
[177245.716744][T1995658] Memory state around the buggy address:
[177245.716747][T1995658]  ffff8881749ab280: 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00
[177245.716749][T1995658]  ffff8881749ab300: fc fc fc fc fc fc fc fc
fc fc fc fc fc fc fc fc
[177245.716752][T1995658] >ffff8881749ab380: fc fc fc fc fc fc fc fc
fc fc fc fc fc fc fc fc
[177245.785224][T1995658]                                         ^
[177245.785229][T1995658]  ffff8881749ab400: 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00
[177245.785231][T1995658]  ffff8881749ab480: 00 00 00 00 00 00 00 00
00 00 00 00 00 00 fc fc
[177245.816146][T1995658]
==================================================================
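
As a sanity check on the KASAN report above, the offset it prints can be recomputed from the addresses in the log. This is only a small arithmetic sketch; every constant is copied from the report and the variable names are illustrative.

```python
# Addresses and sizes copied from the KASAN slab-out-of-bounds report.
buggy_addr = 0xFFFF8881749AB3B8   # "Read of size 8 at addr ..."
obj_start = 0xFFFF8881749AB200    # "belongs to the object at ..."
obj_size = 256                    # kmalloc-256 cache

obj_end = obj_start + obj_size            # first byte past the object
bytes_past_end = buggy_addr - obj_end     # distance "to the right"

print(f"object region  = [{obj_start:#x}, {obj_end:#x})")
print(f"bytes past end = {bytes_past_end}")
```

The result agrees with the "184 bytes to the right" line in the report. Note that the allocation stack for this object belongs to IPv6 routing code (fib6_info_alloc), not to KVM.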

And after that:

[177245.816196][T1995658] general protection fault, probably for
non-canonical address 0xdffffc0000000011: 0000 [#1] SMP KASAN PTI
[177245.854054][T1995658] KASAN: null-ptr-deref in range
[0x0000000000000088-0x000000000000008f]
[177245.865836][T1995658] CPU: 7 PID: 1995658 Comm: exe Tainted: G
B   W  O      5.15.5-cloudflare-kasan-2021.11.11 #1
[177245.865842][T1995658] Hardware name: Quanta Cloud Technology Inc.
QuantaPlex T42S-2U/T42S-2U MB (Lewisburg-4), BIOS 3B16.Q102 02/19/2020
[177245.865845][T1995658] RIP: 0010:workingset_activation+0x175/0x2f0
[177245.909096][T1995658] Code: 80 3c 02 00 0f 85 58 01 00 00 4a 8b ac
ed b8 0f 00 00 48 8d bd 88 00 00 00 48 b8 00 00 00 00 00 fc ff df 48
89 fa 48 c1 ea 03 <80> 3c 02 00 0f 85 0d 01 00 00 4c 3b a5 88 00 00 00
0f 85 c3 00 00
[177245.936452][T1995658] RSP: 0018:ffff8881ec51f3c0 EFLAGS: 00010206
[177245.936457][T1995658] RAX: dffffc0000000000 RBX: ffffea003237e480
RCX: ffffffffa25388a6
[177245.936460][T1995658] RDX: 0000000000000011 RSI: 0000000000000246
RDI: 0000000000000088
[177245.970834][T1995658] RBP: 0000000000000000 R08: 0000000000000001
R09: ffffffffa6ced907
[177245.970837][T1995658] R10: fffffbfff4d9db20 R11: 0000000000000010
R12: ffff88983ffde000
[177245.970839][T1995658] R13: 0000000000000000 R14: ffff8897a9bbbde0
R15: 0600000c8df93b77
[177246.007059][T1995658] FS:  0000000000000000(0000)
GS:ffff8897a9b80000(0000) knlGS:0000000000000000
[177246.020326][T1995658] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[177246.020330][T1995658] CR2: 000000c001503000 CR3: 0000001a1fb38006
CR4: 00000000007726e0
[177246.020332][T1995658] DR0: 0000000000000000 DR1: 0000000000000000
DR2: 0000000000000000
[177246.020334][T1995658] DR3: 0000000000000000 DR6: 00000000fffe0ff0
DR7: 0000000000000400
[177246.020337][T1995658] PKRU: 00000000
[177246.020338][T1995658] Call Trace:
[177246.020341][T1995658]  <TASK>
[177246.090639][T1995658]  mark_page_accessed+0x44a/0x6a0
[177246.099947][T1995658]  __handle_changed_spte+0x64c/0xca0 [kvm]
[177246.110177][T1995658]  ? alloc_tdp_mmu_page+0x370/0x370 [kvm]
[177246.110274][T1995658]  ? kvm_vcpu_release+0x4d/0x70 [kvm]
[177246.129880][T1995658]  ? deref_stack_reg+0xe6/0x160
[177246.138925][T1995658]  __handle_changed_spte+0x63c/0xca0 [kvm]
[177246.139020][T1995658]  ? alloc_tdp_mmu_page+0x370/0x370 [kvm]
[177246.158926][T1995658]  ? update_curr+0x18d/0x5f0
[177246.167666][T1995658]  __handle_changed_spte+0x63c/0xca0 [kvm]
[177246.167762][T1995658]  ? alloc_tdp_mmu_page+0x370/0x370 [kvm]
[177246.187453][T1995658]  zap_gfn_range+0x549/0x620 [kvm]
[177246.196673][T1995658]  ? zap_collapsible_spte_range+0x520/0x520 [kvm]
[177246.196781][T1995658]  ? smp_call_function_single+0x271/0x370
[177246.196789][T1995658]  ? _raw_spin_lock+0x81/0xd0
[177246.196795][T1995658]  ? _raw_spin_lock_bh+0xe0/0xe0
[177246.234516][T1995658]  ? vmx_vcpu_pi_load+0x14c/0x3b0 [kvm_intel]
[177246.244565][T1995658]  kvm_tdp_mmu_put_root+0x1b6/0x270 [kvm]
[177246.244680][T1995658]  mmu_free_root_page+0x219/0x2c0 [kvm]
[177246.263832][T1995658]  ? ept_invlpg+0x780/0x780 [kvm]
[177246.263923][T1995658]  ? _raw_spin_lock+0xd0/0xd0
[177246.281421][T1995658]  ? handle_pause+0x250/0x250 [kvm_intel]
[177246.291062][T1995658]  kvm_mmu_free_roots+0x1b4/0x4e0 [kvm]
[177246.291156][T1995658]  ? mmu_free_root_page+0x2c0/0x2c0 [kvm]
[177246.310225][T1995658]  ?
kvm_clear_async_pf_completion_queue+0x2f/0x510 [kvm]
[177246.310297][T1995658]  kvm_mmu_unload+0x1c/0xa0 [kvm]
[177246.330051][T1995658]  kvm_arch_destroy_vm+0x1f2/0x5c0 [kvm]
[177246.339422][T1995658]  ? mmu_notifier_unregister+0x26f/0x330
[177246.339434][T1995658]  kvm_put_kvm+0x3b1/0x8b0 [kvm]
[177246.357081][T1995658]  kvm_vcpu_release+0x4e/0x70 [kvm]
[177246.365745][T1995658]  __fput+0x1f7/0x8c0
[177246.365751][T1995658]  task_work_run+0xf8/0x1a0
[177246.380735][T1995658]  do_exit+0x97b/0x2230
[177246.388056][T1995658]  ? _raw_write_lock_irqsave+0xe0/0xe0
[177246.388062][T1995658]  ? mm_update_next_owner+0x750/0x750
[177246.388067][T1995658]  ? _raw_spin_lock_irq+0x82/0xd0
[177246.413105][T1995658]  do_group_exit+0xda/0x2a0
[177246.420604][T1995658]  get_signal+0x3be/0x1e50
[177246.427936][T1995658]  arch_do_signal_or_restart+0x244/0x17f0
[177246.427943][T1995658]  ? audit_log_exit+0x2690/0x2690
[177246.444593][T1995658]  ? shmem_evict_inode+0xad0/0xad0
[177246.444601][T1995658]  ? get_sigframe_size+0x10/0x10
[177246.460403][T1995658]  ? __seccomp_filter+0x117/0xd60
[177246.468231][T1995658]  ? audit_alloc_name+0x440/0x440
[177246.468236][T1995658]  ? get_nth_filter.part.0+0x220/0x220
[177246.468239][T1995658]  ? __audit_syscall_exit+0x794/0xa80
[177246.468244][T1995658]  exit_to_user_mode_prepare+0xcb/0x120
[177246.486130][T2343516] BUG: Bad page state in process dnsdist  pfn:3ac443
[177246.492473][T1995658]  syscall_exit_to_user_mode+0x1d/0x40
[177246.500835][T2343516] page:000000000df8ed4b refcount:0 mapcount:0
mapping:0000000000000000 index:0x1 pfn:0x3ac443
[177246.510416][T1995658]  do_syscall_64+0x4d/0x90
[177246.518772][T2343516] flags:
0x2ffff80000000a(referenced|dirty|node=0|zone=2|lastcpupid=0x1ffff)
[177246.532071][T1995658]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[177246.539574][T2343516] raw: 002ffff80000000a dead000000000100
dead000000000122 0000000000000000
[177246.551516][T1995658] RIP: 0033:0x4890ca
[177246.560640][T2343516] raw: 0000000000000001 0000000000000000
00000000ffffffff 0000000000000000
[177246.572553][T1995658] Code: Unable to access opcode bytes at RIP 0x4890a0.
[177246.572556][T1995658] RSP: 002b:000000c000508608 EFLAGS: 00000216
[177246.579708][T2343516] page dumped because:
PAGE_FLAGS_CHECK_AT_PREP flag(s) set
[177246.591667][T1995658]  ORIG_RAX: 000000000000011d
[177246.601947][T2343516] Modules linked in: xt_hashlimit
[177246.611442][T1995658] RAX: 0000000000000000 RBX: 000000c00003e000
RCX: 00000000004890ca
[177246.622186][T2343516]  xt_connlimit nf_conncount
[177246.630505][T1995658] RDX: 0000000036466000 RSI: 0000000000000003
RDI: 0000000000000010
[177246.639127][T2343516]  ip_set_hash_netport xt_length
[177246.650831][T1995658] RBP: 000000c000508660 R08: 0000000000000000
R09: 0000000000000000
[177246.659146][T2343516]  esp4 sit
[177246.670801][T1995658] R10: 000000000676b000 R11: 0000000000000216
R12: 0000000000000009
[177246.679548][T2343516]  ipip tunnel4
[177246.691327][T1995658] R13: 000000c00129d8b8 R14: 0000000000000001
R15: 0000000000000000
[177246.691332][T1995658]  </TASK>
[177246.698177][T2343516]  nft_numgen
[177246.710054][T1995658] Modules linked in: xt_hashlimit
[177246.717376][T2343516]  nft_ct ip_gre
[177246.729337][T1995658]  xt_connlimit nf_conncount
[177246.736385][T2343516]  gre xfrm_user
[177246.743566][T1995658]  ip_set_hash_netport xt_length
[177246.752511][T2343516]  xfrm_algo tcp_diag
[177246.759923][T1995658]  esp4 sit
[177246.768447][T2343516]  udp_diag inet_diag
[177246.775861][T1995658]  ipip tunnel4
[177246.784702][T2343516]  fou6 fou
[177246.792590][T1995658]  nft_numgen nft_ct
[177246.799633][T2343516]  ip6_tunnel tunnel6
[177246.807521][T1995658]  ip_gre gre
[177246.814891][T2343516]  ip_tunnel
[177246.821906][T1995658]  xfrm_user xfrm_algo
[177246.829831][T2343516]  ip6_udp_tunnel udp_tunnel
[177246.837687][T1995658]  tcp_diag udp_diag
[177246.844899][T2343516]  cls_bpf tls
[177246.851988][T1995658]  inet_diag fou6
[177246.860054][T2343516]  xt_NFLOG
[177246.868476][T1995658]  fou ip6_tunnel
[177246.876205][T2343516]  xt_statistic
[177246.883435][T1995658]  tunnel6 ip_tunnel
[177246.890967][T2343516]  nft_compat
[177246.897928][T1995658]  ip6_udp_tunnel udp_tunnel
[177246.905366][T2343516]  veth
[177246.912707][T1995658]  cls_bpf tls
[177246.920424][T2343516]  tun overlay
[177246.927451][T1995658]  xt_NFLOG xt_statistic
[177246.935692][T2343516]  macvlan
[177246.941990][T1995658]  nft_compat veth
[177246.948848][T2343516]  sch_ingress raid0
[177246.955617][T1995658]  tun overlay
[177246.963226][T2343516]  md_mod essiv
[177246.969540][T1995658]  macvlan sch_ingress
[177246.976443][T2343516]  dm_crypt trusted
[177246.983523][T1995658]  raid0 md_mod
[177246.990026][T2343516]  asn1_encoder tee
[177246.996500][T1995658]  essiv dm_crypt
[177247.003663][T2343516]  dm_mod dax
[177247.010435][T1995658]  trusted asn1_encoder
[177247.016794][T2343516]  nfnetlink_log nft_log
[177247.023445][T1995658]  tee dm_mod
[177247.029892][T2343516]  nft_limit
[177247.035861][T1995658]  dax nfnetlink_log
[177247.042663][T2343516]  nft_counter nf_tables
[177247.049640][T1995658]  nft_log
[177247.055470][T2343516]  ip6table_nat
[177247.061129][T1995658]  nft_limit nft_counter
[177247.067401][T2343516]  ip6table_mangle
[177247.073987][T1995658]  nf_tables ip6table_nat
[177247.079498][T2343516]  ip6table_security ip6table_raw
[177247.085277][T1995658]  ip6table_mangle
[177247.091991][T2343516]  ip6table_filter ip6_tables
[177247.098034][T1995658]  ip6table_security ip6table_raw
[177247.104695][T2343516]  xt_nat iptable_nat
[177247.112034][T1995658]  ip6table_filter ip6_tables
[177247.118039][T2343516]  nf_nat xt_TCPMSS
[177247.125004][T1995658]  xt_nat iptable_nat
[177247.132376][T2343516]  xt_u32
[177247.138705][T1995658]  nf_nat xt_TCPMSS xt_u32
[177247.145715][T2343516]  xt_connmark
[177247.151843][T1995658]  xt_connmark iptable_mangle xt_owner
[177247.158146][T2343516]  iptable_mangle
[177247.163422][T1995658]  xt_CT iptable_raw
[177247.170185][T2343516]  xt_owner xt_CT
[177247.175921][T1995658]  xt_state xt_bpf
[177247.183751][T2343516]  iptable_raw xt_state
[177247.189765][T1995658]  xt_mark xt_conntrack
[177247.196014][T2343516]  xt_bpf xt_mark
[177247.202024][T1995658]  xt_multiport xt_comment xt_tcpudp
[177247.208091][T2343516]  xt_conntrack
[177247.214620][T1995658]  xt_set xt_tcpmss iptable_filter
[177247.221195][T2343516]  xt_multiport
[177247.227181][T1995658]  ip_set_hash_net
[177247.234862][T2343516]  xt_comment xt_tcpudp
[177247.240694][T1995658]  ip_set_hash_ip ip_set
[177247.248244][T2343516]  xt_set xt_tcpmss
[177247.254108][T1995658]  nfnetlink sch_fq
[177247.260257][T2343516]  iptable_filter ip_set_hash_net
[177247.266823][T1995658]  tcp_bbr nf_conntrack
[177247.273564][T2343516]  ip_set_hash_ip ip_set
[177247.279798][T1995658]  nf_defrag_ipv6 nf_defrag_ipv4
[177247.286054][T2343516]  nfnetlink sch_fq
[177247.293537][T1995658]  8021q garp
[177247.300146][T2343516]  tcp_bbr nf_conntrack
[177247.306851][T1995658]  stp mrp
[177247.314280][T2343516]  nf_defrag_ipv6
[177247.320726][T1995658]  llc skx_edac
[177247.326501][T2343516]  nf_defrag_ipv4
[177247.333200][T1995658]  x86_pkg_temp_thermal
[177247.338746][T2343516]  8021q garp
[177247.344900][T1995658]  kvm_intel kvm
[177247.350878][T2343516]  stp mrp
[177247.356981][T1995658]  irqbypass crc32_pclmul
[177247.363642][T2343516]  llc skx_edac
[177247.369407][T1995658]  crc32c_intel aesni_intel
[177247.375433][T2343516]  x86_pkg_temp_thermal kvm_intel
[177247.380949][T1995658]  rapl intel_cstate
[177247.387818][T2343516]  kvm irqbypass
[177247.393792][T1995658]  ipmi_ssif sfc
[177247.400823][T2343516]  crc32_pclmul crc32c_intel
[177247.408402][T1995658]  intel_uncore
[177247.414816][T2343516]  aesni_intel rapl
[177247.420895][T1995658]  i2c_i801 xhci_pci
[177247.426998][T2343516]  intel_cstate
[177247.434174][T1995658]  i2c_smbus acpi_ipmi
[177247.440133][T2343516]  ipmi_ssif
[177247.446447][T1995658]  i40e mdio
[177247.452828][T2343516]  sfc intel_uncore
[177247.458764][T1995658]  ioatdma i2c_core xhci_hcd tpm_crb
[177247.465315][T2343516]  i2c_i801
[177247.470986][T1995658]  dca ipmi_si
[177247.476630][T2343516]  xhci_pci i2c_smbus
[177247.482924][T1995658]  ipmi_devintf ipmi_msghandler
[177247.490884][T2343516]  acpi_ipmi
[177247.496448][T1995658]  tpm_tis tpm_tis_core
[177247.502266][T2343516]  i40e
[177247.508870][T1995658]  tpm tiny_power_button
[177247.516227][T2343516]  mdio
[177247.521960][T1995658]  button fuse
[177247.528752][T2343516]  ioatdma i2c_core
[177247.534017][T1995658]  efivarfs
[177247.540736][T2343516]  xhci_hcd tpm_crb
[177247.545945][T1995658]  ip_tables x_tables
[177247.551808][T2343516]  dca ipmi_si
[177247.558097][T1995658]  bcmcrypt(O)
[177247.563798][T2343516]  ipmi_devintf ipmi_msghandler
[177247.570085][T1995658]  crypto_simd
[177247.576506][T2343516]  tpm_tis tpm_tis_core
[177247.582311][T1995658]  cryptd [last unloaded: kheaders]
[177247.582349][T1995658] ---[ end trace bf693a2532f213e5 ]---
[177247.588090][T2343516]  tpm tiny_power_button button fuse efivarfs ip_tables
[177247.671277][T1995658] RIP: 0010:workingset_activation+0x175/0x2f0
[177247.673366][T2343516]  x_tables bcmcrypt(O) crypto_simd
[177247.682978][T1995658] Code: 80 3c 02 00 0f 85 58 01 00 00 4a 8b ac
ed b8 0f 00 00 48 8d bd 88 00 00 00 48 b8 00 00 00 00 00 fc ff df 48
89 fa 48 c1 ea 03 <80> 3c 02 00 0f 85 0d 01 00 00 4c 3b a5 88 00 00 00
0f 85 c3 00 00
[177247.691625][T2343516]  cryptd [last unloaded: kheaders]
[177247.691633][T2343516] CPU: 30 PID: 2343516 Comm: dnsdist Tainted:
G    B D W  O      5.15.5-cloudflare-kasan-2021.11.11 #1
[177247.691639][T2343516] Hardware name: Quanta Cloud Technology Inc.
QuantaPlex T42S-2U/T42S-2U MB (Lewisburg-4), BIOS 3B16.Q102 02/19/2020
[177247.691642][T2343516] Call Trace:
[177247.691645][T2343516]  <TASK>
[177247.699575][T1995658] RSP: 0018:ffff8881ec51f3c0 EFLAGS: 00010206
[177247.724584][T2343516]  dump_stack_lvl+0x34/0x44
[177247.724595][T2343516]  bad_page.cold+0xc0/0xe1
[177247.732606][T1995658]
[177247.746616][T2343516]  rmqueue_bulk+0x8e5/0xe00
[177247.746627][T2343516]  ? find_suitable_fallback+0x470/0x470
[177247.764885][T1995658] RAX: dffffc0000000000 RBX: ffffea003237e480
RCX: ffffffffa25388a6
[177247.771268][T2343516]  get_page_from_freelist+0x18ff/0x2920
[177247.771279][T2343516]  ? __zone_watermark_ok+0x340/0x340
[177247.777285][T1995658] RDX: 0000000000000011 RSI: 0000000000000246
RDI: 0000000000000088
[177247.786537][T2343516]  __alloc_pages+0x2ac/0x5b0
[177247.786543][T2343516]  ? __alloc_pages_slowpath.constprop.0+0x1e20/0x1e20
[177247.786548][T2343516]  alloc_pages_vma+0xbc/0x570
[177247.794244][T1995658] RBP: 0000000000000000 R08: 0000000000000001
R09: ffffffffa6ced907
[177247.801816][T2343516]  __handle_mm_fault+0x14d6/0x3a00
[177247.801822][T2343516]  ? vm_iomap_memory+0x1d0/0x1d0
[177247.801827][T2343516]  ? down_read_trylock+0xeb/0x180
[177247.807247][T1995658] R10: fffffbfff4d9db20 R11: 0000000000000010
R12: ffff88983ffde000
[177247.814873][T2343516]  handle_mm_fault+0x1cc/0x650
[177247.814878][T2343516]  do_user_addr_fault+0x303/0xd40
[177247.814885][T2343516]  exc_page_fault+0x52/0xb0
[177247.823577][T1995658] R13: 0000000000000000 R14: ffff8897a9bbbde0
R15: 0600000c8df93b77
[177247.834904][T2343516]  ? asm_exc_page_fault+0x8/0x30
[177247.834911][T2343516]  asm_exc_page_fault+0x1e/0x30
[177247.834915][T2343516] RIP: 0033:0x557a8ea64f19
[177247.843738][T1995658] FS:  0000000000000000(0000)
GS:ffff8897a9b80000(0000) knlGS:0000000000000000
[177247.852290][T2343516] Code: c7 80 04 ff ff ff 00 00 00 00 c7 80 18
ff ff ff 00 00 00 00 48 c7 80 20 ff ff ff 00 00 00 00 48 c7 80 28 ff
ff ff 00 00 00 00 <c6> 80 30 ff ff ff 01 c6 80 38 ff ff ff 01 c6 80 39
ff ff ff 00 48
[177247.852295][T2343516] RSP: 002b:00007fff6f3f11c0 EFLAGS: 00010246
[177247.852299][T2343516] RAX: 0000557a9840d0d0 RBX: 0000557a92334cc0
RCX: 0000000000000000
[177247.852301][T2343516] RDX: ffffffffffffffff RSI: 0000000000000000
RDI: 0000000000000000
[177247.852304][T2343516] RBP: 00000000000099d4 R08: 0000000000000000
R09: 0000000000000000
[177247.863709][T1995658] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[177247.871673][T2343516] R10: 0000000000000000 R11: 0000000000000000
R12: 0000557a991452f0
[177247.871676][T2343516] R13: 00007fff6f3f1420 R14: 00007fff6f3f1440
R15: 0000000000000001
[177247.871680][T2343516]  </TASK>
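
The "non-canonical address 0xdffffc0000000011" in the general protection fault above can be related to the "null-ptr-deref in range" line with the usual KASAN shadow mapping. The sketch below assumes the default x86-64 shadow offset and a scale shift of 3 (one shadow byte per 8 bytes of memory); the function names are illustrative.

```python
# Assumptions: default x86-64 generic KASAN parameters.
KASAN_SHADOW_OFFSET = 0xDFFFFC0000000000
KASAN_SHADOW_SCALE_SHIFT = 3  # one shadow byte covers 8 bytes

MASK64 = (1 << 64) - 1


def mem_to_shadow(addr):
    """Shadow byte address that describes the 8-byte granule at addr."""
    return ((addr >> KASAN_SHADOW_SCALE_SHIFT) + KASAN_SHADOW_OFFSET) & MASK64


def shadow_to_mem_range(shadow_addr):
    """Inclusive range of memory addresses covered by one shadow byte."""
    start = ((shadow_addr - KASAN_SHADOW_OFFSET) << KASAN_SHADOW_SCALE_SHIFT) & MASK64
    return start, start + (1 << KASAN_SHADOW_SCALE_SHIFT) - 1


fault_addr = 0xDFFFFC0000000011   # from the GPF line
rdi = 0x88                        # RDI in the register dump

start, end = shadow_to_mem_range(fault_addr)
print(f"range  = [{start:#x}, {end:#x}]")
print(f"shadow = {mem_to_shadow(rdi):#x}")
```

This matches the register dump: RAX holds the shadow offset, RDX holds 0x11, and RDI holds 0x88 while RBP is 0, so the fault looks like the KASAN check for an access at offset 0x88 from a NULL pointer.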

After this, the machine starts emitting further traces, each beginning with:

[177247.871683][T2343516] BUG: Bad page state in process <some comm
name>  pfn:fe680a

And eventually the machine locks up:

NMI watchdog: Watchdog detected hard LOCKUP on cpu 81

The comment in kvm_main.c above the check that triggers the first
warning says the warning exists to flag incorrect usage of the
function, and given the consequences, that is probably what happened here.

About our workload: this bug is most likely triggered by gvisor [1]
with the KVM backend as we don't have any other KVM users on these
systems.

We suspect it was not triggered before because kernels before 5.15 did
not have the TDP MMU enabled by default [2].

There even seems to be a plan to remove this warning as overaggressive [3];
in this case, however, it appears to indicate a real problem.

Unfortunately, I couldn't easily reproduce the issue synthetically
(I tried both the KVM selftests and the gvisor KVM tests).
Any help or pointers would be appreciated.

[1]: https://github.com/google/gvisor
[2]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=71ba3f3189c78f756a659568fb473600fd78f207
[3]: https://lore.kernel.org/kvm/20211129034317.2964790-5-stevensd@google.com/

* Re: Potential bug in TDP MMU
  2021-11-29 21:44 Potential bug in TDP MMU Ignat Korchagin
@ 2021-11-30  9:29 ` Paolo Bonzini
  2021-11-30 10:58   ` Ignat Korchagin
  0 siblings, 1 reply; 20+ messages in thread
From: Paolo Bonzini @ 2021-11-30  9:29 UTC (permalink / raw)
  To: Ignat Korchagin, kvm; +Cc: stevensd, kernel-team

On 11/29/21 22:44, Ignat Korchagin wrote:
> Hello,
> 
> We have recently started to evaluate 5.15.y kernel series and here is
> what we occasionally get on kernel 5.15.5:

I'm not sure if it's this, but there are quite a few fixes I've queued 
for 5.16-rc4, and I'll be sending a pull request to Linus shortly.  So 
we can revisit this in a week.

This is the most likely fix:

https://patchwork.kernel.org/project/kvm/patch/20211120045046.3940942-2-seanjc@google.com/

Paolo

> [177243.621744][T1995658] WARNING: CPU: 7 PID: 1995658 at
> arch/x86/kvm/../../../virt/kvm/kvm_main.c:171
> kvm_is_zone_device_pfn.part.0+0x9e/0xd0 [kvm]
> [177243.647435][T1995658] Modules linked in: xt_hashlimit xt_connlimit
> nf_conncount ip_set_hash_netport xt_length esp4 sit ipip tunnel4
> nft_numgen nft_ct ip_gre gre xfrm_user xfrm_algo tcp_diag udp_diag
> inet_diag fou6 fou ip6_tunnel tunnel6 ip_tunnel ip6_udp_tunnel
> udp_tunnel cls_bpf tls xt_NFLOG xt_statistic nft_compat veth tun
> overlay macvlan sch_ingress raid0 md_mod essiv dm_crypt trusted
> asn1_encoder tee dm_mod dax nfnetlink_log nft_log nft_limit
> nft_counter nf_tables ip6table_nat ip6table_mangle ip6table_security
> ip6table_raw ip6table_filter ip6_tables xt_nat iptable_nat nf_nat
> xt_TCPMSS xt_u32 xt_connmark iptable_mangle xt_owner xt_CT iptable_raw
> xt_state xt_bpf xt_mark xt_conntrack xt_multiport xt_comment xt_tcpudp
> xt_set xt_tcpmss iptable_filter ip_set_hash_net ip_set_hash_ip ip_set
> nfnetlink sch_fq tcp_bbr nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4
> 8021q garp stp mrp llc skx_edac x86_pkg_temp_thermal kvm_intel kvm
> irqbypass crc32_pclmul crc32c_intel aesni_intel rapl
> [177243.647594][T1995658]  intel_cstate ipmi_ssif sfc intel_uncore
> i2c_i801 xhci_pci i2c_smbus acpi_ipmi i40e mdio ioatdma i2c_core
> xhci_hcd tpm_crb dca ipmi_si ipmi_devintf ipmi_msghandler tpm_tis
> tpm_tis_core tpm tiny_power_button button fuse efivarfs ip_tables
> x_tables bcmcrypt(O) crypto_simd cryptd [last unloaded: kheaders]
> [177243.831600][T1995658] CPU: 7 PID: 1995658 Comm: exe Tainted: G
>        O      5.15.5-cloudflare-kasan-2021.11.11 #1
> [177243.831609][T1995658] Hardware name: Quanta Cloud Technology Inc.
> QuantaPlex T42S-2U/T42S-2U MB (Lewisburg-4), BIOS 3B16.Q102 02/19/2020
> [177243.831612][T1995658] RIP:
> 0010:kvm_is_zone_device_pfn.part.0+0x9e/0xd0 [kvm]
> [177243.886990][T1995658] Code: 00 00 00 00 fc ff df 48 c1 ea 03 0f b6
> 14 02 48 89 e8 83 e0 07 83 c0 03 38 d0 7c 04 84 d2 75 0f 8b 43 34 85
> c0 74 03 5b 5d c3 <0f> 0b 5b 5d c3 48 89 ef e8 c5 68 72 e1 eb e7 e8 ce
> 68 72 e1 eb 9b
> [177243.919270][T1995658] RSP: 0018:ffff8881ec51f300 EFLAGS: 00010246
> [177243.919276][T1995658] RAX: 0000000000000000 RBX: ffffea003d52e900
> RCX: ffffffffc143242e
> [177243.919279][T1995658] RDX: 0000000000000000 RSI: 0000000000000004
> RDI: ffffea003d52e934
> [177243.960080][T1995658] RBP: ffffea003d52e934 R08: 0000000000000000
> R09: ffffea003d52e937
> [177243.960083][T1995658] R10: fffff94007aa5d26 R11: 0000000000000000
> R12: ffff88b03ffd9008
> [177243.960085][T1995658] R13: 0600000f54ba4b77 R14: 0000000000000001
> R15: 0600000f54ba4b01
> [177243.960088][T1995658] FS:  0000000000000000(0000)
> GS:ffff8897a9b80000(0000) knlGS:0000000000000000
> [177244.017687][T1995658] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [177244.017691][T1995658] CR2: 000000c001503000 CR3: 0000001a1fb38006
> CR4: 00000000007726e0
> [177244.017693][T1995658] DR0: 0000000000000000 DR1: 0000000000000000
> DR2: 0000000000000000
> [177244.058771][T1995658] DR3: 0000000000000000 DR6: 00000000fffe0ff0
> DR7: 0000000000000400
> [177244.058774][T1995658] PKRU: 00000000
> [177244.058776][T1995658] Call Trace:
> [177244.058780][T1995658]  <TASK>
> [177244.058782][T1995658]  kvm_set_pfn_dirty+0x120/0x1d0 [kvm]
> [177244.111155][T1995658]  __handle_changed_spte+0x92e/0xca0 [kvm]
> [177244.111274][T1995658]  ? alloc_tdp_mmu_page+0x370/0x370 [kvm]
> [177244.133938][T1995658]  ? sched_clock_cpu+0x15/0x190
> [177244.144289][T1995658]  ? _raw_spin_lock+0xc8/0xd0
> [177244.144299][T1995658]  __handle_changed_spte+0x63c/0xca0 [kvm]
> [177244.165796][T1995658]  ? alloc_tdp_mmu_page+0x370/0x370 [kvm]
> [177244.165906][T1995658]  ? kvm_vcpu_release+0x4d/0x70 [kvm]
> [177244.187769][T1995658]  ? deref_stack_reg+0xe6/0x160
> [177244.187779][T1995658]  __handle_changed_spte+0x63c/0xca0 [kvm]
> [177244.209044][T1995658]  ? alloc_tdp_mmu_page+0x370/0x370 [kvm]
> [177244.220140][T1995658]  ? update_curr+0x18d/0x5f0
> [177244.220148][T1995658]  __handle_changed_spte+0x63c/0xca0 [kvm]
> [177244.240784][T1995658]  ? alloc_tdp_mmu_page+0x370/0x370 [kvm]
> [177244.251827][T1995658]  zap_gfn_range+0x549/0x620 [kvm]
> [177244.251922][T1995658]  ? zap_collapsible_spte_range+0x520/0x520 [kvm]
> [177244.273410][T1995658]  ? smp_call_function_single+0x271/0x370
> [177244.283892][T1995658]  ? _raw_spin_lock+0x81/0xd0
> [177244.283900][T1995658]  ? _raw_spin_lock_bh+0xe0/0xe0
> [177244.283904][T1995658]  ? vmx_vcpu_pi_load+0x14c/0x3b0 [kvm_intel]
> [177244.313033][T1995658]  kvm_tdp_mmu_put_root+0x1b6/0x270 [kvm]
> [177244.323097][T1995658]  mmu_free_root_page+0x219/0x2c0 [kvm]
> [177244.332862][T1995658]  ? ept_invlpg+0x780/0x780 [kvm]
> [177244.341982][T1995658]  ? _raw_spin_lock+0xd0/0xd0
> [177244.350556][T1995658]  ? handle_pause+0x250/0x250 [kvm_intel]
> [177244.360080][T1995658]  kvm_mmu_free_roots+0x1b4/0x4e0 [kvm]
> [177244.369396][T1995658]  ? mmu_free_root_page+0x2c0/0x2c0 [kvm]
> [177244.378777][T1995658]  ?
> kvm_clear_async_pf_completion_queue+0x2f/0x510 [kvm]
> [177244.389571][T1995658]  kvm_mmu_unload+0x1c/0xa0 [kvm]
> [177244.398156][T1995658]  kvm_arch_destroy_vm+0x1f2/0x5c0 [kvm]
> [177244.398238][T1995658]  ? mmu_notifier_unregister+0x26f/0x330
> [177244.416160][T1995658]  kvm_put_kvm+0x3b1/0x8b0 [kvm]
> [177244.424372][T1995658]  kvm_vcpu_release+0x4e/0x70 [kvm]
> [177244.432743][T1995658]  __fput+0x1f7/0x8c0
> [177244.432749][T1995658]  task_work_run+0xf8/0x1a0
> [177244.447257][T1995658]  do_exit+0x97b/0x2230
> [177244.447263][T1995658]  ? _raw_write_lock_irqsave+0xe0/0xe0
> [177244.462773][T1995658]  ? mm_update_next_owner+0x750/0x750
> [177244.471117][T1995658]  ? _raw_spin_lock_irq+0x82/0xd0
> [177244.471122][T1995658]  do_group_exit+0xda/0x2a0
> [177244.471126][T1995658]  get_signal+0x3be/0x1e50
> [177244.471133][T1995658]  arch_do_signal_or_restart+0x244/0x17f0
> [177244.502424][T1995658]  ? audit_log_exit+0x2690/0x2690
> [177244.502432][T1995658]  ? shmem_evict_inode+0xad0/0xad0
> [177244.518403][T1995658]  ? get_sigframe_size+0x10/0x10
> [177244.526157][T1995658]  ? __seccomp_filter+0x117/0xd60
> [177244.526162][T1995658]  ? audit_alloc_name+0x440/0x440
> [177244.526166][T1995658]  ? get_nth_filter.part.0+0x220/0x220
> [177244.526170][T1995658]  ? __audit_syscall_exit+0x794/0xa80
> [177244.558471][T1995658]  exit_to_user_mode_prepare+0xcb/0x120
> [177244.558478][T1995658]  syscall_exit_to_user_mode+0x1d/0x40
> [177244.575193][T1995658]  do_syscall_64+0x4d/0x90
> [177244.575197][T1995658]  entry_SYSCALL_64_after_hwframe+0x44/0xae
> [177244.591189][T1995658] RIP: 0033:0x4890ca
> [177244.591199][T1995658] Code: Unable to access opcode bytes at RIP 0x4890a0.
> [177244.591201][T1995658] RSP: 002b:000000c000508608 EFLAGS: 00000216
> ORIG_RAX: 000000000000011d
> [177244.607705][T1995658] RAX: 0000000000000000 RBX: 000000c00003e000
> RCX: 00000000004890ca
> [177244.630001][T1995658] RDX: 0000000036466000 RSI: 0000000000000003
> RDI: 0000000000000010
> [177244.630003][T1995658] RBP: 000000c000508660 R08: 0000000000000000
> R09: 0000000000000000
> [177244.630005][T1995658] R10: 000000000676b000 R11: 0000000000000216
> R12: 0000000000000009
> [177244.630007][T1995658] R13: 000000c00129d8b8 R14: 0000000000000001
> R15: 0000000000000000
> [177244.630011][T1995658]  </TASK>
> [177244.679411][T1995658] ---[ end trace bf693a2532f213e4 ]---
> 
> Then immediately this KASAN warning:
> 
> [177244.798046][T1995658] BUG: KASAN: slab-out-of-bounds in
> workingset_activation+0x2b2/0x2f0
> [177244.809161][T1995658] Read of size 8 at addr ffff8881749ab3b8 by
> task exe/1995658
> [177244.819636][T1995658]
> [177244.824947][T1995658] CPU: 7 PID: 1995658 Comm: exe Tainted: G
>     W  O      5.15.5-cloudflare-kasan-2021.11.11 #1
> [177244.838672][T1995658] Hardware name: Quanta Cloud Technology Inc.
> QuantaPlex T42S-2U/T42S-2U MB (Lewisburg-4), BIOS 3B16.Q102 02/19/2020
> [177244.857210][T1995658] Call Trace:
> [177244.863733][T1995658]  <TASK>
> [177244.869871][T1995658]  dump_stack_lvl+0x34/0x44
> [177244.877583][T1995658]  print_address_description.constprop.0+0x1f/0x140
> [177244.887430][T1995658]  ? workingset_activation+0x2b2/0x2f0
> [177244.896156][T1995658]  ? workingset_activation+0x2b2/0x2f0
> [177244.904809][T1995658]  kasan_report.cold+0x83/0xdf
> [177244.912778][T1995658]  ? kvm_is_zone_device_pfn.part.0+0x40/0xd0 [kvm]
> [177244.922553][T1995658]  ? workingset_activation+0x2b2/0x2f0
> [177244.931236][T1995658]  workingset_activation+0x2b2/0x2f0
> [177244.939785][T1995658]  mark_page_accessed+0x44a/0x6a0
> [177244.948022][T1995658]  __handle_changed_spte+0x64c/0xca0 [kvm]
> [177244.957192][T1995658]  ? alloc_tdp_mmu_page+0x370/0x370 [kvm]
> [177244.966249][T1995658]  ? kvm_vcpu_release+0x4d/0x70 [kvm]
> [177244.974930][T1995658]  ? deref_stack_reg+0xe6/0x160
> [177244.983013][T1995658]  __handle_changed_spte+0x63c/0xca0 [kvm]
> [177244.992167][T1995658]  ? alloc_tdp_mmu_page+0x370/0x370 [kvm]
> [177245.001249][T1995658]  ? update_curr+0x18d/0x5f0
> [177245.009268][T1995658]  __handle_changed_spte+0x63c/0xca0 [kvm]
> [177245.018454][T1995658]  ? alloc_tdp_mmu_page+0x370/0x370 [kvm]
> [177245.027556][T1995658]  zap_gfn_range+0x549/0x620 [kvm]
> [177245.036055][T1995658]  ? zap_collapsible_spte_range+0x520/0x520 [kvm]
> [177245.045907][T1995658]  ? smp_call_function_single+0x271/0x370
> [177245.054992][T1995658]  ? _raw_spin_lock+0x81/0xd0
> [177245.063045][T1995658]  ? _raw_spin_lock_bh+0xe0/0xe0
> [177245.071332][T1995658]  ? vmx_vcpu_pi_load+0x14c/0x3b0 [kvm_intel]
> [177245.080861][T1995658]  kvm_tdp_mmu_put_root+0x1b6/0x270 [kvm]
> [177245.090147][T1995658]  mmu_free_root_page+0x219/0x2c0 [kvm]
> [177245.099215][T1995658]  ? ept_invlpg+0x780/0x780 [kvm]
> [177245.107719][T1995658]  ? _raw_spin_lock+0xd0/0xd0
> [177245.115776][T1995658]  ? handle_pause+0x250/0x250 [kvm_intel]
> [177245.124919][T1995658]  kvm_mmu_free_roots+0x1b4/0x4e0 [kvm]
> [177245.133974][T1995658]  ? mmu_free_root_page+0x2c0/0x2c0 [kvm]
> [177245.143143][T1995658]  ?
> kvm_clear_async_pf_completion_queue+0x2f/0x510 [kvm]
> [177245.153754][T1995658]  kvm_mmu_unload+0x1c/0xa0 [kvm]
> [177245.162289][T1995658]  kvm_arch_destroy_vm+0x1f2/0x5c0 [kvm]
> [177245.171426][T1995658]  ? mmu_notifier_unregister+0x26f/0x330
> [177245.180637][T1995658]  kvm_put_kvm+0x3b1/0x8b0 [kvm]
> [177245.189087][T1995658]  kvm_vcpu_release+0x4e/0x70 [kvm]
> [177245.197744][T1995658]  __fput+0x1f7/0x8c0
> [177245.205096][T1995658]  task_work_run+0xf8/0x1a0
> [177245.212916][T1995658]  do_exit+0x97b/0x2230
> [177245.220332][T1995658]  ? _raw_write_lock_irqsave+0xe0/0xe0
> [177245.229016][T1995658]  ? mm_update_next_owner+0x750/0x750
> [177245.237583][T1995658]  ? _raw_spin_lock_irq+0x82/0xd0
> [177245.245731][T1995658]  do_group_exit+0xda/0x2a0
> [177245.253265][T1995658]  get_signal+0x3be/0x1e50
> [177245.260585][T1995658]  arch_do_signal_or_restart+0x244/0x17f0
> [177245.269239][T1995658]  ? audit_log_exit+0x2690/0x2690
> [177245.277218][T1995658]  ? shmem_evict_inode+0xad0/0xad0
> [177245.285295][T1995658]  ? get_sigframe_size+0x10/0x10
> [177245.293162][T1995658]  ? __seccomp_filter+0x117/0xd60
> [177245.301157][T1995658]  ? audit_alloc_name+0x440/0x440
> [177245.309224][T1995658]  ? get_nth_filter.part.0+0x220/0x220
> [177245.317678][T1995658]  ? __audit_syscall_exit+0x794/0xa80
> [177245.325968][T1995658]  exit_to_user_mode_prepare+0xcb/0x120
> [177245.334463][T1995658]  syscall_exit_to_user_mode+0x1d/0x40
> [177245.342837][T1995658]  do_syscall_64+0x4d/0x90
> [177245.350176][T1995658]  entry_SYSCALL_64_after_hwframe+0x44/0xae
> [177245.358991][T1995658] RIP: 0033:0x4890ca
> [177245.365778][T1995658] Code: Unable to access opcode bytes at RIP 0x4890a0.
> [177245.375596][T1995658] RSP: 002b:000000c000508608 EFLAGS: 00000216
> ORIG_RAX: 000000000000011d
> [177245.387085][T1995658] RAX: 0000000000000000 RBX: 000000c00003e000
> RCX: 00000000004890ca
> [177245.387092][T1995658] RDX: 0000000036466000 RSI: 0000000000000003
> RDI: 0000000000000010
> [177245.387096][T1995658] RBP: 000000c000508660 R08: 0000000000000000
> R09: 0000000000000000
> [177245.387100][T1995658] R10: 000000000676b000 R11: 0000000000000216
> R12: 0000000000000009
> [177245.387103][T1995658] R13: 000000c00129d8b8 R14: 0000000000000001
> R15: 0000000000000000
> [177245.441910][T1995658]  </TASK>
> [177245.447826][T1995658]
> [177245.447827][T1995658] Allocated by task 182586:
> [177245.447830][T1995658]  kasan_save_stack+0x20/0x50
> [177245.467811][T1995658]  __kasan_kmalloc+0xa4/0xd0
> [177245.475233][T1995658]  fib6_info_alloc+0xa2/0x1d0
> [177245.475240][T1995658]  ip6_route_info_create+0x29f/0x1a30
> [177245.475243][T1995658]  ip6_route_add+0x18/0x100
> [177245.475246][T1995658]  addrconf_add_mroute+0x157/0x1b0
> [177245.506016][T1995658]  addrconf_notify+0x6a3/0x1510
> [177245.506021][T1995658]  notifier_call_chain+0x9e/0x180
> [177245.521427][T1995658]  __dev_notify_flags+0xda/0x230
> [177245.529175][T1995658]  rtnl_configure_link+0x125/0x200
> [177245.529181][T1995658]  __rtnl_newlink+0xd3d/0x13f0
> [177245.529186][T1995658]  rtnl_newlink+0x5f/0x90
> [177245.551464][T1995658]  rtnetlink_rcv_msg+0x378/0xa40
> [177245.551469][T1995658]  netlink_rcv_skb+0x125/0x380
> [177245.551472][T1995658]  netlink_unicast+0x4d0/0x7a0
> [177245.573756][T1995658]  netlink_sendmsg+0x724/0xc00
> [177245.573760][T1995658]  sock_sendmsg+0xe2/0x110
> [177245.573764][T1995658]  __sys_sendto+0x1a8/0x270
> [177245.595271][T1995658]  __x64_sys_sendto+0xdd/0x1b0
> [177245.602600][T1995658]  do_syscall_64+0x40/0x90
> [177245.609544][T1995658]  entry_SYSCALL_64_after_hwframe+0x44/0xae
> [177245.609549][T1995658]
> [177245.609550][T1995658] The buggy address belongs to the object at
> ffff8881749ab200
> [177245.609550][T1995658]  which belongs to the cache kmalloc-256 of size 256
> [177245.609554][T1995658] The buggy address is located 184 bytes to the right of
> [177245.609554][T1995658]  256-byte region [ffff8881749ab200, ffff8881749ab300)
> [177245.609558][T1995658] The buggy address belongs to the page:
> [177245.609559][T1995658] page:000000009030f8e1 refcount:1 mapcount:0
> mapping:0000000000000000 index:0x0 pfn:0x1749a8
> [177245.682919][T1995658] head:000000009030f8e1 order:2
> compound_mapcount:0 compound_pincount:0
> [177245.682924][T1995658] flags:
> 0x2ffff800010200(slab|head|node=0|zone=2|lastcpupid=0x1ffff)
> [177245.705180][T1995658] raw: 002ffff800010200 dead000000000100
> dead000000000122 ffff88810004cb40
> [177245.716738][T1995658] raw: 0000000000000000 0000000000200020
> 00000001ffffffff 0000000000000000
> [177245.716741][T1995658] page dumped because: kasan: bad access detected
> [177245.716743][T1995658]
> [177245.716744][T1995658] Memory state around the buggy address:
> [177245.716747][T1995658]  ffff8881749ab280: 00 00 00 00 00 00 00 00
> 00 00 00 00 00 00 00 00
> [177245.716749][T1995658]  ffff8881749ab300: fc fc fc fc fc fc fc fc
> fc fc fc fc fc fc fc fc
> [177245.716752][T1995658] >ffff8881749ab380: fc fc fc fc fc fc fc fc
> fc fc fc fc fc fc fc fc
> [177245.785224][T1995658]                                         ^
> [177245.785229][T1995658]  ffff8881749ab400: 00 00 00 00 00 00 00 00
> 00 00 00 00 00 00 00 00
> [177245.785231][T1995658]  ffff8881749ab480: 00 00 00 00 00 00 00 00
> 00 00 00 00 00 00 fc fc
> [177245.816146][T1995658]
> ==================================================================
> 
> And after that:
> 
> [177245.816196][T1995658] general protection fault, probably for
> non-canonical address 0xdffffc0000000011: 0000 [#1] SMP KASAN PTI
> [177245.854054][T1995658] KASAN: null-ptr-deref in range
> [0x0000000000000088-0x000000000000008f]
> [177245.865836][T1995658] CPU: 7 PID: 1995658 Comm: exe Tainted: G
> B   W  O      5.15.5-cloudflare-kasan-2021.11.11 #1
> [177245.865842][T1995658] Hardware name: Quanta Cloud Technology Inc.
> QuantaPlex T42S-2U/T42S-2U MB (Lewisburg-4), BIOS 3B16.Q102 02/19/2020
> [177245.865845][T1995658] RIP: 0010:workingset_activation+0x175/0x2f0
> [177245.909096][T1995658] Code: 80 3c 02 00 0f 85 58 01 00 00 4a 8b ac
> ed b8 0f 00 00 48 8d bd 88 00 00 00 48 b8 00 00 00 00 00 fc ff df 48
> 89 fa 48 c1 ea 03 <80> 3c 02 00 0f 85 0d 01 00 00 4c 3b a5 88 00 00 00
> 0f 85 c3 00 00
> [177245.936452][T1995658] RSP: 0018:ffff8881ec51f3c0 EFLAGS: 00010206
> [177245.936457][T1995658] RAX: dffffc0000000000 RBX: ffffea003237e480
> RCX: ffffffffa25388a6
> [177245.936460][T1995658] RDX: 0000000000000011 RSI: 0000000000000246
> RDI: 0000000000000088
> [177245.970834][T1995658] RBP: 0000000000000000 R08: 0000000000000001
> R09: ffffffffa6ced907
> [177245.970837][T1995658] R10: fffffbfff4d9db20 R11: 0000000000000010
> R12: ffff88983ffde000
> [177245.970839][T1995658] R13: 0000000000000000 R14: ffff8897a9bbbde0
> R15: 0600000c8df93b77
> [177246.007059][T1995658] FS:  0000000000000000(0000)
> GS:ffff8897a9b80000(0000) knlGS:0000000000000000
> [177246.020326][T1995658] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [177246.020330][T1995658] CR2: 000000c001503000 CR3: 0000001a1fb38006
> CR4: 00000000007726e0
> [177246.020332][T1995658] DR0: 0000000000000000 DR1: 0000000000000000
> DR2: 0000000000000000
> [177246.020334][T1995658] DR3: 0000000000000000 DR6: 00000000fffe0ff0
> DR7: 0000000000000400
> [177246.020337][T1995658] PKRU: 00000000
> [177246.020338][T1995658] Call Trace:
> [177246.020341][T1995658]  <TASK>
> [177246.090639][T1995658]  mark_page_accessed+0x44a/0x6a0
> [177246.099947][T1995658]  __handle_changed_spte+0x64c/0xca0 [kvm]
> [177246.110177][T1995658]  ? alloc_tdp_mmu_page+0x370/0x370 [kvm]
> [177246.110274][T1995658]  ? kvm_vcpu_release+0x4d/0x70 [kvm]
> [177246.129880][T1995658]  ? deref_stack_reg+0xe6/0x160
> [177246.138925][T1995658]  __handle_changed_spte+0x63c/0xca0 [kvm]
> [177246.139020][T1995658]  ? alloc_tdp_mmu_page+0x370/0x370 [kvm]
> [177246.158926][T1995658]  ? update_curr+0x18d/0x5f0
> [177246.167666][T1995658]  __handle_changed_spte+0x63c/0xca0 [kvm]
> [177246.167762][T1995658]  ? alloc_tdp_mmu_page+0x370/0x370 [kvm]
> [177246.187453][T1995658]  zap_gfn_range+0x549/0x620 [kvm]
> [177246.196673][T1995658]  ? zap_collapsible_spte_range+0x520/0x520 [kvm]
> [177246.196781][T1995658]  ? smp_call_function_single+0x271/0x370
> [177246.196789][T1995658]  ? _raw_spin_lock+0x81/0xd0
> [177246.196795][T1995658]  ? _raw_spin_lock_bh+0xe0/0xe0
> [177246.234516][T1995658]  ? vmx_vcpu_pi_load+0x14c/0x3b0 [kvm_intel]
> [177246.244565][T1995658]  kvm_tdp_mmu_put_root+0x1b6/0x270 [kvm]
> [177246.244680][T1995658]  mmu_free_root_page+0x219/0x2c0 [kvm]
> [177246.263832][T1995658]  ? ept_invlpg+0x780/0x780 [kvm]
> [177246.263923][T1995658]  ? _raw_spin_lock+0xd0/0xd0
> [177246.281421][T1995658]  ? handle_pause+0x250/0x250 [kvm_intel]
> [177246.291062][T1995658]  kvm_mmu_free_roots+0x1b4/0x4e0 [kvm]
> [177246.291156][T1995658]  ? mmu_free_root_page+0x2c0/0x2c0 [kvm]
> [177246.310225][T1995658]  ?
> kvm_clear_async_pf_completion_queue+0x2f/0x510 [kvm]
> [177246.310297][T1995658]  kvm_mmu_unload+0x1c/0xa0 [kvm]
> [177246.330051][T1995658]  kvm_arch_destroy_vm+0x1f2/0x5c0 [kvm]
> [177246.339422][T1995658]  ? mmu_notifier_unregister+0x26f/0x330
> [177246.339434][T1995658]  kvm_put_kvm+0x3b1/0x8b0 [kvm]
> [177246.357081][T1995658]  kvm_vcpu_release+0x4e/0x70 [kvm]
> [177246.365745][T1995658]  __fput+0x1f7/0x8c0
> [177246.365751][T1995658]  task_work_run+0xf8/0x1a0
> [177246.380735][T1995658]  do_exit+0x97b/0x2230
> [177246.388056][T1995658]  ? _raw_write_lock_irqsave+0xe0/0xe0
> [177246.388062][T1995658]  ? mm_update_next_owner+0x750/0x750
> [177246.388067][T1995658]  ? _raw_spin_lock_irq+0x82/0xd0
> [177246.413105][T1995658]  do_group_exit+0xda/0x2a0
> [177246.420604][T1995658]  get_signal+0x3be/0x1e50
> [177246.427936][T1995658]  arch_do_signal_or_restart+0x244/0x17f0
> [177246.427943][T1995658]  ? audit_log_exit+0x2690/0x2690
> [177246.444593][T1995658]  ? shmem_evict_inode+0xad0/0xad0
> [177246.444601][T1995658]  ? get_sigframe_size+0x10/0x10
> [177246.460403][T1995658]  ? __seccomp_filter+0x117/0xd60
> [177246.468231][T1995658]  ? audit_alloc_name+0x440/0x440
> [177246.468236][T1995658]  ? get_nth_filter.part.0+0x220/0x220
> [177246.468239][T1995658]  ? __audit_syscall_exit+0x794/0xa80
> [177246.468244][T1995658]  exit_to_user_mode_prepare+0xcb/0x120
> [177246.486130][T2343516] BUG: Bad page state in process dnsdist  pfn:3ac443
> [177246.492473][T1995658]  syscall_exit_to_user_mode+0x1d/0x40
> [177246.500835][T2343516] page:000000000df8ed4b refcount:0 mapcount:0
> mapping:0000000000000000 index:0x1 pfn:0x3ac443
> [177246.510416][T1995658]  do_syscall_64+0x4d/0x90
> [177246.518772][T2343516] flags:
> 0x2ffff80000000a(referenced|dirty|node=0|zone=2|lastcpupid=0x1ffff)
> [177246.532071][T1995658]  entry_SYSCALL_64_after_hwframe+0x44/0xae
> [177246.539574][T2343516] raw: 002ffff80000000a dead000000000100
> dead000000000122 0000000000000000
> [177246.551516][T1995658] RIP: 0033:0x4890ca
> [177246.560640][T2343516] raw: 0000000000000001 0000000000000000
> 00000000ffffffff 0000000000000000
> [177246.572553][T1995658] Code: Unable to access opcode bytes at RIP 0x4890a0.
> [177246.572556][T1995658] RSP: 002b:000000c000508608 EFLAGS: 00000216
> [177246.579708][T2343516] page dumped because:
> PAGE_FLAGS_CHECK_AT_PREP flag(s) set
> [177246.591667][T1995658]  ORIG_RAX: 000000000000011d
> [177246.601947][T2343516] Modules linked in: xt_hashlimit
> [177246.611442][T1995658] RAX: 0000000000000000 RBX: 000000c00003e000
> RCX: 00000000004890ca
> [177246.622186][T2343516]  xt_connlimit nf_conncount
> [177246.630505][T1995658] RDX: 0000000036466000 RSI: 0000000000000003
> RDI: 0000000000000010
> [177246.639127][T2343516]  ip_set_hash_netport xt_length
> [177246.650831][T1995658] RBP: 000000c000508660 R08: 0000000000000000
> R09: 0000000000000000
> [177246.659146][T2343516]  esp4 sit
> [177246.670801][T1995658] R10: 000000000676b000 R11: 0000000000000216
> R12: 0000000000000009
> [177246.679548][T2343516]  ipip tunnel4
> [177246.691327][T1995658] R13: 000000c00129d8b8 R14: 0000000000000001
> R15: 0000000000000000
> [177246.691332][T1995658]  </TASK>
> [177246.698177][T2343516]  nft_numgen
> [177246.710054][T1995658] Modules linked in: xt_hashlimit
> [177246.717376][T2343516]  nft_ct ip_gre
> [177246.729337][T1995658]  xt_connlimit nf_conncount
> [177246.736385][T2343516]  gre xfrm_user
> [177246.743566][T1995658]  ip_set_hash_netport xt_length
> [177246.752511][T2343516]  xfrm_algo tcp_diag
> [177246.759923][T1995658]  esp4 sit
> [177246.768447][T2343516]  udp_diag inet_diag
> [177246.775861][T1995658]  ipip tunnel4
> [177246.784702][T2343516]  fou6 fou
> [177246.792590][T1995658]  nft_numgen nft_ct
> [177246.799633][T2343516]  ip6_tunnel tunnel6
> [177246.807521][T1995658]  ip_gre gre
> [177246.814891][T2343516]  ip_tunnel
> [177246.821906][T1995658]  xfrm_user xfrm_algo
> [177246.829831][T2343516]  ip6_udp_tunnel udp_tunnel
> [177246.837687][T1995658]  tcp_diag udp_diag
> [177246.844899][T2343516]  cls_bpf tls
> [177246.851988][T1995658]  inet_diag fou6
> [177246.860054][T2343516]  xt_NFLOG
> [177246.868476][T1995658]  fou ip6_tunnel
> [177246.876205][T2343516]  xt_statistic
> [177246.883435][T1995658]  tunnel6 ip_tunnel
> [177246.890967][T2343516]  nft_compat
> [177246.897928][T1995658]  ip6_udp_tunnel udp_tunnel
> [177246.905366][T2343516]  veth
> [177246.912707][T1995658]  cls_bpf tls
> [177246.920424][T2343516]  tun overlay
> [177246.927451][T1995658]  xt_NFLOG xt_statistic
> [177246.935692][T2343516]  macvlan
> [177246.941990][T1995658]  nft_compat veth
> [177246.948848][T2343516]  sch_ingress raid0
> [177246.955617][T1995658]  tun overlay
> [177246.963226][T2343516]  md_mod essiv
> [177246.969540][T1995658]  macvlan sch_ingress
> [177246.976443][T2343516]  dm_crypt trusted
> [177246.983523][T1995658]  raid0 md_mod
> [177246.990026][T2343516]  asn1_encoder tee
> [177246.996500][T1995658]  essiv dm_crypt
> [177247.003663][T2343516]  dm_mod dax
> [177247.010435][T1995658]  trusted asn1_encoder
> [177247.016794][T2343516]  nfnetlink_log nft_log
> [177247.023445][T1995658]  tee dm_mod
> [177247.029892][T2343516]  nft_limit
> [177247.035861][T1995658]  dax nfnetlink_log
> [177247.042663][T2343516]  nft_counter nf_tables
> [177247.049640][T1995658]  nft_log
> [177247.055470][T2343516]  ip6table_nat
> [177247.061129][T1995658]  nft_limit nft_counter
> [177247.067401][T2343516]  ip6table_mangle
> [177247.073987][T1995658]  nf_tables ip6table_nat
> [177247.079498][T2343516]  ip6table_security ip6table_raw
> [177247.085277][T1995658]  ip6table_mangle
> [177247.091991][T2343516]  ip6table_filter ip6_tables
> [177247.098034][T1995658]  ip6table_security ip6table_raw
> [177247.104695][T2343516]  xt_nat iptable_nat
> [177247.112034][T1995658]  ip6table_filter ip6_tables
> [177247.118039][T2343516]  nf_nat xt_TCPMSS
> [177247.125004][T1995658]  xt_nat iptable_nat
> [177247.132376][T2343516]  xt_u32
> [177247.138705][T1995658]  nf_nat xt_TCPMSS xt_u32
> [177247.145715][T2343516]  xt_connmark
> [177247.151843][T1995658]  xt_connmark iptable_mangle xt_owner
> [177247.158146][T2343516]  iptable_mangle
> [177247.163422][T1995658]  xt_CT iptable_raw
> [177247.170185][T2343516]  xt_owner xt_CT
> [177247.175921][T1995658]  xt_state xt_bpf
> [177247.183751][T2343516]  iptable_raw xt_state
> [177247.189765][T1995658]  xt_mark xt_conntrack
> [177247.196014][T2343516]  xt_bpf xt_mark
> [177247.202024][T1995658]  xt_multiport xt_comment xt_tcpudp
> [177247.208091][T2343516]  xt_conntrack
> [177247.214620][T1995658]  xt_set xt_tcpmss iptable_filter
> [177247.221195][T2343516]  xt_multiport
> [177247.227181][T1995658]  ip_set_hash_net
> [177247.234862][T2343516]  xt_comment xt_tcpudp
> [177247.240694][T1995658]  ip_set_hash_ip ip_set
> [177247.248244][T2343516]  xt_set xt_tcpmss
> [177247.254108][T1995658]  nfnetlink sch_fq
> [177247.260257][T2343516]  iptable_filter ip_set_hash_net
> [177247.266823][T1995658]  tcp_bbr nf_conntrack
> [177247.273564][T2343516]  ip_set_hash_ip ip_set
> [177247.279798][T1995658]  nf_defrag_ipv6 nf_defrag_ipv4
> [177247.286054][T2343516]  nfnetlink sch_fq
> [177247.293537][T1995658]  8021q garp
> [177247.300146][T2343516]  tcp_bbr nf_conntrack
> [177247.306851][T1995658]  stp mrp
> [177247.314280][T2343516]  nf_defrag_ipv6
> [177247.320726][T1995658]  llc skx_edac
> [177247.326501][T2343516]  nf_defrag_ipv4
> [177247.333200][T1995658]  x86_pkg_temp_thermal
> [177247.338746][T2343516]  8021q garp
> [177247.344900][T1995658]  kvm_intel kvm
> [177247.350878][T2343516]  stp mrp
> [177247.356981][T1995658]  irqbypass crc32_pclmul
> [177247.363642][T2343516]  llc skx_edac
> [177247.369407][T1995658]  crc32c_intel aesni_intel
> [177247.375433][T2343516]  x86_pkg_temp_thermal kvm_intel
> [177247.380949][T1995658]  rapl intel_cstate
> [177247.387818][T2343516]  kvm irqbypass
> [177247.393792][T1995658]  ipmi_ssif sfc
> [177247.400823][T2343516]  crc32_pclmul crc32c_intel
> [177247.408402][T1995658]  intel_uncore
> [177247.414816][T2343516]  aesni_intel rapl
> [177247.420895][T1995658]  i2c_i801 xhci_pci
> [177247.426998][T2343516]  intel_cstate
> [177247.434174][T1995658]  i2c_smbus acpi_ipmi
> [177247.440133][T2343516]  ipmi_ssif
> [177247.446447][T1995658]  i40e mdio
> [177247.452828][T2343516]  sfc intel_uncore
> [177247.458764][T1995658]  ioatdma i2c_core xhci_hcd tpm_crb
> [177247.465315][T2343516]  i2c_i801
> [177247.470986][T1995658]  dca ipmi_si
> [177247.476630][T2343516]  xhci_pci i2c_smbus
> [177247.482924][T1995658]  ipmi_devintf ipmi_msghandler
> [177247.490884][T2343516]  acpi_ipmi
> [177247.496448][T1995658]  tpm_tis tpm_tis_core
> [177247.502266][T2343516]  i40e
> [177247.508870][T1995658]  tpm tiny_power_button
> [177247.516227][T2343516]  mdio
> [177247.521960][T1995658]  button fuse
> [177247.528752][T2343516]  ioatdma i2c_core
> [177247.534017][T1995658]  efivarfs
> [177247.540736][T2343516]  xhci_hcd tpm_crb
> [177247.545945][T1995658]  ip_tables x_tables
> [177247.551808][T2343516]  dca ipmi_si
> [177247.558097][T1995658]  bcmcrypt(O)
> [177247.563798][T2343516]  ipmi_devintf ipmi_msghandler
> [177247.570085][T1995658]  crypto_simd
> [177247.576506][T2343516]  tpm_tis tpm_tis_core
> [177247.582311][T1995658]  cryptd [last unloaded: kheaders]
> [177247.582349][T1995658] ---[ end trace bf693a2532f213e5 ]---
> [177247.588090][T2343516]  tpm tiny_power_button button fuse efivarfs ip_tables
> [177247.671277][T1995658] RIP: 0010:workingset_activation+0x175/0x2f0
> [177247.673366][T2343516]  x_tables bcmcrypt(O) crypto_simd
> [177247.682978][T1995658] Code: 80 3c 02 00 0f 85 58 01 00 00 4a 8b ac
> ed b8 0f 00 00 48 8d bd 88 00 00 00 48 b8 00 00 00 00 00 fc ff df 48
> 89 fa 48 c1 ea 03 <80> 3c 02 00 0f 85 0d 01 00 00 4c 3b a5 88 00 00 00
> 0f 85 c3 00 00
> [177247.691625][T2343516]  cryptd [last unloaded: kheaders]
> [177247.691633][T2343516] CPU: 30 PID: 2343516 Comm: dnsdist Tainted:
> G    B D W  O      5.15.5-cloudflare-kasan-2021.11.11 #1
> [177247.691639][T2343516] Hardware name: Quanta Cloud Technology Inc.
> QuantaPlex T42S-2U/T42S-2U MB (Lewisburg-4), BIOS 3B16.Q102 02/19/2020
> [177247.691642][T2343516] Call Trace:
> [177247.691645][T2343516]  <TASK>
> [177247.699575][T1995658] RSP: 0018:ffff8881ec51f3c0 EFLAGS: 00010206
> [177247.724584][T2343516]  dump_stack_lvl+0x34/0x44
> [177247.724595][T2343516]  bad_page.cold+0xc0/0xe1
> [177247.732606][T1995658]
> [177247.746616][T2343516]  rmqueue_bulk+0x8e5/0xe00
> [177247.746627][T2343516]  ? find_suitable_fallback+0x470/0x470
> [177247.764885][T1995658] RAX: dffffc0000000000 RBX: ffffea003237e480
> RCX: ffffffffa25388a6
> [177247.771268][T2343516]  get_page_from_freelist+0x18ff/0x2920
> [177247.771279][T2343516]  ? __zone_watermark_ok+0x340/0x340
> [177247.777285][T1995658] RDX: 0000000000000011 RSI: 0000000000000246
> RDI: 0000000000000088
> [177247.786537][T2343516]  __alloc_pages+0x2ac/0x5b0
> [177247.786543][T2343516]  ? __alloc_pages_slowpath.constprop.0+0x1e20/0x1e20
> [177247.786548][T2343516]  alloc_pages_vma+0xbc/0x570
> [177247.794244][T1995658] RBP: 0000000000000000 R08: 0000000000000001
> R09: ffffffffa6ced907
> [177247.801816][T2343516]  __handle_mm_fault+0x14d6/0x3a00
> [177247.801822][T2343516]  ? vm_iomap_memory+0x1d0/0x1d0
> [177247.801827][T2343516]  ? down_read_trylock+0xeb/0x180
> [177247.807247][T1995658] R10: fffffbfff4d9db20 R11: 0000000000000010
> R12: ffff88983ffde000
> [177247.814873][T2343516]  handle_mm_fault+0x1cc/0x650
> [177247.814878][T2343516]  do_user_addr_fault+0x303/0xd40
> [177247.814885][T2343516]  exc_page_fault+0x52/0xb0
> [177247.823577][T1995658] R13: 0000000000000000 R14: ffff8897a9bbbde0
> R15: 0600000c8df93b77
> [177247.834904][T2343516]  ? asm_exc_page_fault+0x8/0x30
> [177247.834911][T2343516]  asm_exc_page_fault+0x1e/0x30
> [177247.834915][T2343516] RIP: 0033:0x557a8ea64f19
> [177247.843738][T1995658] FS:  0000000000000000(0000)
> GS:ffff8897a9b80000(0000) knlGS:0000000000000000
> [177247.852290][T2343516] Code: c7 80 04 ff ff ff 00 00 00 00 c7 80 18
> ff ff ff 00 00 00 00 48 c7 80 20 ff ff ff 00 00 00 00 48 c7 80 28 ff
> ff ff 00 00 00 00 <c6> 80 30 ff ff ff 01 c6 80 38 ff ff ff 01 c6 80 39
> ff ff ff 00 48
> [177247.852295][T2343516] RSP: 002b:00007fff6f3f11c0 EFLAGS: 00010246
> [177247.852299][T2343516] RAX: 0000557a9840d0d0 RBX: 0000557a92334cc0
> RCX: 0000000000000000
> [177247.852301][T2343516] RDX: ffffffffffffffff RSI: 0000000000000000
> RDI: 0000000000000000
> [177247.852304][T2343516] RBP: 00000000000099d4 R08: 0000000000000000
> R09: 0000000000000000
> [177247.863709][T1995658] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [177247.871673][T2343516] R10: 0000000000000000 R11: 0000000000000000
> R12: 0000557a991452f0
> [177247.871676][T2343516] R13: 00007fff6f3f1420 R14: 00007fff6f3f1440
> R15: 0000000000000001
> [177247.871680][T2343516]  </TASK>
> 
> After this the machine starts spitting some traces starting with:
> 
> [177247.871683][T2343516] BUG: Bad page state in process <some comm
> name>  pfn:fe680a
> 
> And eventually gradually locks up:
> 
> NMI watchdog: Watchdog detected hard LOCKUP on cpu 81
> 
> The comment in kvm_main.c before the code mentioned in the first
> warning states that the warning is there to indicate incorrect usage
> of the function - and probably it is, given the consequences.
> 
> About our workload: this bug is most likely triggered by gvisor [1]
> with the KVM backend as we don't have any other KVM users on these
> systems.
> 
> We suspect it was not triggered before as kernels before 5.15 did not
> have TDP MMU enabled by default [2].
> 
> It seems we even want to remove this warning as overaggressive [3],
> however it is indicative in this case.
> 
> Unfortunately, I couldn't easily reproduce the issue synthetically
> (tried both running the KVM selftests as well as gvisor KVM tests).
> Any help/pointers would be appreciated.
> 
> [1]: https://github.com/google/gvisor
> [2]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=71ba3f3189c78f756a659568fb473600fd78f207
> [3]: https://lore.kernel.org/kvm/20211129034317.2964790-5-stevensd@google.com/
> 



* Re: Potential bug in TDP MMU
  2021-11-30  9:29 ` Paolo Bonzini
@ 2021-11-30 10:58   ` Ignat Korchagin
  2021-11-30 10:59     ` Ignat Korchagin
                       ` (2 more replies)
  0 siblings, 3 replies; 20+ messages in thread
From: Ignat Korchagin @ 2021-11-30 10:58 UTC (permalink / raw)
  To: Paolo Bonzini, kvm; +Cc: stevensd, kernel-team

I have managed to reliably reproduce the issue on a QEMU VM (on a host
with nested virtualisation enabled). Here are the steps:

1. Install gvisor as per
https://gvisor.dev/docs/user_guide/install/#install-latest
2. Run
$ for i in $(seq 1 100); do sudo runsc --platform=kvm --network=none do echo ok; done
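
To check whether a run of the loop above hit the problem, one option is
to scan the kernel log for the signatures quoted in this thread. The
sketch below is illustrative only: the pattern names and the
find_signatures() helper are hypothetical, and it assumes log lines
carry the "[timestamp][Ttid]" prefix seen in the traces here (printk
caller info enabled).

```python
import re
from collections import defaultdict

# Patterns modelled on the messages quoted in this thread; the names are
# made up for this sketch.
SIGNATURES = {
    "kvm_warning": re.compile(r"WARNING: .*kvm_is_zone_device_pfn"),
    "kasan": re.compile(r"BUG: KASAN"),
    "bad_page": re.compile(r"BUG: Bad page state"),
    "gpf": re.compile(r"general protection fault"),
}

# "[177243.621744][T1995658] message": timestamp, caller id, then the text.
PREFIX = re.compile(r"^\[\s*(\d+\.\d+)\]\[\s*([TC]\d+)\]\s(.*)$")


def find_signatures(log_text):
    """Return {caller_id: [(timestamp, signature_name), ...]} in log order."""
    hits = defaultdict(list)
    for line in log_text.splitlines():
        m = PREFIX.match(line)
        if not m:
            continue  # wrapped or unprefixed line
        timestamp, caller, message = m.groups()
        for name, pattern in SIGNATURES.items():
            if pattern.search(message):
                hits[caller].append((timestamp, name))
    return dict(hits)
```

Grouping by caller id also helps when output from two tasks is
interleaved, as in the later part of the quoted log. The function takes
the log as a string, so it can be fed the saved output of dmesg from
the test VM.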

I've tried recompiling the kernel with the above patch, but
unfortunately it does not fix the issue. I'm happy to try other
patches/fixes queued for 5.16-rc4.

Ignat

On Tue, Nov 30, 2021 at 9:29 AM Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> On 11/29/21 22:44, Ignat Korchagin wrote:
> > Hello,
> >
> > We have recently started to evaluate 5.15.y kernel series and here is
> > what we occasionally get on kernel 5.15.5:
>
> I'm not sure if it's this, but there are quite a few fixes I've queued
> for 5.16-rc4, and I'll be sending a pull request to Linus shortly.  So
> we can revisit this in a week.
>
> This is the most likely fix:
>
> https://patchwork.kernel.org/project/kvm/patch/20211120045046.3940942-2-seanjc@google.com/
>
> Paolo
>
> > [177243.621744][T1995658] WARNING: CPU: 7 PID: 1995658 at
> > arch/x86/kvm/../../../virt/kvm/kvm_main.c:171
> > kvm_is_zone_device_pfn.part.0+0x9e/0xd0 [kvm]
> > [177243.647435][T1995658] Modules linked in: xt_hashlimit xt_connlimit
> > nf_conncount ip_set_hash_netport xt_length esp4 sit ipip tunnel4
> > nft_numgen nft_ct ip_gre gre xfrm_user xfrm_algo tcp_diag udp_diag
> > inet_diag fou6 fou ip6_tunnel tunnel6 ip_tunnel ip6_udp_tunnel
> > udp_tunnel cls_bpf tls xt_NFLOG xt_statistic nft_compat veth tun
> > overlay macvlan sch_ingress raid0 md_mod essiv dm_crypt trusted
> > asn1_encoder tee dm_mod dax nfnetlink_log nft_log nft_limit
> > nft_counter nf_tables ip6table_nat ip6table_mangle ip6table_security
> > ip6table_raw ip6table_filter ip6_tables xt_nat iptable_nat nf_nat
> > xt_TCPMSS xt_u32 xt_connmark iptable_mangle xt_owner xt_CT iptable_raw
> > xt_state xt_bpf xt_mark xt_conntrack xt_multiport xt_comment xt_tcpudp
> > xt_set xt_tcpmss iptable_filter ip_set_hash_net ip_set_hash_ip ip_set
> > nfnetlink sch_fq tcp_bbr nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4
> > 8021q garp stp mrp llc skx_edac x86_pkg_temp_thermal kvm_intel kvm
> > irqbypass crc32_pclmul crc32c_intel aesni_intel rapl
> > [177243.647594][T1995658]  intel_cstate ipmi_ssif sfc intel_uncore
> > i2c_i801 xhci_pci i2c_smbus acpi_ipmi i40e mdio ioatdma i2c_core
> > xhci_hcd tpm_crb dca ipmi_si ipmi_devintf ipmi_msghandler tpm_tis
> > tpm_tis_core tpm tiny_power_button button fuse efivarfs ip_tables
> > x_tables bcmcrypt(O) crypto_simd cryptd [last unloaded: kheaders]
> > [177243.831600][T1995658] CPU: 7 PID: 1995658 Comm: exe Tainted: G
> >        O      5.15.5-cloudflare-kasan-2021.11.11 #1
> > [177243.831609][T1995658] Hardware name: Quanta Cloud Technology Inc.
> > QuantaPlex T42S-2U/T42S-2U MB (Lewisburg-4), BIOS 3B16.Q102 02/19/2020
> > [177243.831612][T1995658] RIP:
> > 0010:kvm_is_zone_device_pfn.part.0+0x9e/0xd0 [kvm]
> > [177243.886990][T1995658] Code: 00 00 00 00 fc ff df 48 c1 ea 03 0f b6
> > 14 02 48 89 e8 83 e0 07 83 c0 03 38 d0 7c 04 84 d2 75 0f 8b 43 34 85
> > c0 74 03 5b 5d c3 <0f> 0b 5b 5d c3 48 89 ef e8 c5 68 72 e1 eb e7 e8 ce
> > 68 72 e1 eb 9b
> > [177243.919270][T1995658] RSP: 0018:ffff8881ec51f300 EFLAGS: 00010246
> > [177243.919276][T1995658] RAX: 0000000000000000 RBX: ffffea003d52e900
> > RCX: ffffffffc143242e
> > [177243.919279][T1995658] RDX: 0000000000000000 RSI: 0000000000000004
> > RDI: ffffea003d52e934
> > [177243.960080][T1995658] RBP: ffffea003d52e934 R08: 0000000000000000
> > R09: ffffea003d52e937
> > [177243.960083][T1995658] R10: fffff94007aa5d26 R11: 0000000000000000
> > R12: ffff88b03ffd9008
> > [177243.960085][T1995658] R13: 0600000f54ba4b77 R14: 0000000000000001
> > R15: 0600000f54ba4b01
> > [177243.960088][T1995658] FS:  0000000000000000(0000)
> > GS:ffff8897a9b80000(0000) knlGS:0000000000000000
> > [177244.017687][T1995658] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [177244.017691][T1995658] CR2: 000000c001503000 CR3: 0000001a1fb38006
> > CR4: 00000000007726e0
> > [177244.017693][T1995658] DR0: 0000000000000000 DR1: 0000000000000000
> > DR2: 0000000000000000
> > [177244.058771][T1995658] DR3: 0000000000000000 DR6: 00000000fffe0ff0
> > DR7: 0000000000000400
> > [177244.058774][T1995658] PKRU: 00000000
> > [177244.058776][T1995658] Call Trace:
> > [177244.058780][T1995658]  <TASK>
> > [177244.058782][T1995658]  kvm_set_pfn_dirty+0x120/0x1d0 [kvm]
> > [177244.111155][T1995658]  __handle_changed_spte+0x92e/0xca0 [kvm]
> > [177244.111274][T1995658]  ? alloc_tdp_mmu_page+0x370/0x370 [kvm]
> > [177244.133938][T1995658]  ? sched_clock_cpu+0x15/0x190
> > [177244.144289][T1995658]  ? _raw_spin_lock+0xc8/0xd0
> > [177244.144299][T1995658]  __handle_changed_spte+0x63c/0xca0 [kvm]
> > [177244.165796][T1995658]  ? alloc_tdp_mmu_page+0x370/0x370 [kvm]
> > [177244.165906][T1995658]  ? kvm_vcpu_release+0x4d/0x70 [kvm]
> > [177244.187769][T1995658]  ? deref_stack_reg+0xe6/0x160
> > [177244.187779][T1995658]  __handle_changed_spte+0x63c/0xca0 [kvm]
> > [177244.209044][T1995658]  ? alloc_tdp_mmu_page+0x370/0x370 [kvm]
> > [177244.220140][T1995658]  ? update_curr+0x18d/0x5f0
> > [177244.220148][T1995658]  __handle_changed_spte+0x63c/0xca0 [kvm]
> > [177244.240784][T1995658]  ? alloc_tdp_mmu_page+0x370/0x370 [kvm]
> > [177244.251827][T1995658]  zap_gfn_range+0x549/0x620 [kvm]
> > [177244.251922][T1995658]  ? zap_collapsible_spte_range+0x520/0x520 [kvm]
> > [177244.273410][T1995658]  ? smp_call_function_single+0x271/0x370
> > [177244.283892][T1995658]  ? _raw_spin_lock+0x81/0xd0
> > [177244.283900][T1995658]  ? _raw_spin_lock_bh+0xe0/0xe0
> > [177244.283904][T1995658]  ? vmx_vcpu_pi_load+0x14c/0x3b0 [kvm_intel]
> > [177244.313033][T1995658]  kvm_tdp_mmu_put_root+0x1b6/0x270 [kvm]
> > [177244.323097][T1995658]  mmu_free_root_page+0x219/0x2c0 [kvm]
> > [177244.332862][T1995658]  ? ept_invlpg+0x780/0x780 [kvm]
> > [177244.341982][T1995658]  ? _raw_spin_lock+0xd0/0xd0
> > [177244.350556][T1995658]  ? handle_pause+0x250/0x250 [kvm_intel]
> > [177244.360080][T1995658]  kvm_mmu_free_roots+0x1b4/0x4e0 [kvm]
> > [177244.369396][T1995658]  ? mmu_free_root_page+0x2c0/0x2c0 [kvm]
> > [177244.378777][T1995658]  ?
> > kvm_clear_async_pf_completion_queue+0x2f/0x510 [kvm]
> > [177244.389571][T1995658]  kvm_mmu_unload+0x1c/0xa0 [kvm]
> > [177244.398156][T1995658]  kvm_arch_destroy_vm+0x1f2/0x5c0 [kvm]
> > [177244.398238][T1995658]  ? mmu_notifier_unregister+0x26f/0x330
> > [177244.416160][T1995658]  kvm_put_kvm+0x3b1/0x8b0 [kvm]
> > [177244.424372][T1995658]  kvm_vcpu_release+0x4e/0x70 [kvm]
> > [177244.432743][T1995658]  __fput+0x1f7/0x8c0
> > [177244.432749][T1995658]  task_work_run+0xf8/0x1a0
> > [177244.447257][T1995658]  do_exit+0x97b/0x2230
> > [177244.447263][T1995658]  ? _raw_write_lock_irqsave+0xe0/0xe0
> > [177244.462773][T1995658]  ? mm_update_next_owner+0x750/0x750
> > [177244.471117][T1995658]  ? _raw_spin_lock_irq+0x82/0xd0
> > [177244.471122][T1995658]  do_group_exit+0xda/0x2a0
> > [177244.471126][T1995658]  get_signal+0x3be/0x1e50
> > [177244.471133][T1995658]  arch_do_signal_or_restart+0x244/0x17f0
> > [177244.502424][T1995658]  ? audit_log_exit+0x2690/0x2690
> > [177244.502432][T1995658]  ? shmem_evict_inode+0xad0/0xad0
> > [177244.518403][T1995658]  ? get_sigframe_size+0x10/0x10
> > [177244.526157][T1995658]  ? __seccomp_filter+0x117/0xd60
> > [177244.526162][T1995658]  ? audit_alloc_name+0x440/0x440
> > [177244.526166][T1995658]  ? get_nth_filter.part.0+0x220/0x220
> > [177244.526170][T1995658]  ? __audit_syscall_exit+0x794/0xa80
> > [177244.558471][T1995658]  exit_to_user_mode_prepare+0xcb/0x120
> > [177244.558478][T1995658]  syscall_exit_to_user_mode+0x1d/0x40
> > [177244.575193][T1995658]  do_syscall_64+0x4d/0x90
> > [177244.575197][T1995658]  entry_SYSCALL_64_after_hwframe+0x44/0xae
> > [177244.591189][T1995658] RIP: 0033:0x4890ca
> > [177244.591199][T1995658] Code: Unable to access opcode bytes at RIP 0x4890a0.
> > [177244.591201][T1995658] RSP: 002b:000000c000508608 EFLAGS: 00000216
> > ORIG_RAX: 000000000000011d
> > [177244.607705][T1995658] RAX: 0000000000000000 RBX: 000000c00003e000
> > RCX: 00000000004890ca
> > [177244.630001][T1995658] RDX: 0000000036466000 RSI: 0000000000000003
> > RDI: 0000000000000010
> > [177244.630003][T1995658] RBP: 000000c000508660 R08: 0000000000000000
> > R09: 0000000000000000
> > [177244.630005][T1995658] R10: 000000000676b000 R11: 0000000000000216
> > R12: 0000000000000009
> > [177244.630007][T1995658] R13: 000000c00129d8b8 R14: 0000000000000001
> > R15: 0000000000000000
> > [177244.630011][T1995658]  </TASK>
> > [177244.679411][T1995658] ---[ end trace bf693a2532f213e4 ]---
> >
> > Then immediately this KASAN warning:
> >
> > [177244.798046][T1995658] BUG: KASAN: slab-out-of-bounds in
> > workingset_activation+0x2b2/0x2f0
> > [177244.809161][T1995658] Read of size 8 at addr ffff8881749ab3b8 by
> > task exe/1995658
> > [177244.819636][T1995658]
> > [177244.824947][T1995658] CPU: 7 PID: 1995658 Comm: exe Tainted: G
> >     W  O      5.15.5-cloudflare-kasan-2021.11.11 #1
> > [177244.838672][T1995658] Hardware name: Quanta Cloud Technology Inc.
> > QuantaPlex T42S-2U/T42S-2U MB (Lewisburg-4), BIOS 3B16.Q102 02/19/2020
> > [177244.857210][T1995658] Call Trace:
> > [177244.863733][T1995658]  <TASK>
> > [177244.869871][T1995658]  dump_stack_lvl+0x34/0x44
> > [177244.877583][T1995658]  print_address_description.constprop.0+0x1f/0x140
> > [177244.887430][T1995658]  ? workingset_activation+0x2b2/0x2f0
> > [177244.896156][T1995658]  ? workingset_activation+0x2b2/0x2f0
> > [177244.904809][T1995658]  kasan_report.cold+0x83/0xdf
> > [177244.912778][T1995658]  ? kvm_is_zone_device_pfn.part.0+0x40/0xd0 [kvm]
> > [177244.922553][T1995658]  ? workingset_activation+0x2b2/0x2f0
> > [177244.931236][T1995658]  workingset_activation+0x2b2/0x2f0
> > [177244.939785][T1995658]  mark_page_accessed+0x44a/0x6a0
> > [177244.948022][T1995658]  __handle_changed_spte+0x64c/0xca0 [kvm]
> > [177244.957192][T1995658]  ? alloc_tdp_mmu_page+0x370/0x370 [kvm]
> > [177244.966249][T1995658]  ? kvm_vcpu_release+0x4d/0x70 [kvm]
> > [177244.974930][T1995658]  ? deref_stack_reg+0xe6/0x160
> > [177244.983013][T1995658]  __handle_changed_spte+0x63c/0xca0 [kvm]
> > [177244.992167][T1995658]  ? alloc_tdp_mmu_page+0x370/0x370 [kvm]
> > [177245.001249][T1995658]  ? update_curr+0x18d/0x5f0
> > [177245.009268][T1995658]  __handle_changed_spte+0x63c/0xca0 [kvm]
> > [177245.018454][T1995658]  ? alloc_tdp_mmu_page+0x370/0x370 [kvm]
> > [177245.027556][T1995658]  zap_gfn_range+0x549/0x620 [kvm]
> > [177245.036055][T1995658]  ? zap_collapsible_spte_range+0x520/0x520 [kvm]
> > [177245.045907][T1995658]  ? smp_call_function_single+0x271/0x370
> > [177245.054992][T1995658]  ? _raw_spin_lock+0x81/0xd0
> > [177245.063045][T1995658]  ? _raw_spin_lock_bh+0xe0/0xe0
> > [177245.071332][T1995658]  ? vmx_vcpu_pi_load+0x14c/0x3b0 [kvm_intel]
> > [177245.080861][T1995658]  kvm_tdp_mmu_put_root+0x1b6/0x270 [kvm]
> > [177245.090147][T1995658]  mmu_free_root_page+0x219/0x2c0 [kvm]
> > [177245.099215][T1995658]  ? ept_invlpg+0x780/0x780 [kvm]
> > [177245.107719][T1995658]  ? _raw_spin_lock+0xd0/0xd0
> > [177245.115776][T1995658]  ? handle_pause+0x250/0x250 [kvm_intel]
> > [177245.124919][T1995658]  kvm_mmu_free_roots+0x1b4/0x4e0 [kvm]
> > [177245.133974][T1995658]  ? mmu_free_root_page+0x2c0/0x2c0 [kvm]
> > [177245.143143][T1995658]  ?
> > kvm_clear_async_pf_completion_queue+0x2f/0x510 [kvm]
> > [177245.153754][T1995658]  kvm_mmu_unload+0x1c/0xa0 [kvm]
> > [177245.162289][T1995658]  kvm_arch_destroy_vm+0x1f2/0x5c0 [kvm]
> > [177245.171426][T1995658]  ? mmu_notifier_unregister+0x26f/0x330
> > [177245.180637][T1995658]  kvm_put_kvm+0x3b1/0x8b0 [kvm]
> > [177245.189087][T1995658]  kvm_vcpu_release+0x4e/0x70 [kvm]
> > [177245.197744][T1995658]  __fput+0x1f7/0x8c0
> > [177245.205096][T1995658]  task_work_run+0xf8/0x1a0
> > [177245.212916][T1995658]  do_exit+0x97b/0x2230
> > [177245.220332][T1995658]  ? _raw_write_lock_irqsave+0xe0/0xe0
> > [177245.229016][T1995658]  ? mm_update_next_owner+0x750/0x750
> > [177245.237583][T1995658]  ? _raw_spin_lock_irq+0x82/0xd0
> > [177245.245731][T1995658]  do_group_exit+0xda/0x2a0
> > [177245.253265][T1995658]  get_signal+0x3be/0x1e50
> > [177245.260585][T1995658]  arch_do_signal_or_restart+0x244/0x17f0
> > [177245.269239][T1995658]  ? audit_log_exit+0x2690/0x2690
> > [177245.277218][T1995658]  ? shmem_evict_inode+0xad0/0xad0
> > [177245.285295][T1995658]  ? get_sigframe_size+0x10/0x10
> > [177245.293162][T1995658]  ? __seccomp_filter+0x117/0xd60
> > [177245.301157][T1995658]  ? audit_alloc_name+0x440/0x440
> > [177245.309224][T1995658]  ? get_nth_filter.part.0+0x220/0x220
> > [177245.317678][T1995658]  ? __audit_syscall_exit+0x794/0xa80
> > [177245.325968][T1995658]  exit_to_user_mode_prepare+0xcb/0x120
> > [177245.334463][T1995658]  syscall_exit_to_user_mode+0x1d/0x40
> > [177245.342837][T1995658]  do_syscall_64+0x4d/0x90
> > [177245.350176][T1995658]  entry_SYSCALL_64_after_hwframe+0x44/0xae
> > [177245.358991][T1995658] RIP: 0033:0x4890ca
> > [177245.365778][T1995658] Code: Unable to access opcode bytes at RIP 0x4890a0.
> > [177245.375596][T1995658] RSP: 002b:000000c000508608 EFLAGS: 00000216
> > ORIG_RAX: 000000000000011d
> > [177245.387085][T1995658] RAX: 0000000000000000 RBX: 000000c00003e000
> > RCX: 00000000004890ca
> > [177245.387092][T1995658] RDX: 0000000036466000 RSI: 0000000000000003
> > RDI: 0000000000000010
> > [177245.387096][T1995658] RBP: 000000c000508660 R08: 0000000000000000
> > R09: 0000000000000000
> > [177245.387100][T1995658] R10: 000000000676b000 R11: 0000000000000216
> > R12: 0000000000000009
> > [177245.387103][T1995658] R13: 000000c00129d8b8 R14: 0000000000000001
> > R15: 0000000000000000
> > [177245.441910][T1995658]  </TASK>
> > [177245.447826][T1995658]
> > [177245.447827][T1995658] Allocated by task 182586:
> > [177245.447830][T1995658]  kasan_save_stack+0x20/0x50
> > [177245.467811][T1995658]  __kasan_kmalloc+0xa4/0xd0
> > [177245.475233][T1995658]  fib6_info_alloc+0xa2/0x1d0
> > [177245.475240][T1995658]  ip6_route_info_create+0x29f/0x1a30
> > [177245.475243][T1995658]  ip6_route_add+0x18/0x100
> > [177245.475246][T1995658]  addrconf_add_mroute+0x157/0x1b0
> > [177245.506016][T1995658]  addrconf_notify+0x6a3/0x1510
> > [177245.506021][T1995658]  notifier_call_chain+0x9e/0x180
> > [177245.521427][T1995658]  __dev_notify_flags+0xda/0x230
> > [177245.529175][T1995658]  rtnl_configure_link+0x125/0x200
> > [177245.529181][T1995658]  __rtnl_newlink+0xd3d/0x13f0
> > [177245.529186][T1995658]  rtnl_newlink+0x5f/0x90
> > [177245.551464][T1995658]  rtnetlink_rcv_msg+0x378/0xa40
> > [177245.551469][T1995658]  netlink_rcv_skb+0x125/0x380
> > [177245.551472][T1995658]  netlink_unicast+0x4d0/0x7a0
> > [177245.573756][T1995658]  netlink_sendmsg+0x724/0xc00
> > [177245.573760][T1995658]  sock_sendmsg+0xe2/0x110
> > [177245.573764][T1995658]  __sys_sendto+0x1a8/0x270
> > [177245.595271][T1995658]  __x64_sys_sendto+0xdd/0x1b0
> > [177245.602600][T1995658]  do_syscall_64+0x40/0x90
> > [177245.609544][T1995658]  entry_SYSCALL_64_after_hwframe+0x44/0xae
> > [177245.609549][T1995658]
> > [177245.609550][T1995658] The buggy address belongs to the object at
> > ffff8881749ab200
> > [177245.609550][T1995658]  which belongs to the cache kmalloc-256 of size 256
> > [177245.609554][T1995658] The buggy address is located 184 bytes to the right of
> > [177245.609554][T1995658]  256-byte region [ffff8881749ab200, ffff8881749ab300)
> > [177245.609558][T1995658] The buggy address belongs to the page:
> > [177245.609559][T1995658] page:000000009030f8e1 refcount:1 mapcount:0
> > mapping:0000000000000000 index:0x0 pfn:0x1749a8
> > [177245.682919][T1995658] head:000000009030f8e1 order:2
> > compound_mapcount:0 compound_pincount:0
> > [177245.682924][T1995658] flags:
> > 0x2ffff800010200(slab|head|node=0|zone=2|lastcpupid=0x1ffff)
> > [177245.705180][T1995658] raw: 002ffff800010200 dead000000000100
> > dead000000000122 ffff88810004cb40
> > [177245.716738][T1995658] raw: 0000000000000000 0000000000200020
> > 00000001ffffffff 0000000000000000
> > [177245.716741][T1995658] page dumped because: kasan: bad access detected
> > [177245.716743][T1995658]
> > [177245.716744][T1995658] Memory state around the buggy address:
> > [177245.716747][T1995658]  ffff8881749ab280: 00 00 00 00 00 00 00 00
> > 00 00 00 00 00 00 00 00
> > [177245.716749][T1995658]  ffff8881749ab300: fc fc fc fc fc fc fc fc
> > fc fc fc fc fc fc fc fc
> > [177245.716752][T1995658] >ffff8881749ab380: fc fc fc fc fc fc fc fc
> > fc fc fc fc fc fc fc fc
> > [177245.785224][T1995658]                                         ^
> > [177245.785229][T1995658]  ffff8881749ab400: 00 00 00 00 00 00 00 00
> > 00 00 00 00 00 00 00 00
> > [177245.785231][T1995658]  ffff8881749ab480: 00 00 00 00 00 00 00 00
> > 00 00 00 00 00 00 fc fc
> > [177245.816146][T1995658]
> > ==================================================================
> >
> > And after that:
> >
> > [177245.816196][T1995658] general protection fault, probably for
> > non-canonical address 0xdffffc0000000011: 0000 [#1] SMP KASAN PTI
> > [177245.854054][T1995658] KASAN: null-ptr-deref in range
> > [0x0000000000000088-0x000000000000008f]
> > [177245.865836][T1995658] CPU: 7 PID: 1995658 Comm: exe Tainted: G
> > B   W  O      5.15.5-cloudflare-kasan-2021.11.11 #1
> > [177245.865842][T1995658] Hardware name: Quanta Cloud Technology Inc.
> > QuantaPlex T42S-2U/T42S-2U MB (Lewisburg-4), BIOS 3B16.Q102 02/19/2020
> > [177245.865845][T1995658] RIP: 0010:workingset_activation+0x175/0x2f0
> > [177245.909096][T1995658] Code: 80 3c 02 00 0f 85 58 01 00 00 4a 8b ac
> > ed b8 0f 00 00 48 8d bd 88 00 00 00 48 b8 00 00 00 00 00 fc ff df 48
> > 89 fa 48 c1 ea 03 <80> 3c 02 00 0f 85 0d 01 00 00 4c 3b a5 88 00 00 00
> > 0f 85 c3 00 00
> > [177245.936452][T1995658] RSP: 0018:ffff8881ec51f3c0 EFLAGS: 00010206
> > [177245.936457][T1995658] RAX: dffffc0000000000 RBX: ffffea003237e480
> > RCX: ffffffffa25388a6
> > [177245.936460][T1995658] RDX: 0000000000000011 RSI: 0000000000000246
> > RDI: 0000000000000088
> > [177245.970834][T1995658] RBP: 0000000000000000 R08: 0000000000000001
> > R09: ffffffffa6ced907
> > [177245.970837][T1995658] R10: fffffbfff4d9db20 R11: 0000000000000010
> > R12: ffff88983ffde000
> > [177245.970839][T1995658] R13: 0000000000000000 R14: ffff8897a9bbbde0
> > R15: 0600000c8df93b77
> > [177246.007059][T1995658] FS:  0000000000000000(0000)
> > GS:ffff8897a9b80000(0000) knlGS:0000000000000000
> > [177246.020326][T1995658] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [177246.020330][T1995658] CR2: 000000c001503000 CR3: 0000001a1fb38006
> > CR4: 00000000007726e0
> > [177246.020332][T1995658] DR0: 0000000000000000 DR1: 0000000000000000
> > DR2: 0000000000000000
> > [177246.020334][T1995658] DR3: 0000000000000000 DR6: 00000000fffe0ff0
> > DR7: 0000000000000400
> > [177246.020337][T1995658] PKRU: 00000000
> > [177246.020338][T1995658] Call Trace:
> > [177246.020341][T1995658]  <TASK>
> > [177246.090639][T1995658]  mark_page_accessed+0x44a/0x6a0
> > [177246.099947][T1995658]  __handle_changed_spte+0x64c/0xca0 [kvm]
> > [177246.110177][T1995658]  ? alloc_tdp_mmu_page+0x370/0x370 [kvm]
> > [177246.110274][T1995658]  ? kvm_vcpu_release+0x4d/0x70 [kvm]
> > [177246.129880][T1995658]  ? deref_stack_reg+0xe6/0x160
> > [177246.138925][T1995658]  __handle_changed_spte+0x63c/0xca0 [kvm]
> > [177246.139020][T1995658]  ? alloc_tdp_mmu_page+0x370/0x370 [kvm]
> > [177246.158926][T1995658]  ? update_curr+0x18d/0x5f0
> > [177246.167666][T1995658]  __handle_changed_spte+0x63c/0xca0 [kvm]
> > [177246.167762][T1995658]  ? alloc_tdp_mmu_page+0x370/0x370 [kvm]
> > [177246.187453][T1995658]  zap_gfn_range+0x549/0x620 [kvm]
> > [177246.196673][T1995658]  ? zap_collapsible_spte_range+0x520/0x520 [kvm]
> > [177246.196781][T1995658]  ? smp_call_function_single+0x271/0x370
> > [177246.196789][T1995658]  ? _raw_spin_lock+0x81/0xd0
> > [177246.196795][T1995658]  ? _raw_spin_lock_bh+0xe0/0xe0
> > [177246.234516][T1995658]  ? vmx_vcpu_pi_load+0x14c/0x3b0 [kvm_intel]
> > [177246.244565][T1995658]  kvm_tdp_mmu_put_root+0x1b6/0x270 [kvm]
> > [177246.244680][T1995658]  mmu_free_root_page+0x219/0x2c0 [kvm]
> > [177246.263832][T1995658]  ? ept_invlpg+0x780/0x780 [kvm]
> > [177246.263923][T1995658]  ? _raw_spin_lock+0xd0/0xd0
> > [177246.281421][T1995658]  ? handle_pause+0x250/0x250 [kvm_intel]
> > [177246.291062][T1995658]  kvm_mmu_free_roots+0x1b4/0x4e0 [kvm]
> > [177246.291156][T1995658]  ? mmu_free_root_page+0x2c0/0x2c0 [kvm]
> > [177246.310225][T1995658]  ?
> > kvm_clear_async_pf_completion_queue+0x2f/0x510 [kvm]
> > [177246.310297][T1995658]  kvm_mmu_unload+0x1c/0xa0 [kvm]
> > [177246.330051][T1995658]  kvm_arch_destroy_vm+0x1f2/0x5c0 [kvm]
> > [177246.339422][T1995658]  ? mmu_notifier_unregister+0x26f/0x330
> > [177246.339434][T1995658]  kvm_put_kvm+0x3b1/0x8b0 [kvm]
> > [177246.357081][T1995658]  kvm_vcpu_release+0x4e/0x70 [kvm]
> > [177246.365745][T1995658]  __fput+0x1f7/0x8c0
> > [177246.365751][T1995658]  task_work_run+0xf8/0x1a0
> > [177246.380735][T1995658]  do_exit+0x97b/0x2230
> > [177246.388056][T1995658]  ? _raw_write_lock_irqsave+0xe0/0xe0
> > [177246.388062][T1995658]  ? mm_update_next_owner+0x750/0x750
> > [177246.388067][T1995658]  ? _raw_spin_lock_irq+0x82/0xd0
> > [177246.413105][T1995658]  do_group_exit+0xda/0x2a0
> > [177246.420604][T1995658]  get_signal+0x3be/0x1e50
> > [177246.427936][T1995658]  arch_do_signal_or_restart+0x244/0x17f0
> > [177246.427943][T1995658]  ? audit_log_exit+0x2690/0x2690
> > [177246.444593][T1995658]  ? shmem_evict_inode+0xad0/0xad0
> > [177246.444601][T1995658]  ? get_sigframe_size+0x10/0x10
> > [177246.460403][T1995658]  ? __seccomp_filter+0x117/0xd60
> > [177246.468231][T1995658]  ? audit_alloc_name+0x440/0x440
> > [177246.468236][T1995658]  ? get_nth_filter.part.0+0x220/0x220
> > [177246.468239][T1995658]  ? __audit_syscall_exit+0x794/0xa80
> > [177246.468244][T1995658]  exit_to_user_mode_prepare+0xcb/0x120
> > [177246.486130][T2343516] BUG: Bad page state in process dnsdist  pfn:3ac443
> > [177246.492473][T1995658]  syscall_exit_to_user_mode+0x1d/0x40
> > [177246.500835][T2343516] page:000000000df8ed4b refcount:0 mapcount:0
> > mapping:0000000000000000 index:0x1 pfn:0x3ac443
> > [177246.510416][T1995658]  do_syscall_64+0x4d/0x90
> > [177246.518772][T2343516] flags:
> > 0x2ffff80000000a(referenced|dirty|node=0|zone=2|lastcpupid=0x1ffff)
> > [177246.532071][T1995658]  entry_SYSCALL_64_after_hwframe+0x44/0xae
> > [177246.539574][T2343516] raw: 002ffff80000000a dead000000000100
> > dead000000000122 0000000000000000
> > [177246.551516][T1995658] RIP: 0033:0x4890ca
> > [177246.560640][T2343516] raw: 0000000000000001 0000000000000000
> > 00000000ffffffff 0000000000000000
> > [177246.572553][T1995658] Code: Unable to access opcode bytes at RIP 0x4890a0.
> > [177246.572556][T1995658] RSP: 002b:000000c000508608 EFLAGS: 00000216
> > [177246.579708][T2343516] page dumped because:
> > PAGE_FLAGS_CHECK_AT_PREP flag(s) set
> > [177246.591667][T1995658]  ORIG_RAX: 000000000000011d
> > [177246.601947][T2343516] Modules linked in: xt_hashlimit
> > [177246.611442][T1995658] RAX: 0000000000000000 RBX: 000000c00003e000
> > RCX: 00000000004890ca
> > [177246.622186][T2343516]  xt_connlimit nf_conncount
> > [177246.630505][T1995658] RDX: 0000000036466000 RSI: 0000000000000003
> > RDI: 0000000000000010
> > [177246.639127][T2343516]  ip_set_hash_netport xt_length
> > [177246.650831][T1995658] RBP: 000000c000508660 R08: 0000000000000000
> > R09: 0000000000000000
> > [177246.659146][T2343516]  esp4 sit
> > [177246.670801][T1995658] R10: 000000000676b000 R11: 0000000000000216
> > R12: 0000000000000009
> > [177246.679548][T2343516]  ipip tunnel4
> > [177246.691327][T1995658] R13: 000000c00129d8b8 R14: 0000000000000001
> > R15: 0000000000000000
> > [177246.691332][T1995658]  </TASK>
> > [177246.698177][T2343516]  nft_numgen
> > [177246.710054][T1995658] Modules linked in: xt_hashlimit
> > [177246.717376][T2343516]  nft_ct ip_gre
> > [177246.729337][T1995658]  xt_connlimit nf_conncount
> > [177246.736385][T2343516]  gre xfrm_user
> > [177246.743566][T1995658]  ip_set_hash_netport xt_length
> > [177246.752511][T2343516]  xfrm_algo tcp_diag
> > [177246.759923][T1995658]  esp4 sit
> > [177246.768447][T2343516]  udp_diag inet_diag
> > [177246.775861][T1995658]  ipip tunnel4
> > [177246.784702][T2343516]  fou6 fou
> > [177246.792590][T1995658]  nft_numgen nft_ct
> > [177246.799633][T2343516]  ip6_tunnel tunnel6
> > [177246.807521][T1995658]  ip_gre gre
> > [177246.814891][T2343516]  ip_tunnel
> > [177246.821906][T1995658]  xfrm_user xfrm_algo
> > [177246.829831][T2343516]  ip6_udp_tunnel udp_tunnel
> > [177246.837687][T1995658]  tcp_diag udp_diag
> > [177246.844899][T2343516]  cls_bpf tls
> > [177246.851988][T1995658]  inet_diag fou6
> > [177246.860054][T2343516]  xt_NFLOG
> > [177246.868476][T1995658]  fou ip6_tunnel
> > [177246.876205][T2343516]  xt_statistic
> > [177246.883435][T1995658]  tunnel6 ip_tunnel
> > [177246.890967][T2343516]  nft_compat
> > [177246.897928][T1995658]  ip6_udp_tunnel udp_tunnel
> > [177246.905366][T2343516]  veth
> > [177246.912707][T1995658]  cls_bpf tls
> > [177246.920424][T2343516]  tun overlay
> > [177246.927451][T1995658]  xt_NFLOG xt_statistic
> > [177246.935692][T2343516]  macvlan
> > [177246.941990][T1995658]  nft_compat veth
> > [177246.948848][T2343516]  sch_ingress raid0
> > [177246.955617][T1995658]  tun overlay
> > [177246.963226][T2343516]  md_mod essiv
> > [177246.969540][T1995658]  macvlan sch_ingress
> > [177246.976443][T2343516]  dm_crypt trusted
> > [177246.983523][T1995658]  raid0 md_mod
> > [177246.990026][T2343516]  asn1_encoder tee
> > [177246.996500][T1995658]  essiv dm_crypt
> > [177247.003663][T2343516]  dm_mod dax
> > [177247.010435][T1995658]  trusted asn1_encoder
> > [177247.016794][T2343516]  nfnetlink_log nft_log
> > [177247.023445][T1995658]  tee dm_mod
> > [177247.029892][T2343516]  nft_limit
> > [177247.035861][T1995658]  dax nfnetlink_log
> > [177247.042663][T2343516]  nft_counter nf_tables
> > [177247.049640][T1995658]  nft_log
> > [177247.055470][T2343516]  ip6table_nat
> > [177247.061129][T1995658]  nft_limit nft_counter
> > [177247.067401][T2343516]  ip6table_mangle
> > [177247.073987][T1995658]  nf_tables ip6table_nat
> > [177247.079498][T2343516]  ip6table_security ip6table_raw
> > [177247.085277][T1995658]  ip6table_mangle
> > [177247.091991][T2343516]  ip6table_filter ip6_tables
> > [177247.098034][T1995658]  ip6table_security ip6table_raw
> > [177247.104695][T2343516]  xt_nat iptable_nat
> > [177247.112034][T1995658]  ip6table_filter ip6_tables
> > [177247.118039][T2343516]  nf_nat xt_TCPMSS
> > [177247.125004][T1995658]  xt_nat iptable_nat
> > [177247.132376][T2343516]  xt_u32
> > [177247.138705][T1995658]  nf_nat xt_TCPMSS xt_u32
> > [177247.145715][T2343516]  xt_connmark
> > [177247.151843][T1995658]  xt_connmark iptable_mangle xt_owner
> > [177247.158146][T2343516]  iptable_mangle
> > [177247.163422][T1995658]  xt_CT iptable_raw
> > [177247.170185][T2343516]  xt_owner xt_CT
> > [177247.175921][T1995658]  xt_state xt_bpf
> > [177247.183751][T2343516]  iptable_raw xt_state
> > [177247.189765][T1995658]  xt_mark xt_conntrack
> > [177247.196014][T2343516]  xt_bpf xt_mark
> > [177247.202024][T1995658]  xt_multiport xt_comment xt_tcpudp
> > [177247.208091][T2343516]  xt_conntrack
> > [177247.214620][T1995658]  xt_set xt_tcpmss iptable_filter
> > [177247.221195][T2343516]  xt_multiport
> > [177247.227181][T1995658]  ip_set_hash_net
> > [177247.234862][T2343516]  xt_comment xt_tcpudp
> > [177247.240694][T1995658]  ip_set_hash_ip ip_set
> > [177247.248244][T2343516]  xt_set xt_tcpmss
> > [177247.254108][T1995658]  nfnetlink sch_fq
> > [177247.260257][T2343516]  iptable_filter ip_set_hash_net
> > [177247.266823][T1995658]  tcp_bbr nf_conntrack
> > [177247.273564][T2343516]  ip_set_hash_ip ip_set
> > [177247.279798][T1995658]  nf_defrag_ipv6 nf_defrag_ipv4
> > [177247.286054][T2343516]  nfnetlink sch_fq
> > [177247.293537][T1995658]  8021q garp
> > [177247.300146][T2343516]  tcp_bbr nf_conntrack
> > [177247.306851][T1995658]  stp mrp
> > [177247.314280][T2343516]  nf_defrag_ipv6
> > [177247.320726][T1995658]  llc skx_edac
> > [177247.326501][T2343516]  nf_defrag_ipv4
> > [177247.333200][T1995658]  x86_pkg_temp_thermal
> > [177247.338746][T2343516]  8021q garp
> > [177247.344900][T1995658]  kvm_intel kvm
> > [177247.350878][T2343516]  stp mrp
> > [177247.356981][T1995658]  irqbypass crc32_pclmul
> > [177247.363642][T2343516]  llc skx_edac
> > [177247.369407][T1995658]  crc32c_intel aesni_intel
> > [177247.375433][T2343516]  x86_pkg_temp_thermal kvm_intel
> > [177247.380949][T1995658]  rapl intel_cstate
> > [177247.387818][T2343516]  kvm irqbypass
> > [177247.393792][T1995658]  ipmi_ssif sfc
> > [177247.400823][T2343516]  crc32_pclmul crc32c_intel
> > [177247.408402][T1995658]  intel_uncore
> > [177247.414816][T2343516]  aesni_intel rapl
> > [177247.420895][T1995658]  i2c_i801 xhci_pci
> > [177247.426998][T2343516]  intel_cstate
> > [177247.434174][T1995658]  i2c_smbus acpi_ipmi
> > [177247.440133][T2343516]  ipmi_ssif
> > [177247.446447][T1995658]  i40e mdio
> > [177247.452828][T2343516]  sfc intel_uncore
> > [177247.458764][T1995658]  ioatdma i2c_core xhci_hcd tpm_crb
> > [177247.465315][T2343516]  i2c_i801
> > [177247.470986][T1995658]  dca ipmi_si
> > [177247.476630][T2343516]  xhci_pci i2c_smbus
> > [177247.482924][T1995658]  ipmi_devintf ipmi_msghandler
> > [177247.490884][T2343516]  acpi_ipmi
> > [177247.496448][T1995658]  tpm_tis tpm_tis_core
> > [177247.502266][T2343516]  i40e
> > [177247.508870][T1995658]  tpm tiny_power_button
> > [177247.516227][T2343516]  mdio
> > [177247.521960][T1995658]  button fuse
> > [177247.528752][T2343516]  ioatdma i2c_core
> > [177247.534017][T1995658]  efivarfs
> > [177247.540736][T2343516]  xhci_hcd tpm_crb
> > [177247.545945][T1995658]  ip_tables x_tables
> > [177247.551808][T2343516]  dca ipmi_si
> > [177247.558097][T1995658]  bcmcrypt(O)
> > [177247.563798][T2343516]  ipmi_devintf ipmi_msghandler
> > [177247.570085][T1995658]  crypto_simd
> > [177247.576506][T2343516]  tpm_tis tpm_tis_core
> > [177247.582311][T1995658]  cryptd [last unloaded: kheaders]
> > [177247.582349][T1995658] ---[ end trace bf693a2532f213e5 ]---
> > [177247.588090][T2343516]  tpm tiny_power_button button fuse efivarfs ip_tables
> > [177247.671277][T1995658] RIP: 0010:workingset_activation+0x175/0x2f0
> > [177247.673366][T2343516]  x_tables bcmcrypt(O) crypto_simd
> > [177247.682978][T1995658] Code: 80 3c 02 00 0f 85 58 01 00 00 4a 8b ac
> > ed b8 0f 00 00 48 8d bd 88 00 00 00 48 b8 00 00 00 00 00 fc ff df 48
> > 89 fa 48 c1 ea 03 <80> 3c 02 00 0f 85 0d 01 00 00 4c 3b a5 88 00 00 00
> > 0f 85 c3 00 00
> > [177247.691625][T2343516]  cryptd [last unloaded: kheaders]
> > [177247.691633][T2343516] CPU: 30 PID: 2343516 Comm: dnsdist Tainted:
> > G    B D W  O      5.15.5-cloudflare-kasan-2021.11.11 #1
> > [177247.691639][T2343516] Hardware name: Quanta Cloud Technology Inc.
> > QuantaPlex T42S-2U/T42S-2U MB (Lewisburg-4), BIOS 3B16.Q102 02/19/2020
> > [177247.691642][T2343516] Call Trace:
> > [177247.691645][T2343516]  <TASK>
> > [177247.699575][T1995658] RSP: 0018:ffff8881ec51f3c0 EFLAGS: 00010206
> > [177247.724584][T2343516]  dump_stack_lvl+0x34/0x44
> > [177247.724595][T2343516]  bad_page.cold+0xc0/0xe1
> > [177247.732606][T1995658]
> > [177247.746616][T2343516]  rmqueue_bulk+0x8e5/0xe00
> > [177247.746627][T2343516]  ? find_suitable_fallback+0x470/0x470
> > [177247.764885][T1995658] RAX: dffffc0000000000 RBX: ffffea003237e480
> > RCX: ffffffffa25388a6
> > [177247.771268][T2343516]  get_page_from_freelist+0x18ff/0x2920
> > [177247.771279][T2343516]  ? __zone_watermark_ok+0x340/0x340
> > [177247.777285][T1995658] RDX: 0000000000000011 RSI: 0000000000000246
> > RDI: 0000000000000088
> > [177247.786537][T2343516]  __alloc_pages+0x2ac/0x5b0
> > [177247.786543][T2343516]  ? __alloc_pages_slowpath.constprop.0+0x1e20/0x1e20
> > [177247.786548][T2343516]  alloc_pages_vma+0xbc/0x570
> > [177247.794244][T1995658] RBP: 0000000000000000 R08: 0000000000000001
> > R09: ffffffffa6ced907
> > [177247.801816][T2343516]  __handle_mm_fault+0x14d6/0x3a00
> > [177247.801822][T2343516]  ? vm_iomap_memory+0x1d0/0x1d0
> > [177247.801827][T2343516]  ? down_read_trylock+0xeb/0x180
> > [177247.807247][T1995658] R10: fffffbfff4d9db20 R11: 0000000000000010
> > R12: ffff88983ffde000
> > [177247.814873][T2343516]  handle_mm_fault+0x1cc/0x650
> > [177247.814878][T2343516]  do_user_addr_fault+0x303/0xd40
> > [177247.814885][T2343516]  exc_page_fault+0x52/0xb0
> > [177247.823577][T1995658] R13: 0000000000000000 R14: ffff8897a9bbbde0
> > R15: 0600000c8df93b77
> > [177247.834904][T2343516]  ? asm_exc_page_fault+0x8/0x30
> > [177247.834911][T2343516]  asm_exc_page_fault+0x1e/0x30
> > [177247.834915][T2343516] RIP: 0033:0x557a8ea64f19
> > [177247.843738][T1995658] FS:  0000000000000000(0000)
> > GS:ffff8897a9b80000(0000) knlGS:0000000000000000
> > [177247.852290][T2343516] Code: c7 80 04 ff ff ff 00 00 00 00 c7 80 18
> > ff ff ff 00 00 00 00 48 c7 80 20 ff ff ff 00 00 00 00 48 c7 80 28 ff
> > ff ff 00 00 00 00 <c6> 80 30 ff ff ff 01 c6 80 38 ff ff ff 01 c6 80 39
> > ff ff ff 00 48
> > [177247.852295][T2343516] RSP: 002b:00007fff6f3f11c0 EFLAGS: 00010246
> > [177247.852299][T2343516] RAX: 0000557a9840d0d0 RBX: 0000557a92334cc0
> > RCX: 0000000000000000
> > [177247.852301][T2343516] RDX: ffffffffffffffff RSI: 0000000000000000
> > RDI: 0000000000000000
> > [177247.852304][T2343516] RBP: 00000000000099d4 R08: 0000000000000000
> > R09: 0000000000000000
> > [177247.863709][T1995658] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [177247.871673][T2343516] R10: 0000000000000000 R11: 0000000000000000
> > R12: 0000557a991452f0
> > [177247.871676][T2343516] R13: 00007fff6f3f1420 R14: 00007fff6f3f1440
> > R15: 0000000000000001
> > [177247.871680][T2343516]  </TASK>
> >
> > After this the machine starts spitting some traces starting with:
> >
> > [177247.871683][T2343516] BUG: Bad page state in process <some comm
> > name>  pfn:fe680a
> >
> > And eventually gradually locks up:
> >
> > NMI watchdog: Watchdog detected hard LOCKUP on cpu 81
> >
> > The comment in kvm_main.c before the code mentioned in the first
> > warning states that the warning is there to indicate incorrect usage
> > of the function - and probably it is, given the consequences.
> >
> > About our workload: this bug is most likely triggered by gvisor [1]
> > with the KVM backend as we don't have any other KVM users on these
> > systems.
> >
> > We suspect it was not triggered before as kernels before 5.15 did not
> > have TDP MMU enabled by default [2].
> >
> > It seems we even want to remove this warning as overaggressive [3],
> > however it is indicative in this case.
> >
> > Unfortunately, I couldn't easily reproduce the issue synthetically
> > (tried both running the KVM selftests as well as gvisor KVM tests).
> > Any help/pointers would be appreciated.
> >
> > [1]: https://github.com/google/gvisor
> > [2]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=71ba3f3189c78f756a659568fb473600fd78f207
> > [3]: https://lore.kernel.org/kvm/20211129034317.2964790-5-stevensd@google.com/
> >
>
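
The quoted log interleaves output from two tasks ([T1995658] and
[T2343516]), which makes each trace hard to follow. A minimal sketch of
separating the lines by task tag is shown here. The function name
split_by_task and the regular expressions are illustrative assumptions,
not part of any kernel tooling. The sketch assumes the
"[timestamp][T<pid>]" prefix seen in this log, and treats an untagged
non-empty line as the mail-wrapped continuation of the previous tagged
line.

```python
import re
from collections import defaultdict

# Optional "> " quote markers, then "[timestamp][T<pid>]", then the message.
LINE_RE = re.compile(r"^(?:>\s?)*\[\s*(\d+\.\d+)\]\[\s*(T\d+)\]\s?(.*)$")
# Leading "> " quote markers only, used to clean continuation lines.
QUOTE_RE = re.compile(r"^(?:>\s?)*")


def split_by_task(lines):
    """Group kernel log lines by their [T<pid>] tag, preserving order."""
    by_task = defaultdict(list)
    last_task = None
    for raw in lines:
        line = raw.rstrip("\n")
        m = LINE_RE.match(line)
        if m:
            last_task = m.group(2)
            by_task[last_task].append(m.group(3).strip())
            continue
        cont = QUOTE_RE.sub("", line).strip()
        if not cont:
            # A blank (or quote-only) line ends the current log excerpt.
            last_task = None
        elif last_task is not None:
            # Untagged text right after a tagged line: assume mail wrapping.
            by_task[last_task][-1] += " " + cont
    return dict(by_task)
```

Feeding the whole quoted report through split_by_task and printing one
task at a time gives the dnsdist "Bad page state" trace and the exe
trace as two separate, readable sequences.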

* Re: Potential bug in TDP MMU
  2021-11-30 10:58   ` Ignat Korchagin
@ 2021-11-30 10:59     ` Ignat Korchagin
  2021-11-30 11:11     ` Paolo Bonzini
  2021-11-30 20:23     ` Sean Christopherson
  2 siblings, 0 replies; 20+ messages in thread
From: Ignat Korchagin @ 2021-11-30 10:59 UTC (permalink / raw)
  To: Paolo Bonzini, kvm; +Cc: stevensd, kernel-team

Typo in my previous message: "it does fix the issue" should read
"it does not fix the issue".

Ignat
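
For readers following the quoted report: the general protection fault
address 0xdffffc0000000011 and the "null-ptr-deref in range
[0x88-0x8f]" message are related by the KASAN shadow mapping. The sketch
below is a worked calculation, not kernel code. It assumes generic KASAN
on x86_64 with 4-level paging, where the shadow offset is
0xdffffc0000000000 (the value in RAX in the register dump) and one shadow
byte covers 8 bytes of memory. The function names are illustrative.

```python
KASAN_SHADOW_OFFSET = 0xDFFFFC0000000000  # assumed: x86_64, 4-level paging
KASAN_SHADOW_SCALE_SHIFT = 3              # one shadow byte per 8 bytes
MASK64 = (1 << 64) - 1


def shadow_address(addr):
    """Shadow byte address that KASAN checks before accessing addr."""
    return ((addr >> KASAN_SHADOW_SCALE_SHIFT) + KASAN_SHADOW_OFFSET) & MASK64


def shadowed_range(shadow):
    """Inverse mapping: the 8-byte range covered by one shadow byte."""
    start = (((shadow - KASAN_SHADOW_OFFSET) & MASK64)
             << KASAN_SHADOW_SCALE_SHIFT) & MASK64
    return start, start + 7


def is_canonical(addr):
    """True if bits 63..47 are all equal (48-bit virtual addresses)."""
    top = addr >> 47
    return top == 0 or top == 0x1FFFF


rdi = 0x88  # RDI in the register dump: the address about to be read
shadow = shadow_address(rdi)
print(hex(shadow))             # shadow address for 0x88
print(is_canonical(shadow))    # whether that shadow address is canonical
print([hex(x) for x in shadowed_range(shadow)])
```

With RDI = 0x88 the shadow address comes out as RAX + RDX from the dump
(0xdffffc0000000000 + 0x11), which is non-canonical, so the KASAN check
itself faults. This suggests workingset_activation was reading offset
0x88 from a NULL base pointer (RBP is 0 in the dump).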

On Tue, Nov 30, 2021 at 10:58 AM Ignat Korchagin <ignat@cloudflare.com> wrote:
>
> I have managed to reliably reproduce the issue on a QEMU VM (on a host
> with nested virtualisation enabled). Here are the steps:
>
> 1. Install gvisor as per
> https://gvisor.dev/docs/user_guide/install/#install-latest
> 2. Run
> $ for i in $(seq 1 100); do sudo runsc --platform=kvm --network=none
> do echo ok; done
>
> I've tried to recompile the kernel with the above patch, but
> unfortunately it does fix the issue. I'm happy to try other
> patches/fixes queued for 5.16-rc4
>
> Ignat
>
> On Tue, Nov 30, 2021 at 9:29 AM Paolo Bonzini <pbonzini@redhat.com> wrote:
> >
> > On 11/29/21 22:44, Ignat Korchagin wrote:
> > > Hello,
> > >
> > > We have recently started to evaluate 5.15.y kernel series and here is
> > > what we occasionally get on kernel 5.15.5:
> >
> > I'm not sure if it's this, but there are quite a few fixes I've queued
> > for 5.16-rc4, and I'll be sending a pull request to Linus shortly.  So
> > we can revisit this in a week.
> >
> > This is the most likely fix:
> >
> > https://patchwork.kernel.org/project/kvm/patch/20211120045046.3940942-2-seanjc@google.com/
> >
> > Paolo
> >
> > > [177243.621744][T1995658] WARNING: CPU: 7 PID: 1995658 at
> > > arch/x86/kvm/../../../virt/kvm/kvm_main.c:171
> > > kvm_is_zone_device_pfn.part.0+0x9e/0xd0 [kvm]
> > > [177243.647435][T1995658] Modules linked in: xt_hashlimit xt_connlimit
> > > nf_conncount ip_set_hash_netport xt_length esp4 sit ipip tunnel4
> > > nft_numgen nft_ct ip_gre gre xfrm_user xfrm_algo tcp_diag udp_diag
> > > inet_diag fou6 fou ip6_tunnel tunnel6 ip_tunnel ip6_udp_tunnel
> > > udp_tunnel cls_bpf tls xt_NFLOG xt_statistic nft_compat veth tun
> > > overlay macvlan sch_ingress raid0 md_mod essiv dm_crypt trusted
> > > asn1_encoder tee dm_mod dax nfnetlink_log nft_log nft_limit
> > > nft_counter nf_tables ip6table_nat ip6table_mangle ip6table_security
> > > ip6table_raw ip6table_filter ip6_tables xt_nat iptable_nat nf_nat
> > > xt_TCPMSS xt_u32 xt_connmark iptable_mangle xt_owner xt_CT iptable_raw
> > > xt_state xt_bpf xt_mark xt_conntrack xt_multiport xt_comment xt_tcpudp
> > > xt_set xt_tcpmss iptable_filter ip_set_hash_net ip_set_hash_ip ip_set
> > > nfnetlink sch_fq tcp_bbr nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4
> > > 8021q garp stp mrp llc skx_edac x86_pkg_temp_thermal kvm_intel kvm
> > > irqbypass crc32_pclmul crc32c_intel aesni_intel rapl
> > > [177243.647594][T1995658]  intel_cstate ipmi_ssif sfc intel_uncore
> > > i2c_i801 xhci_pci i2c_smbus acpi_ipmi i40e mdio ioatdma i2c_core
> > > xhci_hcd tpm_crb dca ipmi_si ipmi_devintf ipmi_msghandler tpm_tis
> > > tpm_tis_core tpm tiny_power_button button fuse efivarfs ip_tables
> > > x_tables bcmcrypt(O) crypto_simd cryptd [last unloaded: kheaders]
> > > [177243.831600][T1995658] CPU: 7 PID: 1995658 Comm: exe Tainted: G
> > >        O      5.15.5-cloudflare-kasan-2021.11.11 #1
> > > [177243.831609][T1995658] Hardware name: Quanta Cloud Technology Inc.
> > > QuantaPlex T42S-2U/T42S-2U MB (Lewisburg-4), BIOS 3B16.Q102 02/19/2020
> > > [177243.831612][T1995658] RIP:
> > > 0010:kvm_is_zone_device_pfn.part.0+0x9e/0xd0 [kvm]
> > > [177243.886990][T1995658] Code: 00 00 00 00 fc ff df 48 c1 ea 03 0f b6
> > > 14 02 48 89 e8 83 e0 07 83 c0 03 38 d0 7c 04 84 d2 75 0f 8b 43 34 85
> > > c0 74 03 5b 5d c3 <0f> 0b 5b 5d c3 48 89 ef e8 c5 68 72 e1 eb e7 e8 ce
> > > 68 72 e1 eb 9b
> > > [177243.919270][T1995658] RSP: 0018:ffff8881ec51f300 EFLAGS: 00010246
> > > [177243.919276][T1995658] RAX: 0000000000000000 RBX: ffffea003d52e900
> > > RCX: ffffffffc143242e
> > > [177243.919279][T1995658] RDX: 0000000000000000 RSI: 0000000000000004
> > > RDI: ffffea003d52e934
> > > [177243.960080][T1995658] RBP: ffffea003d52e934 R08: 0000000000000000
> > > R09: ffffea003d52e937
> > > [177243.960083][T1995658] R10: fffff94007aa5d26 R11: 0000000000000000
> > > R12: ffff88b03ffd9008
> > > [177243.960085][T1995658] R13: 0600000f54ba4b77 R14: 0000000000000001
> > > R15: 0600000f54ba4b01
> > > [177243.960088][T1995658] FS:  0000000000000000(0000)
> > > GS:ffff8897a9b80000(0000) knlGS:0000000000000000
> > > [177244.017687][T1995658] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > [177244.017691][T1995658] CR2: 000000c001503000 CR3: 0000001a1fb38006
> > > CR4: 00000000007726e0
> > > [177244.017693][T1995658] DR0: 0000000000000000 DR1: 0000000000000000
> > > DR2: 0000000000000000
> > > [177244.058771][T1995658] DR3: 0000000000000000 DR6: 00000000fffe0ff0
> > > DR7: 0000000000000400
> > > [177244.058774][T1995658] PKRU: 00000000
> > > [177244.058776][T1995658] Call Trace:
> > > [177244.058780][T1995658]  <TASK>
> > > [177244.058782][T1995658]  kvm_set_pfn_dirty+0x120/0x1d0 [kvm]
> > > [177244.111155][T1995658]  __handle_changed_spte+0x92e/0xca0 [kvm]
> > > [177244.111274][T1995658]  ? alloc_tdp_mmu_page+0x370/0x370 [kvm]
> > > [177244.133938][T1995658]  ? sched_clock_cpu+0x15/0x190
> > > [177244.144289][T1995658]  ? _raw_spin_lock+0xc8/0xd0
> > > [177244.144299][T1995658]  __handle_changed_spte+0x63c/0xca0 [kvm]
> > > [177244.165796][T1995658]  ? alloc_tdp_mmu_page+0x370/0x370 [kvm]
> > > [177244.165906][T1995658]  ? kvm_vcpu_release+0x4d/0x70 [kvm]
> > > [177244.187769][T1995658]  ? deref_stack_reg+0xe6/0x160
> > > [177244.187779][T1995658]  __handle_changed_spte+0x63c/0xca0 [kvm]
> > > [177244.209044][T1995658]  ? alloc_tdp_mmu_page+0x370/0x370 [kvm]
> > > [177244.220140][T1995658]  ? update_curr+0x18d/0x5f0
> > > [177244.220148][T1995658]  __handle_changed_spte+0x63c/0xca0 [kvm]
> > > [177244.240784][T1995658]  ? alloc_tdp_mmu_page+0x370/0x370 [kvm]
> > > [177244.251827][T1995658]  zap_gfn_range+0x549/0x620 [kvm]
> > > [177244.251922][T1995658]  ? zap_collapsible_spte_range+0x520/0x520 [kvm]
> > > [177244.273410][T1995658]  ? smp_call_function_single+0x271/0x370
> > > [177244.283892][T1995658]  ? _raw_spin_lock+0x81/0xd0
> > > [177244.283900][T1995658]  ? _raw_spin_lock_bh+0xe0/0xe0
> > > [177244.283904][T1995658]  ? vmx_vcpu_pi_load+0x14c/0x3b0 [kvm_intel]
> > > [177244.313033][T1995658]  kvm_tdp_mmu_put_root+0x1b6/0x270 [kvm]
> > > [177244.323097][T1995658]  mmu_free_root_page+0x219/0x2c0 [kvm]
> > > [177244.332862][T1995658]  ? ept_invlpg+0x780/0x780 [kvm]
> > > [177244.341982][T1995658]  ? _raw_spin_lock+0xd0/0xd0
> > > [177244.350556][T1995658]  ? handle_pause+0x250/0x250 [kvm_intel]
> > > [177244.360080][T1995658]  kvm_mmu_free_roots+0x1b4/0x4e0 [kvm]
> > > [177244.369396][T1995658]  ? mmu_free_root_page+0x2c0/0x2c0 [kvm]
> > > [177244.378777][T1995658]  ?
> > > kvm_clear_async_pf_completion_queue+0x2f/0x510 [kvm]
> > > [177244.389571][T1995658]  kvm_mmu_unload+0x1c/0xa0 [kvm]
> > > [177244.398156][T1995658]  kvm_arch_destroy_vm+0x1f2/0x5c0 [kvm]
> > > [177244.398238][T1995658]  ? mmu_notifier_unregister+0x26f/0x330
> > > [177244.416160][T1995658]  kvm_put_kvm+0x3b1/0x8b0 [kvm]
> > > [177244.424372][T1995658]  kvm_vcpu_release+0x4e/0x70 [kvm]
> > > [177244.432743][T1995658]  __fput+0x1f7/0x8c0
> > > [177244.432749][T1995658]  task_work_run+0xf8/0x1a0
> > > [177244.447257][T1995658]  do_exit+0x97b/0x2230
> > > [177244.447263][T1995658]  ? _raw_write_lock_irqsave+0xe0/0xe0
> > > [177244.462773][T1995658]  ? mm_update_next_owner+0x750/0x750
> > > [177244.471117][T1995658]  ? _raw_spin_lock_irq+0x82/0xd0
> > > [177244.471122][T1995658]  do_group_exit+0xda/0x2a0
> > > [177244.471126][T1995658]  get_signal+0x3be/0x1e50
> > > [177244.471133][T1995658]  arch_do_signal_or_restart+0x244/0x17f0
> > > [177244.502424][T1995658]  ? audit_log_exit+0x2690/0x2690
> > > [177244.502432][T1995658]  ? shmem_evict_inode+0xad0/0xad0
> > > [177244.518403][T1995658]  ? get_sigframe_size+0x10/0x10
> > > [177244.526157][T1995658]  ? __seccomp_filter+0x117/0xd60
> > > [177244.526162][T1995658]  ? audit_alloc_name+0x440/0x440
> > > [177244.526166][T1995658]  ? get_nth_filter.part.0+0x220/0x220
> > > [177244.526170][T1995658]  ? __audit_syscall_exit+0x794/0xa80
> > > [177244.558471][T1995658]  exit_to_user_mode_prepare+0xcb/0x120
> > > [177244.558478][T1995658]  syscall_exit_to_user_mode+0x1d/0x40
> > > [177244.575193][T1995658]  do_syscall_64+0x4d/0x90
> > > [177244.575197][T1995658]  entry_SYSCALL_64_after_hwframe+0x44/0xae
> > > [177244.591189][T1995658] RIP: 0033:0x4890ca
> > > [177244.591199][T1995658] Code: Unable to access opcode bytes at RIP 0x4890a0.
> > > [177244.591201][T1995658] RSP: 002b:000000c000508608 EFLAGS: 00000216
> > > ORIG_RAX: 000000000000011d
> > > [177244.607705][T1995658] RAX: 0000000000000000 RBX: 000000c00003e000
> > > RCX: 00000000004890ca
> > > [177244.630001][T1995658] RDX: 0000000036466000 RSI: 0000000000000003
> > > RDI: 0000000000000010
> > > [177244.630003][T1995658] RBP: 000000c000508660 R08: 0000000000000000
> > > R09: 0000000000000000
> > > [177244.630005][T1995658] R10: 000000000676b000 R11: 0000000000000216
> > > R12: 0000000000000009
> > > [177244.630007][T1995658] R13: 000000c00129d8b8 R14: 0000000000000001
> > > R15: 0000000000000000
> > > [177244.630011][T1995658]  </TASK>
> > > [177244.679411][T1995658] ---[ end trace bf693a2532f213e4 ]---
> > >
> > > Then immediately this KASAN warning:
> > >
> > > [177244.798046][T1995658] BUG: KASAN: slab-out-of-bounds in
> > > workingset_activation+0x2b2/0x2f0
> > > [177244.809161][T1995658] Read of size 8 at addr ffff8881749ab3b8 by
> > > task exe/1995658
> > > [177244.819636][T1995658]
> > > [177244.824947][T1995658] CPU: 7 PID: 1995658 Comm: exe Tainted: G
> > >     W  O      5.15.5-cloudflare-kasan-2021.11.11 #1
> > > [177244.838672][T1995658] Hardware name: Quanta Cloud Technology Inc.
> > > QuantaPlex T42S-2U/T42S-2U MB (Lewisburg-4), BIOS 3B16.Q102 02/19/2020
> > > [177244.857210][T1995658] Call Trace:
> > > [177244.863733][T1995658]  <TASK>
> > > [177244.869871][T1995658]  dump_stack_lvl+0x34/0x44
> > > [177244.877583][T1995658]  print_address_description.constprop.0+0x1f/0x140
> > > [177244.887430][T1995658]  ? workingset_activation+0x2b2/0x2f0
> > > [177244.896156][T1995658]  ? workingset_activation+0x2b2/0x2f0
> > > [177244.904809][T1995658]  kasan_report.cold+0x83/0xdf
> > > [177244.912778][T1995658]  ? kvm_is_zone_device_pfn.part.0+0x40/0xd0 [kvm]
> > > [177244.922553][T1995658]  ? workingset_activation+0x2b2/0x2f0
> > > [177244.931236][T1995658]  workingset_activation+0x2b2/0x2f0
> > > [177244.939785][T1995658]  mark_page_accessed+0x44a/0x6a0
> > > [177244.948022][T1995658]  __handle_changed_spte+0x64c/0xca0 [kvm]
> > > [177244.957192][T1995658]  ? alloc_tdp_mmu_page+0x370/0x370 [kvm]
> > > [177244.966249][T1995658]  ? kvm_vcpu_release+0x4d/0x70 [kvm]
> > > [177244.974930][T1995658]  ? deref_stack_reg+0xe6/0x160
> > > [177244.983013][T1995658]  __handle_changed_spte+0x63c/0xca0 [kvm]
> > > [177244.992167][T1995658]  ? alloc_tdp_mmu_page+0x370/0x370 [kvm]
> > > [177245.001249][T1995658]  ? update_curr+0x18d/0x5f0
> > > [177245.009268][T1995658]  __handle_changed_spte+0x63c/0xca0 [kvm]
> > > [177245.018454][T1995658]  ? alloc_tdp_mmu_page+0x370/0x370 [kvm]
> > > [177245.027556][T1995658]  zap_gfn_range+0x549/0x620 [kvm]
> > > [177245.036055][T1995658]  ? zap_collapsible_spte_range+0x520/0x520 [kvm]
> > > [177245.045907][T1995658]  ? smp_call_function_single+0x271/0x370
> > > [177245.054992][T1995658]  ? _raw_spin_lock+0x81/0xd0
> > > [177245.063045][T1995658]  ? _raw_spin_lock_bh+0xe0/0xe0
> > > [177245.071332][T1995658]  ? vmx_vcpu_pi_load+0x14c/0x3b0 [kvm_intel]
> > > [177245.080861][T1995658]  kvm_tdp_mmu_put_root+0x1b6/0x270 [kvm]
> > > [177245.090147][T1995658]  mmu_free_root_page+0x219/0x2c0 [kvm]
> > > [177245.099215][T1995658]  ? ept_invlpg+0x780/0x780 [kvm]
> > > [177245.107719][T1995658]  ? _raw_spin_lock+0xd0/0xd0
> > > [177245.115776][T1995658]  ? handle_pause+0x250/0x250 [kvm_intel]
> > > [177245.124919][T1995658]  kvm_mmu_free_roots+0x1b4/0x4e0 [kvm]
> > > [177245.133974][T1995658]  ? mmu_free_root_page+0x2c0/0x2c0 [kvm]
> > > [177245.143143][T1995658]  ?
> > > kvm_clear_async_pf_completion_queue+0x2f/0x510 [kvm]
> > > [177245.153754][T1995658]  kvm_mmu_unload+0x1c/0xa0 [kvm]
> > > [177245.162289][T1995658]  kvm_arch_destroy_vm+0x1f2/0x5c0 [kvm]
> > > [177245.171426][T1995658]  ? mmu_notifier_unregister+0x26f/0x330
> > > [177245.180637][T1995658]  kvm_put_kvm+0x3b1/0x8b0 [kvm]
> > > [177245.189087][T1995658]  kvm_vcpu_release+0x4e/0x70 [kvm]
> > > [177245.197744][T1995658]  __fput+0x1f7/0x8c0
> > > [177245.205096][T1995658]  task_work_run+0xf8/0x1a0
> > > [177245.212916][T1995658]  do_exit+0x97b/0x2230
> > > [177245.220332][T1995658]  ? _raw_write_lock_irqsave+0xe0/0xe0
> > > [177245.229016][T1995658]  ? mm_update_next_owner+0x750/0x750
> > > [177245.237583][T1995658]  ? _raw_spin_lock_irq+0x82/0xd0
> > > [177245.245731][T1995658]  do_group_exit+0xda/0x2a0
> > > [177245.253265][T1995658]  get_signal+0x3be/0x1e50
> > > [177245.260585][T1995658]  arch_do_signal_or_restart+0x244/0x17f0
> > > [177245.269239][T1995658]  ? audit_log_exit+0x2690/0x2690
> > > [177245.277218][T1995658]  ? shmem_evict_inode+0xad0/0xad0
> > > [177245.285295][T1995658]  ? get_sigframe_size+0x10/0x10
> > > [177245.293162][T1995658]  ? __seccomp_filter+0x117/0xd60
> > > [177245.301157][T1995658]  ? audit_alloc_name+0x440/0x440
> > > [177245.309224][T1995658]  ? get_nth_filter.part.0+0x220/0x220
> > > [177245.317678][T1995658]  ? __audit_syscall_exit+0x794/0xa80
> > > [177245.325968][T1995658]  exit_to_user_mode_prepare+0xcb/0x120
> > > [177245.334463][T1995658]  syscall_exit_to_user_mode+0x1d/0x40
> > > [177245.342837][T1995658]  do_syscall_64+0x4d/0x90
> > > [177245.350176][T1995658]  entry_SYSCALL_64_after_hwframe+0x44/0xae
> > > [177245.358991][T1995658] RIP: 0033:0x4890ca
> > > [177245.365778][T1995658] Code: Unable to access opcode bytes at RIP 0x4890a0.
> > > [177245.375596][T1995658] RSP: 002b:000000c000508608 EFLAGS: 00000216
> > > ORIG_RAX: 000000000000011d
> > > [177245.387085][T1995658] RAX: 0000000000000000 RBX: 000000c00003e000
> > > RCX: 00000000004890ca
> > > [177245.387092][T1995658] RDX: 0000000036466000 RSI: 0000000000000003
> > > RDI: 0000000000000010
> > > [177245.387096][T1995658] RBP: 000000c000508660 R08: 0000000000000000
> > > R09: 0000000000000000
> > > [177245.387100][T1995658] R10: 000000000676b000 R11: 0000000000000216
> > > R12: 0000000000000009
> > > [177245.387103][T1995658] R13: 000000c00129d8b8 R14: 0000000000000001
> > > R15: 0000000000000000
> > > [177245.441910][T1995658]  </TASK>
> > > [177245.447826][T1995658]
> > > [177245.447827][T1995658] Allocated by task 182586:
> > > [177245.447830][T1995658]  kasan_save_stack+0x20/0x50
> > > [177245.467811][T1995658]  __kasan_kmalloc+0xa4/0xd0
> > > [177245.475233][T1995658]  fib6_info_alloc+0xa2/0x1d0
> > > [177245.475240][T1995658]  ip6_route_info_create+0x29f/0x1a30
> > > [177245.475243][T1995658]  ip6_route_add+0x18/0x100
> > > [177245.475246][T1995658]  addrconf_add_mroute+0x157/0x1b0
> > > [177245.506016][T1995658]  addrconf_notify+0x6a3/0x1510
> > > [177245.506021][T1995658]  notifier_call_chain+0x9e/0x180
> > > [177245.521427][T1995658]  __dev_notify_flags+0xda/0x230
> > > [177245.529175][T1995658]  rtnl_configure_link+0x125/0x200
> > > [177245.529181][T1995658]  __rtnl_newlink+0xd3d/0x13f0
> > > [177245.529186][T1995658]  rtnl_newlink+0x5f/0x90
> > > [177245.551464][T1995658]  rtnetlink_rcv_msg+0x378/0xa40
> > > [177245.551469][T1995658]  netlink_rcv_skb+0x125/0x380
> > > [177245.551472][T1995658]  netlink_unicast+0x4d0/0x7a0
> > > [177245.573756][T1995658]  netlink_sendmsg+0x724/0xc00
> > > [177245.573760][T1995658]  sock_sendmsg+0xe2/0x110
> > > [177245.573764][T1995658]  __sys_sendto+0x1a8/0x270
> > > [177245.595271][T1995658]  __x64_sys_sendto+0xdd/0x1b0
> > > [177245.602600][T1995658]  do_syscall_64+0x40/0x90
> > > [177245.609544][T1995658]  entry_SYSCALL_64_after_hwframe+0x44/0xae
> > > [177245.609549][T1995658]
> > > [177245.609550][T1995658] The buggy address belongs to the object at
> > > ffff8881749ab200
> > > [177245.609550][T1995658]  which belongs to the cache kmalloc-256 of size 256
> > > [177245.609554][T1995658] The buggy address is located 184 bytes to the right of
> > > [177245.609554][T1995658]  256-byte region [ffff8881749ab200, ffff8881749ab300)
> > > [177245.609558][T1995658] The buggy address belongs to the page:
> > > [177245.609559][T1995658] page:000000009030f8e1 refcount:1 mapcount:0
> > > mapping:0000000000000000 index:0x0 pfn:0x1749a8
> > > [177245.682919][T1995658] head:000000009030f8e1 order:2
> > > compound_mapcount:0 compound_pincount:0
> > > [177245.682924][T1995658] flags:
> > > 0x2ffff800010200(slab|head|node=0|zone=2|lastcpupid=0x1ffff)
> > > [177245.705180][T1995658] raw: 002ffff800010200 dead000000000100
> > > dead000000000122 ffff88810004cb40
> > > [177245.716738][T1995658] raw: 0000000000000000 0000000000200020
> > > 00000001ffffffff 0000000000000000
> > > [177245.716741][T1995658] page dumped because: kasan: bad access detected
> > > [177245.716743][T1995658]
> > > [177245.716744][T1995658] Memory state around the buggy address:
> > > [177245.716747][T1995658]  ffff8881749ab280: 00 00 00 00 00 00 00 00
> > > 00 00 00 00 00 00 00 00
> > > [177245.716749][T1995658]  ffff8881749ab300: fc fc fc fc fc fc fc fc
> > > fc fc fc fc fc fc fc fc
> > > [177245.716752][T1995658] >ffff8881749ab380: fc fc fc fc fc fc fc fc
> > > fc fc fc fc fc fc fc fc
> > > [177245.785224][T1995658]                                         ^
> > > [177245.785229][T1995658]  ffff8881749ab400: 00 00 00 00 00 00 00 00
> > > 00 00 00 00 00 00 00 00
> > > [177245.785231][T1995658]  ffff8881749ab480: 00 00 00 00 00 00 00 00
> > > 00 00 00 00 00 00 fc fc
> > > [177245.816146][T1995658]
> > > ==================================================================
> > >
> > > And after that:
> > >
> > > [177245.816196][T1995658] general protection fault, probably for
> > > non-canonical address 0xdffffc0000000011: 0000 [#1] SMP KASAN PTI
> > > [177245.854054][T1995658] KASAN: null-ptr-deref in range
> > > [0x0000000000000088-0x000000000000008f]
> > > [177245.865836][T1995658] CPU: 7 PID: 1995658 Comm: exe Tainted: G
> > > B   W  O      5.15.5-cloudflare-kasan-2021.11.11 #1
> > > [177245.865842][T1995658] Hardware name: Quanta Cloud Technology Inc.
> > > QuantaPlex T42S-2U/T42S-2U MB (Lewisburg-4), BIOS 3B16.Q102 02/19/2020
> > > [177245.865845][T1995658] RIP: 0010:workingset_activation+0x175/0x2f0
> > > [177245.909096][T1995658] Code: 80 3c 02 00 0f 85 58 01 00 00 4a 8b ac
> > > ed b8 0f 00 00 48 8d bd 88 00 00 00 48 b8 00 00 00 00 00 fc ff df 48
> > > 89 fa 48 c1 ea 03 <80> 3c 02 00 0f 85 0d 01 00 00 4c 3b a5 88 00 00 00
> > > 0f 85 c3 00 00
> > > [177245.936452][T1995658] RSP: 0018:ffff8881ec51f3c0 EFLAGS: 00010206
> > > [177245.936457][T1995658] RAX: dffffc0000000000 RBX: ffffea003237e480
> > > RCX: ffffffffa25388a6
> > > [177245.936460][T1995658] RDX: 0000000000000011 RSI: 0000000000000246
> > > RDI: 0000000000000088
> > > [177245.970834][T1995658] RBP: 0000000000000000 R08: 0000000000000001
> > > R09: ffffffffa6ced907
> > > [177245.970837][T1995658] R10: fffffbfff4d9db20 R11: 0000000000000010
> > > R12: ffff88983ffde000
> > > [177245.970839][T1995658] R13: 0000000000000000 R14: ffff8897a9bbbde0
> > > R15: 0600000c8df93b77
> > > [177246.007059][T1995658] FS:  0000000000000000(0000)
> > > GS:ffff8897a9b80000(0000) knlGS:0000000000000000
> > > [177246.020326][T1995658] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > [177246.020330][T1995658] CR2: 000000c001503000 CR3: 0000001a1fb38006
> > > CR4: 00000000007726e0
> > > [177246.020332][T1995658] DR0: 0000000000000000 DR1: 0000000000000000
> > > DR2: 0000000000000000
> > > [177246.020334][T1995658] DR3: 0000000000000000 DR6: 00000000fffe0ff0
> > > DR7: 0000000000000400
> > > [177246.020337][T1995658] PKRU: 00000000
> > > [177246.020338][T1995658] Call Trace:
> > > [177246.020341][T1995658]  <TASK>
> > > [177246.090639][T1995658]  mark_page_accessed+0x44a/0x6a0
> > > [177246.099947][T1995658]  __handle_changed_spte+0x64c/0xca0 [kvm]
> > > [177246.110177][T1995658]  ? alloc_tdp_mmu_page+0x370/0x370 [kvm]
> > > [177246.110274][T1995658]  ? kvm_vcpu_release+0x4d/0x70 [kvm]
> > > [177246.129880][T1995658]  ? deref_stack_reg+0xe6/0x160
> > > [177246.138925][T1995658]  __handle_changed_spte+0x63c/0xca0 [kvm]
> > > [177246.139020][T1995658]  ? alloc_tdp_mmu_page+0x370/0x370 [kvm]
> > > [177246.158926][T1995658]  ? update_curr+0x18d/0x5f0
> > > [177246.167666][T1995658]  __handle_changed_spte+0x63c/0xca0 [kvm]
> > > [177246.167762][T1995658]  ? alloc_tdp_mmu_page+0x370/0x370 [kvm]
> > > [177246.187453][T1995658]  zap_gfn_range+0x549/0x620 [kvm]
> > > [177246.196673][T1995658]  ? zap_collapsible_spte_range+0x520/0x520 [kvm]
> > > [177246.196781][T1995658]  ? smp_call_function_single+0x271/0x370
> > > [177246.196789][T1995658]  ? _raw_spin_lock+0x81/0xd0
> > > [177246.196795][T1995658]  ? _raw_spin_lock_bh+0xe0/0xe0
> > > [177246.234516][T1995658]  ? vmx_vcpu_pi_load+0x14c/0x3b0 [kvm_intel]
> > > [177246.244565][T1995658]  kvm_tdp_mmu_put_root+0x1b6/0x270 [kvm]
> > > [177246.244680][T1995658]  mmu_free_root_page+0x219/0x2c0 [kvm]
> > > [177246.263832][T1995658]  ? ept_invlpg+0x780/0x780 [kvm]
> > > [177246.263923][T1995658]  ? _raw_spin_lock+0xd0/0xd0
> > > [177246.281421][T1995658]  ? handle_pause+0x250/0x250 [kvm_intel]
> > > [177246.291062][T1995658]  kvm_mmu_free_roots+0x1b4/0x4e0 [kvm]
> > > [177246.291156][T1995658]  ? mmu_free_root_page+0x2c0/0x2c0 [kvm]
> > > [177246.310225][T1995658]  ?
> > > kvm_clear_async_pf_completion_queue+0x2f/0x510 [kvm]
> > > [177246.310297][T1995658]  kvm_mmu_unload+0x1c/0xa0 [kvm]
> > > [177246.330051][T1995658]  kvm_arch_destroy_vm+0x1f2/0x5c0 [kvm]
> > > [177246.339422][T1995658]  ? mmu_notifier_unregister+0x26f/0x330
> > > [177246.339434][T1995658]  kvm_put_kvm+0x3b1/0x8b0 [kvm]
> > > [177246.357081][T1995658]  kvm_vcpu_release+0x4e/0x70 [kvm]
> > > [177246.365745][T1995658]  __fput+0x1f7/0x8c0
> > > [177246.365751][T1995658]  task_work_run+0xf8/0x1a0
> > > [177246.380735][T1995658]  do_exit+0x97b/0x2230
> > > [177246.388056][T1995658]  ? _raw_write_lock_irqsave+0xe0/0xe0
> > > [177246.388062][T1995658]  ? mm_update_next_owner+0x750/0x750
> > > [177246.388067][T1995658]  ? _raw_spin_lock_irq+0x82/0xd0
> > > [177246.413105][T1995658]  do_group_exit+0xda/0x2a0
> > > [177246.420604][T1995658]  get_signal+0x3be/0x1e50
> > > [177246.427936][T1995658]  arch_do_signal_or_restart+0x244/0x17f0
> > > [177246.427943][T1995658]  ? audit_log_exit+0x2690/0x2690
> > > [177246.444593][T1995658]  ? shmem_evict_inode+0xad0/0xad0
> > > [177246.444601][T1995658]  ? get_sigframe_size+0x10/0x10
> > > [177246.460403][T1995658]  ? __seccomp_filter+0x117/0xd60
> > > [177246.468231][T1995658]  ? audit_alloc_name+0x440/0x440
> > > [177246.468236][T1995658]  ? get_nth_filter.part.0+0x220/0x220
> > > [177246.468239][T1995658]  ? __audit_syscall_exit+0x794/0xa80
> > > [177246.468244][T1995658]  exit_to_user_mode_prepare+0xcb/0x120
> > > [177246.486130][T2343516] BUG: Bad page state in process dnsdist  pfn:3ac443
> > > [177246.492473][T1995658]  syscall_exit_to_user_mode+0x1d/0x40
> > > [177246.500835][T2343516] page:000000000df8ed4b refcount:0 mapcount:0
> > > mapping:0000000000000000 index:0x1 pfn:0x3ac443
> > > [177246.510416][T1995658]  do_syscall_64+0x4d/0x90
> > > [177246.518772][T2343516] flags:
> > > 0x2ffff80000000a(referenced|dirty|node=0|zone=2|lastcpupid=0x1ffff)
> > > [177246.532071][T1995658]  entry_SYSCALL_64_after_hwframe+0x44/0xae
> > > [177246.539574][T2343516] raw: 002ffff80000000a dead000000000100
> > > dead000000000122 0000000000000000
> > > [177246.551516][T1995658] RIP: 0033:0x4890ca
> > > [177246.560640][T2343516] raw: 0000000000000001 0000000000000000
> > > 00000000ffffffff 0000000000000000
> > > [177246.572553][T1995658] Code: Unable to access opcode bytes at RIP 0x4890a0.
> > > [177246.572556][T1995658] RSP: 002b:000000c000508608 EFLAGS: 00000216
> > > [177246.579708][T2343516] page dumped because:
> > > PAGE_FLAGS_CHECK_AT_PREP flag(s) set
> > > [177246.591667][T1995658]  ORIG_RAX: 000000000000011d
> > > [177246.601947][T2343516] Modules linked in: xt_hashlimit
> > > [177246.611442][T1995658] RAX: 0000000000000000 RBX: 000000c00003e000
> > > RCX: 00000000004890ca
> > > [177246.622186][T2343516]  xt_connlimit nf_conncount
> > > [177246.630505][T1995658] RDX: 0000000036466000 RSI: 0000000000000003
> > > RDI: 0000000000000010
> > > [177246.639127][T2343516]  ip_set_hash_netport xt_length
> > > [177246.650831][T1995658] RBP: 000000c000508660 R08: 0000000000000000
> > > R09: 0000000000000000
> > > [177246.659146][T2343516]  esp4 sit
> > > [177246.670801][T1995658] R10: 000000000676b000 R11: 0000000000000216
> > > R12: 0000000000000009
> > > [177246.679548][T2343516]  ipip tunnel4
> > > [177246.691327][T1995658] R13: 000000c00129d8b8 R14: 0000000000000001
> > > R15: 0000000000000000
> > > [177246.691332][T1995658]  </TASK>
> > > [177246.698177][T2343516]  nft_numgen
> > > [177246.710054][T1995658] Modules linked in: xt_hashlimit
> > > [177246.717376][T2343516]  nft_ct ip_gre
> > > [177246.729337][T1995658]  xt_connlimit nf_conncount
> > > [177246.736385][T2343516]  gre xfrm_user
> > > [177246.743566][T1995658]  ip_set_hash_netport xt_length
> > > [177246.752511][T2343516]  xfrm_algo tcp_diag
> > > [177246.759923][T1995658]  esp4 sit
> > > [177246.768447][T2343516]  udp_diag inet_diag
> > > [177246.775861][T1995658]  ipip tunnel4
> > > [177246.784702][T2343516]  fou6 fou
> > > [177246.792590][T1995658]  nft_numgen nft_ct
> > > [177246.799633][T2343516]  ip6_tunnel tunnel6
> > > [177246.807521][T1995658]  ip_gre gre
> > > [177246.814891][T2343516]  ip_tunnel
> > > [177246.821906][T1995658]  xfrm_user xfrm_algo
> > > [177246.829831][T2343516]  ip6_udp_tunnel udp_tunnel
> > > [177246.837687][T1995658]  tcp_diag udp_diag
> > > [177246.844899][T2343516]  cls_bpf tls
> > > [177246.851988][T1995658]  inet_diag fou6
> > > [177246.860054][T2343516]  xt_NFLOG
> > > [177246.868476][T1995658]  fou ip6_tunnel
> > > [177246.876205][T2343516]  xt_statistic
> > > [177246.883435][T1995658]  tunnel6 ip_tunnel
> > > [177246.890967][T2343516]  nft_compat
> > > [177246.897928][T1995658]  ip6_udp_tunnel udp_tunnel
> > > [177246.905366][T2343516]  veth
> > > [177246.912707][T1995658]  cls_bpf tls
> > > [177246.920424][T2343516]  tun overlay
> > > [177246.927451][T1995658]  xt_NFLOG xt_statistic
> > > [177246.935692][T2343516]  macvlan
> > > [177246.941990][T1995658]  nft_compat veth
> > > [177246.948848][T2343516]  sch_ingress raid0
> > > [177246.955617][T1995658]  tun overlay
> > > [177246.963226][T2343516]  md_mod essiv
> > > [177246.969540][T1995658]  macvlan sch_ingress
> > > [177246.976443][T2343516]  dm_crypt trusted
> > > [177246.983523][T1995658]  raid0 md_mod
> > > [177246.990026][T2343516]  asn1_encoder tee
> > > [177246.996500][T1995658]  essiv dm_crypt
> > > [177247.003663][T2343516]  dm_mod dax
> > > [177247.010435][T1995658]  trusted asn1_encoder
> > > [177247.016794][T2343516]  nfnetlink_log nft_log
> > > [177247.023445][T1995658]  tee dm_mod
> > > [177247.029892][T2343516]  nft_limit
> > > [177247.035861][T1995658]  dax nfnetlink_log
> > > [177247.042663][T2343516]  nft_counter nf_tables
> > > [177247.049640][T1995658]  nft_log
> > > [177247.055470][T2343516]  ip6table_nat
> > > [177247.061129][T1995658]  nft_limit nft_counter
> > > [177247.067401][T2343516]  ip6table_mangle
> > > [177247.073987][T1995658]  nf_tables ip6table_nat
> > > [177247.079498][T2343516]  ip6table_security ip6table_raw
> > > [177247.085277][T1995658]  ip6table_mangle
> > > [177247.091991][T2343516]  ip6table_filter ip6_tables
> > > [177247.098034][T1995658]  ip6table_security ip6table_raw
> > > [177247.104695][T2343516]  xt_nat iptable_nat
> > > [177247.112034][T1995658]  ip6table_filter ip6_tables
> > > [177247.118039][T2343516]  nf_nat xt_TCPMSS
> > > [177247.125004][T1995658]  xt_nat iptable_nat
> > > [177247.132376][T2343516]  xt_u32
> > > [177247.138705][T1995658]  nf_nat xt_TCPMSS xt_u32
> > > [177247.145715][T2343516]  xt_connmark
> > > [177247.151843][T1995658]  xt_connmark iptable_mangle xt_owner
> > > [177247.158146][T2343516]  iptable_mangle
> > > [177247.163422][T1995658]  xt_CT iptable_raw
> > > [177247.170185][T2343516]  xt_owner xt_CT
> > > [177247.175921][T1995658]  xt_state xt_bpf
> > > [177247.183751][T2343516]  iptable_raw xt_state
> > > [177247.189765][T1995658]  xt_mark xt_conntrack
> > > [177247.196014][T2343516]  xt_bpf xt_mark
> > > [177247.202024][T1995658]  xt_multiport xt_comment xt_tcpudp
> > > [177247.208091][T2343516]  xt_conntrack
> > > [177247.214620][T1995658]  xt_set xt_tcpmss iptable_filter
> > > [177247.221195][T2343516]  xt_multiport
> > > [177247.227181][T1995658]  ip_set_hash_net
> > > [177247.234862][T2343516]  xt_comment xt_tcpudp
> > > [177247.240694][T1995658]  ip_set_hash_ip ip_set
> > > [177247.248244][T2343516]  xt_set xt_tcpmss
> > > [177247.254108][T1995658]  nfnetlink sch_fq
> > > [177247.260257][T2343516]  iptable_filter ip_set_hash_net
> > > [177247.266823][T1995658]  tcp_bbr nf_conntrack
> > > [177247.273564][T2343516]  ip_set_hash_ip ip_set
> > > [177247.279798][T1995658]  nf_defrag_ipv6 nf_defrag_ipv4
> > > [177247.286054][T2343516]  nfnetlink sch_fq
> > > [177247.293537][T1995658]  8021q garp
> > > [177247.300146][T2343516]  tcp_bbr nf_conntrack
> > > [177247.306851][T1995658]  stp mrp
> > > [177247.314280][T2343516]  nf_defrag_ipv6
> > > [177247.320726][T1995658]  llc skx_edac
> > > [177247.326501][T2343516]  nf_defrag_ipv4
> > > [177247.333200][T1995658]  x86_pkg_temp_thermal
> > > [177247.338746][T2343516]  8021q garp
> > > [177247.344900][T1995658]  kvm_intel kvm
> > > [177247.350878][T2343516]  stp mrp
> > > [177247.356981][T1995658]  irqbypass crc32_pclmul
> > > [177247.363642][T2343516]  llc skx_edac
> > > [177247.369407][T1995658]  crc32c_intel aesni_intel
> > > [177247.375433][T2343516]  x86_pkg_temp_thermal kvm_intel
> > > [177247.380949][T1995658]  rapl intel_cstate
> > > [177247.387818][T2343516]  kvm irqbypass
> > > [177247.393792][T1995658]  ipmi_ssif sfc
> > > [177247.400823][T2343516]  crc32_pclmul crc32c_intel
> > > [177247.408402][T1995658]  intel_uncore
> > > [177247.414816][T2343516]  aesni_intel rapl
> > > [177247.420895][T1995658]  i2c_i801 xhci_pci
> > > [177247.426998][T2343516]  intel_cstate
> > > [177247.434174][T1995658]  i2c_smbus acpi_ipmi
> > > [177247.440133][T2343516]  ipmi_ssif
> > > [177247.446447][T1995658]  i40e mdio
> > > [177247.452828][T2343516]  sfc intel_uncore
> > > [177247.458764][T1995658]  ioatdma i2c_core xhci_hcd tpm_crb
> > > [177247.465315][T2343516]  i2c_i801
> > > [177247.470986][T1995658]  dca ipmi_si
> > > [177247.476630][T2343516]  xhci_pci i2c_smbus
> > > [177247.482924][T1995658]  ipmi_devintf ipmi_msghandler
> > > [177247.490884][T2343516]  acpi_ipmi
> > > [177247.496448][T1995658]  tpm_tis tpm_tis_core
> > > [177247.502266][T2343516]  i40e
> > > [177247.508870][T1995658]  tpm tiny_power_button
> > > [177247.516227][T2343516]  mdio
> > > [177247.521960][T1995658]  button fuse
> > > [177247.528752][T2343516]  ioatdma i2c_core
> > > [177247.534017][T1995658]  efivarfs
> > > [177247.540736][T2343516]  xhci_hcd tpm_crb
> > > [177247.545945][T1995658]  ip_tables x_tables
> > > [177247.551808][T2343516]  dca ipmi_si
> > > [177247.558097][T1995658]  bcmcrypt(O)
> > > [177247.563798][T2343516]  ipmi_devintf ipmi_msghandler
> > > [177247.570085][T1995658]  crypto_simd
> > > [177247.576506][T2343516]  tpm_tis tpm_tis_core
> > > [177247.582311][T1995658]  cryptd [last unloaded: kheaders]
> > > [177247.582349][T1995658] ---[ end trace bf693a2532f213e5 ]---
> > > [177247.588090][T2343516]  tpm tiny_power_button button fuse efivarfs ip_tables
> > > [177247.671277][T1995658] RIP: 0010:workingset_activation+0x175/0x2f0
> > > [177247.673366][T2343516]  x_tables bcmcrypt(O) crypto_simd
> > > [177247.682978][T1995658] Code: 80 3c 02 00 0f 85 58 01 00 00 4a 8b ac
> > > ed b8 0f 00 00 48 8d bd 88 00 00 00 48 b8 00 00 00 00 00 fc ff df 48
> > > 89 fa 48 c1 ea 03 <80> 3c 02 00 0f 85 0d 01 00 00 4c 3b a5 88 00 00 00
> > > 0f 85 c3 00 00
> > > [177247.691625][T2343516]  cryptd [last unloaded: kheaders]
> > > [177247.691633][T2343516] CPU: 30 PID: 2343516 Comm: dnsdist Tainted:
> > > G    B D W  O      5.15.5-cloudflare-kasan-2021.11.11 #1
> > > [177247.691639][T2343516] Hardware name: Quanta Cloud Technology Inc.
> > > QuantaPlex T42S-2U/T42S-2U MB (Lewisburg-4), BIOS 3B16.Q102 02/19/2020
> > > [177247.691642][T2343516] Call Trace:
> > > [177247.691645][T2343516]  <TASK>
> > > [177247.699575][T1995658] RSP: 0018:ffff8881ec51f3c0 EFLAGS: 00010206
> > > [177247.724584][T2343516]  dump_stack_lvl+0x34/0x44
> > > [177247.724595][T2343516]  bad_page.cold+0xc0/0xe1
> > > [177247.732606][T1995658]
> > > [177247.746616][T2343516]  rmqueue_bulk+0x8e5/0xe00
> > > [177247.746627][T2343516]  ? find_suitable_fallback+0x470/0x470
> > > [177247.764885][T1995658] RAX: dffffc0000000000 RBX: ffffea003237e480
> > > RCX: ffffffffa25388a6
> > > [177247.771268][T2343516]  get_page_from_freelist+0x18ff/0x2920
> > > [177247.771279][T2343516]  ? __zone_watermark_ok+0x340/0x340
> > > [177247.777285][T1995658] RDX: 0000000000000011 RSI: 0000000000000246
> > > RDI: 0000000000000088
> > > [177247.786537][T2343516]  __alloc_pages+0x2ac/0x5b0
> > > [177247.786543][T2343516]  ? __alloc_pages_slowpath.constprop.0+0x1e20/0x1e20
> > > [177247.786548][T2343516]  alloc_pages_vma+0xbc/0x570
> > > [177247.794244][T1995658] RBP: 0000000000000000 R08: 0000000000000001
> > > R09: ffffffffa6ced907
> > > [177247.801816][T2343516]  __handle_mm_fault+0x14d6/0x3a00
> > > [177247.801822][T2343516]  ? vm_iomap_memory+0x1d0/0x1d0
> > > [177247.801827][T2343516]  ? down_read_trylock+0xeb/0x180
> > > [177247.807247][T1995658] R10: fffffbfff4d9db20 R11: 0000000000000010
> > > R12: ffff88983ffde000
> > > [177247.814873][T2343516]  handle_mm_fault+0x1cc/0x650
> > > [177247.814878][T2343516]  do_user_addr_fault+0x303/0xd40
> > > [177247.814885][T2343516]  exc_page_fault+0x52/0xb0
> > > [177247.823577][T1995658] R13: 0000000000000000 R14: ffff8897a9bbbde0
> > > R15: 0600000c8df93b77
> > > [177247.834904][T2343516]  ? asm_exc_page_fault+0x8/0x30
> > > [177247.834911][T2343516]  asm_exc_page_fault+0x1e/0x30
> > > [177247.834915][T2343516] RIP: 0033:0x557a8ea64f19
> > > [177247.843738][T1995658] FS:  0000000000000000(0000)
> > > GS:ffff8897a9b80000(0000) knlGS:0000000000000000
> > > [177247.852290][T2343516] Code: c7 80 04 ff ff ff 00 00 00 00 c7 80 18
> > > ff ff ff 00 00 00 00 48 c7 80 20 ff ff ff 00 00 00 00 48 c7 80 28 ff
> > > ff ff 00 00 00 00 <c6> 80 30 ff ff ff 01 c6 80 38 ff ff ff 01 c6 80 39
> > > ff ff ff 00 48
> > > [177247.852295][T2343516] RSP: 002b:00007fff6f3f11c0 EFLAGS: 00010246
> > > [177247.852299][T2343516] RAX: 0000557a9840d0d0 RBX: 0000557a92334cc0
> > > RCX: 0000000000000000
> > > [177247.852301][T2343516] RDX: ffffffffffffffff RSI: 0000000000000000
> > > RDI: 0000000000000000
> > > [177247.852304][T2343516] RBP: 00000000000099d4 R08: 0000000000000000
> > > R09: 0000000000000000
> > > [177247.863709][T1995658] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > [177247.871673][T2343516] R10: 0000000000000000 R11: 0000000000000000
> > > R12: 0000557a991452f0
> > > [177247.871676][T2343516] R13: 00007fff6f3f1420 R14: 00007fff6f3f1440
> > > R15: 0000000000000001
> > > [177247.871680][T2343516]  </TASK>
> > >
> > > After this the machine starts spitting some traces starting with:
> > >
> > > [177247.871683][T2343516] BUG: Bad page state in process <some comm
> > > name>  pfn:fe680a
> > >
> > > And eventually gradually locks up:
> > >
> > > NMI watchdog: Watchdog detected hard LOCKUP on cpu 81
> > >
> > > The comment in kvm_main.c before the code mentioned in the first
> > > warning states that the warning is there to indicate incorrect usage
> > > of the function - and probably it is, given the consequences.
> > >
> > > About our workload: this bug is most likely triggered by gvisor [1]
> > > with the KVM backend as we don't have any other KVM users on these
> > > systems.
> > >
> > > We suspect it was not triggered before as kernels before 5.15 did not
> > > have TDP MMU enabled by default [2].
> > >
> > > It seems we even want to remove this warning as overaggressive [3],
> > > however it is indicative in this case.
> > >
> > > Unfortunately, I couldn't easily reproduce the issue synthetically
> > > (tried both running the KVM selftests as well as gvisor KVM tests).
> > > Any help/pointers would be appreciated.
> > >
> > > [1]: https://github.com/google/gvisor
> > > [2]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=71ba3f3189c78f756a659568fb473600fd78f207
> > > [3]: https://lore.kernel.org/kvm/20211129034317.2964790-5-stevensd@google.com/
> > >
> >

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Potential bug in TDP MMU
  2021-11-30 10:58   ` Ignat Korchagin
  2021-11-30 10:59     ` Ignat Korchagin
@ 2021-11-30 11:11     ` Paolo Bonzini
  2021-11-30 11:19       ` Ignat Korchagin
  2021-11-30 20:23     ` Sean Christopherson
  2 siblings, 1 reply; 20+ messages in thread
From: Paolo Bonzini @ 2021-11-30 11:11 UTC (permalink / raw)
  To: Ignat Korchagin, kvm; +Cc: stevensd, kernel-team

On 11/30/21 11:58, Ignat Korchagin wrote:
> I have managed to reliably reproduce the issue on a QEMU VM (on a host
> with nested virtualisation enabled). Here are the steps:
> 
> 1. Install gvisor as per
> https://gvisor.dev/docs/user_guide/install/#install-latest
> 2. Run
> $ for i in $(seq 1 100); do sudo runsc --platform=kvm --network=none
> do echo ok; done
> 
> I've tried to recompile the kernel with the above patch, but
> unfortunately it does not fix the issue. I'm happy to try other
> patches/fixes queued for 5.16-rc4

You can find them already in the "for-linus" tag of kvm.git as well as 
in the master branch, but there isn't much else.

Paolo
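
When running the reproduction loop quoted above, checking the kernel log after each iteration shows which run triggered the problem. The following is a minimal sketch, not part of the original report: the function name find_symptoms is hypothetical, and the two patterns are simply the symptom strings quoted in this thread.

```python
import re

# Patterns taken from the symptoms reported in this thread (an assumption:
# other kernels or bugs may print different messages).
SYMPTOMS = (
    re.compile(r"WARNING: CPU: \d+ PID: \d+ at"),
    re.compile(r"BUG: Bad page state in process"),
)


def find_symptoms(log_text):
    """Return the kernel log lines that match any of the known symptoms."""
    return [
        line
        for line in log_text.splitlines()
        if any(pattern.search(line) for pattern in SYMPTOMS)
    ]
```

One way to use it is to pass in the text printed by dmesg after each runsc invocation and stop the loop as soon as the returned list is non-empty.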

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Potential bug in TDP MMU
  2021-11-30 11:11     ` Paolo Bonzini
@ 2021-11-30 11:19       ` Ignat Korchagin
  2021-11-30 11:43         ` Ignat Korchagin
  0 siblings, 1 reply; 20+ messages in thread
From: Ignat Korchagin @ 2021-11-30 11:19 UTC (permalink / raw)
  To: Paolo Bonzini, kvm; +Cc: stevensd, kernel-team

On Tue, Nov 30, 2021 at 11:11 AM Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> On 11/30/21 11:58, Ignat Korchagin wrote:
> > I have managed to reliably reproduce the issue on a QEMU VM (on a host
> > with nested virtualisation enabled). Here are the steps:
> >
> > 1. Install gvisor as per
> > https://gvisor.dev/docs/user_guide/install/#install-latest
> > 2. Run
> > $ for i in $(seq 1 100); do sudo runsc --platform=kvm --network=none
> > do echo ok; done
> >
> > I've tried to recompile the kernel with the above patch, but
> > unfortunately it does not fix the issue. I'm happy to try other
> > patches/fixes queued for 5.16-rc4
>
> You can find them already in the "for-linus" tag of kvm.git as well as
> in the master branch, but there isn't much else.
>
> Paolo

Thanks. I've tried to compile the kernel from kvm.git "for-linus" tag,
but the issue is still there, so probably no commits address the
problem.
Will keep digging.

Ignat

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Potential bug in TDP MMU
  2021-11-30 11:19       ` Ignat Korchagin
@ 2021-11-30 11:43         ` Ignat Korchagin
  2021-11-30 11:49           ` Paolo Bonzini
  2021-11-30 12:13           ` Paolo Bonzini
  0 siblings, 2 replies; 20+ messages in thread
From: Ignat Korchagin @ 2021-11-30 11:43 UTC (permalink / raw)
  To: Paolo Bonzini, kvm; +Cc: stevensd, kernel-team

On Tue, Nov 30, 2021 at 11:19 AM Ignat Korchagin <ignat@cloudflare.com> wrote:
>
> On Tue, Nov 30, 2021 at 11:11 AM Paolo Bonzini <pbonzini@redhat.com> wrote:
> >
> > On 11/30/21 11:58, Ignat Korchagin wrote:
> > > I have managed to reliably reproduce the issue on a QEMU VM (on a host
> > > with nested virtualisation enabled). Here are the steps:
> > >
> > > 1. Install gvisor as per
> > > https://gvisor.dev/docs/user_guide/install/#install-latest
> > > 2. Run
> > > $ for i in $(seq 1 100); do sudo runsc --platform=kvm --network=none
> > > do echo ok; done
> > >
> > > I've tried to recompile the kernel with the above patch, but
> > > unfortunately it does not fix the issue. I'm happy to try other
> > > patches/fixes queued for 5.16-rc4
> >
> > You can find them already in the "for-linus" tag of kvm.git as well as
> > in the master branch, but there isn't much else.
> >
> > Paolo
>
> Thanks. I've tried to compile the kernel from kvm.git "for-linus" tag,
> but the issue is still there, so probably no commits address the
> problem.
> Will keep digging.
>
> Ignat

I have also noticed another new warning, when running this on the
kernel from kvm.git branch:

[   70.284354][ T2928] WARNING: CPU: 4 PID: 2928 at
arch/x86/kvm/x86.c:9886 kvm_arch_vcpu_ioctl_run+0x126c/0x17d0
[   70.284354][ T2928] Modules linked in:
[   70.284354][ T2928] CPU: 4 PID: 2928 Comm: exe Not tainted 5.16.0-rc2 #2
[   70.284354][ T2928] Hardware name: QEMU Standard PC (Q35 + ICH9,
2009), BIOS 0.0.0 02/06/2015
[   70.284354][ T2928] RIP: 0010:kvm_arch_vcpu_ioctl_run+0x126c/0x17d0
[   70.284354][ T2928] Code: 49 89 b7 f8 01 00 00 e9 8e ee ff ff 49 8b
87 80 00 00 00 45 31 e4 c7 40 08 07 00 00 00 49 83 87 b8 20 00 00 01
e9 35 f2 ff ff <0f> 0b 4c 89 ff e8 ea 72 03 00 83 f8 01 41 89 c4 0f 85
47 f9 ff ff
[   70.284354][ T2928] RSP: 0018:ffffb09fc0653d60 EFLAGS: 00010002
[   70.284354][ T2928] RAX: 0000000000000000 RBX: 0000000000000000
RCX: ffff9d9083929cc0
[   70.284354][ T2928] RDX: ffff9d9083929c01 RSI: ffffffff92f2e509
RDI: ffffffff92e8010e
[   70.284354][ T2928] RBP: ffffb09fc0653df0 R08: 0000000000000000
R09: ffffb09fc052c340
[   70.284354][ T2928] R10: ffff9d91fffde000 R11: 0000000000034800
R12: 0000000000000000
[   70.284354][ T2928] R13: ffffb09fc052c440 R14: ffff9d90839fc038
R15: ffff9d90839fc000
[   70.284354][ T2928] FS:  0000000001cc6c30(0000)
GS:ffff9d91f7d00000(0000) knlGS:0000000000000000
[   70.284354][ T2928] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   70.284354][ T2928] CR2: 000000c000316000 CR3: 0000000102b4c006
CR4: 0000000000172ee0
[   70.284354][ T2928] Call Trace:
[   70.284354][ T2928]  <TASK>
[   70.284354][ T2928]  ? memcg_slab_free_hook+0xcc/0x190
[   70.284354][ T2928]  ? kmem_cache_free+0x264/0x2b0
[   70.284354][ T2928]  kvm_vcpu_ioctl+0x274/0x680
[   70.284354][ T2928]  ? _raw_spin_lock_irq+0x14/0x2f
[   70.284354][ T2928]  ? _raw_spin_unlock_irq+0x13/0x30
[   70.284354][ T2928]  ? signal_setup_done+0xe9/0x160
[   70.284354][ T2928]  ? fpregs_mark_activate+0x32/0x90
[   70.284354][ T2928]  ? arch_do_signal_or_restart+0x525/0x6b0
[   70.284354][ T2928]  __x64_sys_ioctl+0x40a/0x950
[   70.284354][ T2928]  do_syscall_64+0x3b/0x90
[   70.284354][ T2928]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[   70.284354][ T2928] RIP: 0033:0x489516
[   70.284354][ T2928] Code: cc cc cc cc cc cc cc cc cc cc cc cc cc cc
cc cc cc cc cc cc 48 8b 7c 24 10 48 8b 74 24 18 48 8b 54 24 20 48 8b
44 24 08 0f 05 <48> 3d 01 f0 ff ff 76 1b 48 c7 44 24 28 ff ff ff ff 48
c7 44 24 30
[   70.284354][ T2928] RSP: 002b:000000c000009a10 EFLAGS: 00000246
ORIG_RAX: 0000000000000010
[   70.284354][ T2928] RAX: ffffffffffffffda RBX: 000000c0002fa480
RCX: 0000000000489516
[   70.284354][ T2928] RDX: 0000000000000000 RSI: 000000000000ae80
RDI: 0000000000000008
[   70.284354][ T2928] RBP: 000000c000009aa0 R08: 0000000000000001
R09: 0000000000000000
[   70.284354][ T2928] R10: 0000000000000000 R11: 0000000000000246
R12: 0000000000000000
[   70.639977][ T2928] R13: 0000000000000000 R14: 000000000142fb48
R15: 0000000000000000
[   70.639977][ T2928]  </TASK>
[   70.639977][ T2928] ---[ end trace a3a88c91ba4a4df8 ]---

Ignat

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Potential bug in TDP MMU
  2021-11-30 11:43         ` Ignat Korchagin
@ 2021-11-30 11:49           ` Paolo Bonzini
  2021-11-30 12:13           ` Paolo Bonzini
  1 sibling, 0 replies; 20+ messages in thread
From: Paolo Bonzini @ 2021-11-30 11:49 UTC (permalink / raw)
  To: Ignat Korchagin, kvm; +Cc: stevensd, kernel-team

On 11/30/21 12:43, Ignat Korchagin wrote:
> I have also noticed another new warning, when running this on the
> kernel from kvm.git branch:
> 
> [   70.284354][ T2928] WARNING: CPU: 4 PID: 2928 at
> arch/x86/kvm/x86.c:9886 kvm_arch_vcpu_ioctl_run+0x126c/0x17d0
> [   70.284354][ T2928] Modules linked in:
> [   70.284354][ T2928] CPU: 4 PID: 2928 Comm: exe Not tainted 5.16.0-rc2 #2
> [   70.284354][ T2928] Hardware name: QEMU Standard PC (Q35 + ICH9,
> 2009), BIOS 0.0.0 02/06/2015
> [   70.284354][ T2928] RIP: 0010:kvm_arch_vcpu_ioctl_run+0x126c/0x17d0
> [   70.284354][ T2928] Code: 49 89 b7 f8 01 00 00 e9 8e ee ff ff 49 8b
> 87 80 00 00 00 45 31 e4 c7 40 08 07 00 00 00 49 83 87 b8 20 00 00 01
> e9 35 f2 ff ff <0f> 0b 4c 89 ff e8 ea 72 03 00 83 f8 01 41 89 c4 0f 85
> 47 f9 ff ff

Can you check which line of the source this is?

Paolo
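
One common way to answer this kind of question is the scripts/faddr2line helper in the kernel source tree, which takes an object file and a "function+offset/size" string as printed on the RIP line. The following is a minimal sketch that extracts that string from a warning and builds the command. The function name faddr2line_command is hypothetical, and the default object path "vmlinux" is an assumption (it fits a build where KVM is built in, as the empty "Modules linked in:" line suggests).

```python
import re

# Matches "symbol+0xoffset/0xsize" on a kernel-mode (CS 0010) RIP line.
RIP_RE = re.compile(r"RIP: 0010:(?P<loc>[\w.]+\+0x[0-9a-f]+/0x[0-9a-f]+)")


def faddr2line_command(log_text, obj="vmlinux"):
    """Build a scripts/faddr2line argument list from the first kernel RIP line.

    Returns None if the text contains no kernel-mode RIP line.
    """
    match = RIP_RE.search(log_text)
    if match is None:
        return None
    return ["./scripts/faddr2line", obj, match.group("loc")]
```

The returned list can be run from the top of the kernel build tree, for example with subprocess.run(), provided the object file was built with debug info.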

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Potential bug in TDP MMU
  2021-11-30 11:43         ` Ignat Korchagin
  2021-11-30 11:49           ` Paolo Bonzini
@ 2021-11-30 12:13           ` Paolo Bonzini
  1 sibling, 0 replies; 20+ messages in thread
From: Paolo Bonzini @ 2021-11-30 12:13 UTC (permalink / raw)
  To: Ignat Korchagin, kvm; +Cc: stevensd, kernel-team

On 11/30/21 12:43, Ignat Korchagin wrote:
> I have also noticed another new warning, when running this on the
> kernel from kvm.git branch:
> 
> [   70.284354][ T2928] WARNING: CPU: 4 PID: 2928 at
> arch/x86/kvm/x86.c:9886 kvm_arch_vcpu_ioctl_run+0x126c/0x17d0
> [   70.284354][ T2928] Modules linked in:
> [   70.284354][ T2928] CPU: 4 PID: 2928 Comm: exe Not tainted 5.16.0-rc2 #2
> [   70.284354][ T2928] Hardware name: QEMU Standard PC (Q35 + ICH9,

Doh, sorry I was on the wrong branch so I couldn't find a WARN at 9886. :)

I'll Cc you on a patch.

Paolo

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Potential bug in TDP MMU
  2021-11-30 10:58   ` Ignat Korchagin
  2021-11-30 10:59     ` Ignat Korchagin
  2021-11-30 11:11     ` Paolo Bonzini
@ 2021-11-30 20:23     ` Sean Christopherson
  2021-12-01 23:44       ` Ignat Korchagin
  2 siblings, 1 reply; 20+ messages in thread
From: Sean Christopherson @ 2021-11-30 20:23 UTC (permalink / raw)
  To: Ignat Korchagin; +Cc: Paolo Bonzini, kvm, stevensd, kernel-team

On Tue, Nov 30, 2021, Ignat Korchagin wrote:
> I have managed to reliably reproduce the issue on a QEMU VM (on a host
> with nested virtualisation enabled). Here are the steps:
> 
> 1. Install gvisor as per
> https://gvisor.dev/docs/user_guide/install/#install-latest
> 2. Run
> $ for i in $(seq 1 100); do sudo runsc --platform=kvm --network=none
> do echo ok; done
> 
> I've tried to recompile the kernel with the above patch, but
> unfortunately it does not fix the issue. I'm happy to try other
> patches/fixes queued for 5.16-rc4

My best guess would be https://lore.kernel.org/all/20211120045046.3940942-5-seanjc@google.com/,
that bug results in KVM installing SPTEs into an invalid root.  I think that could
lead to a use-after-free and/or double-free, which is usually what leads to the
"Bad page state" errors.

In the meantime, I'll try to repro.

> > > arch/x86/kvm/../../../virt/kvm/kvm_main.c:171

...

> > > After this the machine starts spitting some traces starting with:
> > >
> > > [177247.871683][T2343516] BUG: Bad page state in process <comm>  pfn:fe680a

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Potential bug in TDP MMU
  2021-11-30 20:23     ` Sean Christopherson
@ 2021-12-01 23:44       ` Ignat Korchagin
  2021-12-10 23:04         ` Ignat Korchagin
  0 siblings, 1 reply; 20+ messages in thread
From: Ignat Korchagin @ 2021-12-01 23:44 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: Paolo Bonzini, kvm, stevensd, kernel-team

On Tue, Nov 30, 2021 at 8:23 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Tue, Nov 30, 2021, Ignat Korchagin wrote:
> > I have managed to reliably reproduce the issue on a QEMU VM (on a host
> > with nested virtualisation enabled). Here are the steps:
> >
> > 1. Install gvisor as per
> > https://gvisor.dev/docs/user_guide/install/#install-latest
> > 2. Run
> > $ for i in $(seq 1 100); do sudo runsc --platform=kvm --network=none
> > do echo ok; done
> >
> > I've tried to recompile the kernel with the above patch, but
> > unfortunately it does not fix the issue. I'm happy to try other
> > patches/fixes queued for 5.16-rc4
>
> My best guess would be https://lore.kernel.org/all/20211120045046.3940942-5-seanjc@google.com/,
> that bug results in KVM installing SPTEs into an invalid root.  I think that could
> lead to a use-after-free and/or double-free, which is usually what leads to the
> "Bad page state" errors.

Unfortunately, that patch (alone) does not fix it in my repro environment.

Ignat

>
> In the meantime, I'll try to repro.
>
> > > > arch/x86/kvm/../../../virt/kvm/kvm_main.c:171
>
> ...
>
> > > > After this the machine starts spitting some traces starting with:
> > > >
> > > > [177247.871683][T2343516] BUG: Bad page state in process <comm>  pfn:fe680a

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Potential bug in TDP MMU
  2021-12-01 23:44       ` Ignat Korchagin
@ 2021-12-10 23:04         ` Ignat Korchagin
  2021-12-11  1:34           ` David Matlack
                             ` (2 more replies)
  0 siblings, 3 replies; 20+ messages in thread
From: Ignat Korchagin @ 2021-12-10 23:04 UTC (permalink / raw)
  To: kvm; +Cc: Paolo Bonzini, Sean Christopherson, stevensd, kernel-team

I've been trying to figure out the difference between "good" runs and
"bad" runs of gvisor. So, I've been running the following bpftrace
one-liner:

$ bpftrace -e 'kprobe:kvm_set_pfn_dirty { @[kstack] = count(); }'

while also executing a single:

$ sudo runsc --platform=kvm --network=none do echo ok

So, for "good" runs the stacks are the following:

# bpftrace -e 'kprobe:kvm_set_pfn_dirty { @[kstack] = count(); }'
Attaching 1 probe...
^C

@[
    kvm_set_pfn_dirty+1
    __handle_changed_spte+2535
    __tdp_mmu_set_spte+396
    zap_gfn_range+2229
    kvm_tdp_mmu_unmap_gfn_range+331
    kvm_unmap_gfn_range+774
    kvm_mmu_notifier_invalidate_range_start+743
    __mmu_notifier_invalidate_range_start+508
    unmap_vmas+566
    unmap_region+494
    __do_munmap+1172
    __vm_munmap+226
    __x64_sys_munmap+98
    do_syscall_64+64
    entry_SYSCALL_64_after_hwframe+68
]: 1
@[
    kvm_set_pfn_dirty+1
    __handle_changed_spte+2535
    __tdp_mmu_set_spte+396
    zap_gfn_range+2229
    kvm_tdp_mmu_unmap_gfn_range+331
    kvm_unmap_gfn_range+774
    kvm_mmu_notifier_invalidate_range_start+743
    __mmu_notifier_invalidate_range_start+508
    zap_page_range_single+870
    unmap_mapping_pages+434
    shmem_fallocate+2518
    vfs_fallocate+684
    __x64_sys_fallocate+181
    do_syscall_64+64
    entry_SYSCALL_64_after_hwframe+68
]: 32
@[
    kvm_set_pfn_dirty+1
    __handle_changed_spte+2535
    __handle_changed_spte+1746
    __handle_changed_spte+1746
    __handle_changed_spte+1746
    __tdp_mmu_set_spte+396
    zap_gfn_range+2229
    __kvm_tdp_mmu_zap_gfn_range+162
    kvm_tdp_mmu_zap_all+34
    kvm_mmu_zap_all+518
    kvm_mmu_notifier_release+83
    __mmu_notifier_release+420
    exit_mmap+965
    mmput+167
    do_exit+2482
    do_group_exit+236
    get_signal+1000
    arch_do_signal_or_restart+580
    exit_to_user_mode_prepare+300
    syscall_exit_to_user_mode+25
    do_syscall_64+77
    entry_SYSCALL_64_after_hwframe+68
]: 365

For "bad" runs, when I get the warning - I get this:

# bpftrace -e 'kprobe:kvm_set_pfn_dirty { @[kstack] = count(); }'
Attaching 1 probe...
^C

@[
    kvm_set_pfn_dirty+1
    __handle_changed_spte+2535
    __tdp_mmu_set_spte+396
    zap_gfn_range+2229
    kvm_tdp_mmu_unmap_gfn_range+331
    kvm_unmap_gfn_range+774
    kvm_mmu_notifier_invalidate_range_start+743
    __mmu_notifier_invalidate_range_start+508
    unmap_vmas+566
    unmap_region+494
    __do_munmap+1172
    __vm_munmap+226
    __x64_sys_munmap+98
    do_syscall_64+64
    entry_SYSCALL_64_after_hwframe+68
]: 1
@[
    kvm_set_pfn_dirty+1
    __handle_changed_spte+2535
    __handle_changed_spte+1746
    __handle_changed_spte+1746
    __handle_changed_spte+1746
    __tdp_mmu_set_spte+396
    zap_gfn_range+2229
    kvm_tdp_mmu_put_root+465
    mmu_free_root_page+537
    kvm_mmu_free_roots+629
    kvm_mmu_unload+28
    kvm_arch_destroy_vm+510
    kvm_put_kvm+1017
    kvm_vcpu_release+78
    __fput+516
    task_work_run+206
    do_exit+2615
    do_group_exit+236
    get_signal+1000
    arch_do_signal_or_restart+580
    exit_to_user_mode_prepare+300
    syscall_exit_to_user_mode+25
    do_syscall_64+77
    entry_SYSCALL_64_after_hwframe+68
]: 2
@[
    kvm_set_pfn_dirty+1
    __handle_changed_spte+2535
    __tdp_mmu_set_spte+396
    zap_gfn_range+2229
    kvm_tdp_mmu_unmap_gfn_range+331
    kvm_unmap_gfn_range+774
    kvm_mmu_notifier_invalidate_range_start+743
    __mmu_notifier_invalidate_range_start+508
    zap_page_range_single+870
    unmap_mapping_pages+434
    shmem_fallocate+2518
    vfs_fallocate+684
    __x64_sys_fallocate+181
    do_syscall_64+64
    entry_SYSCALL_64_after_hwframe+68
]: 32
@[
    kvm_set_pfn_dirty+1
    __handle_changed_spte+2535
    __handle_changed_spte+1746
    __handle_changed_spte+1746
    __handle_changed_spte+1746
    __tdp_mmu_set_spte+396
    zap_gfn_range+2229
    __kvm_tdp_mmu_zap_gfn_range+162
    kvm_tdp_mmu_zap_all+34
    kvm_mmu_zap_all+518
    kvm_mmu_notifier_release+83
    __mmu_notifier_release+420
    exit_mmap+965
    mmput+167
    do_exit+2482
    do_group_exit+236
    get_signal+1000
    arch_do_signal_or_restart+580
    exit_to_user_mode_prepare+300
    syscall_exit_to_user_mode+25
    do_syscall_64+77
    entry_SYSCALL_64_after_hwframe+68
]: 344

That is, I never get a stack with
kvm_tdp_mmu_put_root->..->kvm_set_pfn_dirty with a "good" run.
Perhaps, this may shed some light onto what is going on.

Ignat
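
The comparison above can also be done mechanically when the bpftrace output of each run is saved to a file. The following is a minimal sketch under stated assumptions: the function names parse_stacks and only_in_bad are hypothetical, and the parser expects the raw "@[ ... ]: N" layout shown in this message, without mail quote prefixes.

```python
import re

# One bpftrace map entry: "@[", indented frames, then "]: <count>".
BLOCK_RE = re.compile(r"@\[\n(?P<frames>.*?)\n\]: (?P<count>\d+)", re.S)


def parse_stacks(text):
    """Map each stack (tuple of function names, offsets dropped) to its count."""
    stacks = {}
    for match in BLOCK_RE.finditer(text):
        frames = tuple(
            line.strip().rsplit("+", 1)[0]
            for line in match.group("frames").splitlines()
            if line.strip()
        )
        stacks[frames] = stacks.get(frames, 0) + int(match.group("count"))
    return stacks


def only_in_bad(good_text, bad_text):
    """Return the stacks (with counts) seen in the bad run but not the good one."""
    good = parse_stacks(good_text)
    return {s: n for s, n in parse_stacks(bad_text).items() if s not in good}
```

Offsets are dropped so that the same call chain still compares equal across kernel builds; counts are ignored for the comparison, since they differ between runs anyway.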

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Potential bug in TDP MMU
  2021-12-10 23:04         ` Ignat Korchagin
@ 2021-12-11  1:34           ` David Matlack
  2021-12-11  1:49             ` Paolo Bonzini
  2021-12-11  1:37           ` Paolo Bonzini
  2021-12-11  2:39           ` Sean Christopherson
  2 siblings, 1 reply; 20+ messages in thread
From: David Matlack @ 2021-12-11  1:34 UTC (permalink / raw)
  To: Ignat Korchagin
  Cc: kvm, Paolo Bonzini, Sean Christopherson, stevensd, kernel-team

On Fri, Dec 10, 2021 at 3:05 PM Ignat Korchagin <ignat@cloudflare.com> wrote:
>
> I've been trying to figure out the difference between "good" runs and
> "bad" runs of gvisor. So, I've been running the following bpftrace
> one-liner:
>
> $ bpftrace -e 'kprobe:kvm_set_pfn_dirty { @[kstack] = count(); }'
>
> while also executing a single:
>
> $ sudo runsc --platform=kvm --network=none do echo ok
>
> So, for "good" runs the stacks are the following:

The stacks help, thanks for including them. It seems like a race
during do_exit teardown. One thing I notice is that
do_exit->mmput->kvm_mmu_zap_all can interleave with
kvm_vcpu_release->kvm_tdp_mmu_put_root (full call chains omitted),
since the former path allows yielding. But I don't yet see how that could
lead to any issues, let alone cause us to encounter a PFN in the EPT
with a zero refcount.

I'll take a closer look next week.

>
> # bpftrace -e 'kprobe:kvm_set_pfn_dirty { @[kstack] = count(); }'
> Attaching 1 probe...
> ^C
>
> @[
>     kvm_set_pfn_dirty+1
>     __handle_changed_spte+2535
>     __tdp_mmu_set_spte+396
>     zap_gfn_range+2229
>     kvm_tdp_mmu_unmap_gfn_range+331
>     kvm_unmap_gfn_range+774
>     kvm_mmu_notifier_invalidate_range_start+743
>     __mmu_notifier_invalidate_range_start+508
>     unmap_vmas+566
>     unmap_region+494
>     __do_munmap+1172
>     __vm_munmap+226
>     __x64_sys_munmap+98
>     do_syscall_64+64
>     entry_SYSCALL_64_after_hwframe+68
> ]: 1
> @[
>     kvm_set_pfn_dirty+1
>     __handle_changed_spte+2535
>     __tdp_mmu_set_spte+396
>     zap_gfn_range+2229
>     kvm_tdp_mmu_unmap_gfn_range+331
>     kvm_unmap_gfn_range+774
>     kvm_mmu_notifier_invalidate_range_start+743
>     __mmu_notifier_invalidate_range_start+508
>     zap_page_range_single+870
>     unmap_mapping_pages+434
>     shmem_fallocate+2518
>     vfs_fallocate+684
>     __x64_sys_fallocate+181
>     do_syscall_64+64
>     entry_SYSCALL_64_after_hwframe+68
> ]: 32
> @[
>     kvm_set_pfn_dirty+1
>     __handle_changed_spte+2535
>     __handle_changed_spte+1746
>     __handle_changed_spte+1746
>     __handle_changed_spte+1746
>     __tdp_mmu_set_spte+396
>     zap_gfn_range+2229
>     __kvm_tdp_mmu_zap_gfn_range+162
>     kvm_tdp_mmu_zap_all+34
>     kvm_mmu_zap_all+518
>     kvm_mmu_notifier_release+83
>     __mmu_notifier_release+420
>     exit_mmap+965
>     mmput+167
>     do_exit+2482
>     do_group_exit+236
>     get_signal+1000
>     arch_do_signal_or_restart+580
>     exit_to_user_mode_prepare+300
>     syscall_exit_to_user_mode+25
>     do_syscall_64+77
>     entry_SYSCALL_64_after_hwframe+68
> ]: 365
>
> For "bad" runs, when I get the warning - I get this:
>
> # bpftrace -e 'kprobe:kvm_set_pfn_dirty { @[kstack] = count(); }'
> Attaching 1 probe...
> ^C
>
> @[
>     kvm_set_pfn_dirty+1
>     __handle_changed_spte+2535
>     __tdp_mmu_set_spte+396
>     zap_gfn_range+2229
>     kvm_tdp_mmu_unmap_gfn_range+331
>     kvm_unmap_gfn_range+774
>     kvm_mmu_notifier_invalidate_range_start+743
>     __mmu_notifier_invalidate_range_start+508
>     unmap_vmas+566
>     unmap_region+494
>     __do_munmap+1172
>     __vm_munmap+226
>     __x64_sys_munmap+98
>     do_syscall_64+64
>     entry_SYSCALL_64_after_hwframe+68
> ]: 1
> @[
>     kvm_set_pfn_dirty+1
>     __handle_changed_spte+2535
>     __handle_changed_spte+1746
>     __handle_changed_spte+1746
>     __handle_changed_spte+1746
>     __tdp_mmu_set_spte+396
>     zap_gfn_range+2229
>     kvm_tdp_mmu_put_root+465
>     mmu_free_root_page+537
>     kvm_mmu_free_roots+629
>     kvm_mmu_unload+28
>     kvm_arch_destroy_vm+510
>     kvm_put_kvm+1017
>     kvm_vcpu_release+78
>     __fput+516
>     task_work_run+206
>     do_exit+2615
>     do_group_exit+236
>     get_signal+1000
>     arch_do_signal_or_restart+580
>     exit_to_user_mode_prepare+300
>     syscall_exit_to_user_mode+25
>     do_syscall_64+77
>     entry_SYSCALL_64_after_hwframe+68
> ]: 2
> @[
>     kvm_set_pfn_dirty+1
>     __handle_changed_spte+2535
>     __tdp_mmu_set_spte+396
>     zap_gfn_range+2229
>     kvm_tdp_mmu_unmap_gfn_range+331
>     kvm_unmap_gfn_range+774
>     kvm_mmu_notifier_invalidate_range_start+743
>     __mmu_notifier_invalidate_range_start+508
>     zap_page_range_single+870
>     unmap_mapping_pages+434
>     shmem_fallocate+2518
>     vfs_fallocate+684
>     __x64_sys_fallocate+181
>     do_syscall_64+64
>     entry_SYSCALL_64_after_hwframe+68
> ]: 32
> @[
>     kvm_set_pfn_dirty+1
>     __handle_changed_spte+2535
>     __handle_changed_spte+1746
>     __handle_changed_spte+1746
>     __handle_changed_spte+1746
>     __tdp_mmu_set_spte+396
>     zap_gfn_range+2229
>     __kvm_tdp_mmu_zap_gfn_range+162
>     kvm_tdp_mmu_zap_all+34
>     kvm_mmu_zap_all+518
>     kvm_mmu_notifier_release+83
>     __mmu_notifier_release+420
>     exit_mmap+965
>     mmput+167
>     do_exit+2482
>     do_group_exit+236
>     get_signal+1000
>     arch_do_signal_or_restart+580
>     exit_to_user_mode_prepare+300
>     syscall_exit_to_user_mode+25
>     do_syscall_64+77
>     entry_SYSCALL_64_after_hwframe+68
> ]: 344
>
> That is, I never get a stack with
> kvm_tdp_mmu_put_root->..->kvm_set_pfn_dirty with a "good" run.
> Perhaps, this may shed some light onto what is going on.
>
> Ignat

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Potential bug in TDP MMU
  2021-12-10 23:04         ` Ignat Korchagin
  2021-12-11  1:34           ` David Matlack
@ 2021-12-11  1:37           ` Paolo Bonzini
  2021-12-11  2:39           ` Sean Christopherson
  2 siblings, 0 replies; 20+ messages in thread
From: Paolo Bonzini @ 2021-12-11  1:37 UTC (permalink / raw)
  To: Ignat Korchagin, kvm; +Cc: Sean Christopherson, stevensd, kernel-team

On 12/11/21 00:04, Ignat Korchagin wrote:
> That is, I never get a stack with
> kvm_tdp_mmu_put_root->..->kvm_set_pfn_dirty with a "good" run.
> Perhaps, this may shed some light onto what is going on.

Maybe not kvm_tdp_mmu_put_root->...->kvm_set_pfn_dirty per se, but
do_exit->kvm_tdp_mmu_put_root->...->kvm_set_pfn_dirty seems to be
part of the problem.

Both kvm_set_pfn_dirty and kvm_set_pfn_accessed, which is where
execution really goes in the weeds, have this conditional:

	if (!kvm_is_reserved_pfn(pfn) && !kvm_is_zone_device_pfn(pfn))
		...

And indeed kvm_is_zone_device_pfn(pfn) returns false if the WARN_ON_ONCE
fires.  What happens is that the page has already been released by the
process's exiting, so it has no A/D tracking anymore.  But the conditional
is true and bad things happen in workingset_activation: while a page with
!page_count(pfn_to_page(pfn)) is definitely not a ZONE_DEVICE page,
it's _also_ not a page that should be marked dirty or accessed.

Something like the following, while completely wrong or at least nothing
more than a bandaid, should at least avoid the worst consequences of the
bug:

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 168d0ab93c88..699455715699 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -176,6 +176,14 @@ bool kvm_is_zone_device_pfn(kvm_pfn_t pfn)
  	return is_zone_device_page(pfn_to_page(pfn));
  }
  
+static inline bool kvm_pfn_has_accessed_dirty(kvm_pfn_t pfn)
+{
+	if (!pfn_valid(pfn) || !page_count(pfn_to_page(pfn)))
+		return false;
+
+	return !PageReserved(pfn_to_page(pfn)) || is_zero_pfn(pfn);
+}
+
  bool kvm_is_reserved_pfn(kvm_pfn_t pfn)
  {
  	/*
@@ -2812,14 +2820,14 @@ EXPORT_SYMBOL_GPL(kvm_release_pfn_dirty);
  
  void kvm_set_pfn_dirty(kvm_pfn_t pfn)
  {
-	if (!kvm_is_reserved_pfn(pfn) && !kvm_is_zone_device_pfn(pfn))
+	if (kvm_pfn_has_accessed_dirty(pfn))
  		SetPageDirty(pfn_to_page(pfn));
  }
  EXPORT_SYMBOL_GPL(kvm_set_pfn_dirty);
  
  void kvm_set_pfn_accessed(kvm_pfn_t pfn)
  {
-	if (!kvm_is_reserved_pfn(pfn) && !kvm_is_zone_device_pfn(pfn))
+	if (kvm_pfn_has_accessed_dirty(pfn))
  		mark_page_accessed(pfn_to_page(pfn));
  }
  EXPORT_SYMBOL_GPL(kvm_set_pfn_accessed);
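
To make the difference between the two conditionals concrete, here is a
minimal userspace sketch. It is only a model, not kernel code: struct
fake_page, model_is_zone_device(), old_gate() and new_gate() are
hypothetical names, and the old gate is simplified to "not reserved and not
ZONE_DEVICE". The only behaviour taken from the discussion above is that the
ZONE_DEVICE check returns false for a page with a zero refcount.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct page; only the bits the gates look at. */
struct fake_page {
	int refcount;     /* stands in for page_count() */
	bool reserved;    /* stands in for PageReserved() */
	bool zone_device; /* stands in for is_zone_device_page() */
	bool zero_page;   /* stands in for is_zero_pfn() */
};

/* Models kvm_is_zone_device_pfn(): false (after a WARN) when refcount is 0. */
static bool model_is_zone_device(const struct fake_page *p)
{
	assert(p != NULL);
	if (p->refcount == 0)
		return false;
	return p->zone_device;
}

/* Simplified old gate: "not reserved and not ZONE_DEVICE". */
static bool old_gate(const struct fake_page *p)
{
	return !p->reserved && !model_is_zone_device(p);
}

/* Gate from the bandaid: a page with no references has no A/D state. */
static bool new_gate(const struct fake_page *p)
{
	assert(p != NULL);
	if (p->refcount == 0)
		return false;
	return !p->reserved || p->zero_page;
}
```

In this model a page with refcount 0 passes old_gate() but is rejected by
new_gate(), while an ordinary page with a nonzero refcount passes both.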

The real question is why kvm_mmu_free_roots is finding some dirty pages
in the do_exit->exit_files->...->close_files path, well after exit_mm()
has finished running.  I'm not sure how kvm_mmu_zap_all could leave
something behind.

This might be completely off track, but maybe it helps someone fix
the bug while I get some sleep.

Paolo

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: Potential bug in TDP MMU
  2021-12-11  1:34           ` David Matlack
@ 2021-12-11  1:49             ` Paolo Bonzini
  2021-12-11 17:46               ` David Matlack
  0 siblings, 1 reply; 20+ messages in thread
From: Paolo Bonzini @ 2021-12-11  1:49 UTC (permalink / raw)
  To: David Matlack, Ignat Korchagin
  Cc: kvm, Sean Christopherson, stevensd, kernel-team

On 12/11/21 02:34, David Matlack wrote:
> The stacks help, thanks for including them. It seems like a race
> during do_exit teardown. One thing I notice is that
> do_exit->mmput->kvm_mmu_zap_all can interleave with
> kvm_vcpu_release->kvm_tdp_mmu_put_root (full call chains omitted),
> since the former path allows yielding. But I don't yet see how that could
> lead to any issues, let alone cause us to encounter a PFN in the EPT
> with a zero refcount.

Can it? The call chains are

     zap_gfn_range+2229
     kvm_tdp_mmu_put_root+465
     kvm_mmu_free_roots+629
     kvm_mmu_unload+28
     kvm_arch_destroy_vm+510
     kvm_put_kvm+1017
     kvm_vcpu_release+78
     __fput+516
     task_work_run+206
     do_exit+2615
     do_group_exit+236

and

     zap_gfn_range+2229
     __kvm_tdp_mmu_zap_gfn_range+162
     kvm_tdp_mmu_zap_all+34
     kvm_mmu_zap_all+518
     kvm_mmu_notifier_release+83
     __mmu_notifier_release+420
     exit_mmap+965
     mmput+167
     do_exit+2482
     do_group_exit+236

but there can be no parallelism or interleaving here, because the call 
to kvm_vcpu_release() is scheduled in exit_files() (and performed in 
exit_task_work()).  That comes after exit_mm(), where mmput() is called.

Even if the two could interleave, they go through the same zap_gfn_range 
path.  That path takes the lock for write and only yields on the 512 
top-level page structures.  Anything below is handled by 
tdp_mmu_set_spte's (with mutual recursion between handle_changed_spte 
and handle_removed_tdp_mmu_page), and there are no yields on that path.

Paolo

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Potential bug in TDP MMU
  2021-12-10 23:04         ` Ignat Korchagin
  2021-12-11  1:34           ` David Matlack
  2021-12-11  1:37           ` Paolo Bonzini
@ 2021-12-11  2:39           ` Sean Christopherson
  2021-12-11  9:38             ` Ignat Korchagin
  2021-12-11 20:49             ` Paolo Bonzini
  2 siblings, 2 replies; 20+ messages in thread
From: Sean Christopherson @ 2021-12-11  2:39 UTC (permalink / raw)
  To: Ignat Korchagin; +Cc: kvm, Paolo Bonzini, stevensd, kernel-team

On Fri, Dec 10, 2021, Ignat Korchagin wrote:
> I've been trying to figure out the difference between "good" runs and
> "bad" runs of gvisor. So, if I've been running the following bpftrace
> one-liner:

...

> That is, I never get a stack with
> kvm_tdp_mmu_put_root->..->kvm_set_pfn_dirty with a "good" run.
> Perhaps, this may shed some light onto what is going on.

Hmm, a little?

Based on the WARN backtrace, KVM encounters an entire chain of valid, present TDP
MMU paging structures _after_ exit_mm() in the do_exit() path, as the call to
task_work_run() in do_exit() occurs after exit_mm().

That means that kvm_mmu_zap_all() is guaranteed to have been called before the
fatal kvm_arch_destroy_vm(), as either:

  a) exit_mm() put the last reference to mm_users and thus called __mmput ->
     exit_mmap() -> mmu_notifier_release() -> ... -> kvm_mmu_zap_all().

  b) Something else had a reference to mm_users, and so KVM's ->release hook was
     invoked by kvm_destroy_vm() -> mmu_notifier_unregister().

It's probably fairly safe to assume this is a TDP MMU bug, which rules out races
or bad refcounts in other areas.

That means that KVM (a) is somehow losing track of a root, (b) isn't zapping all
SPTEs in kvm_mmu_zap_all(), or (c) is installing a SPTE after the mm has been released.

(a) is unlikely because kvm_tdp_mmu_get_vcpu_root_hpa() is the only way for a
vCPU to get a reference, and it holds mmu_lock for write, doesn't yield, and
either gets a root from the list or adds a root to the list.

(b) is unlikely because I would expect the fallout to be much larger and not
unique to your setup.

That leaves (c), which isn't all that likely either.  I can think of a variety of
ways KVM might write a defunct SPTE, but I can't concoct a scenario where an
entire tree of present paging structures is written.

Can you run with the below debug patch and see if you get a hit in the failure
scenario?  Or possibly even a non-failure scenario?  This should either confirm
or rule out (c).


---
 arch/x86/kvm/mmu/mmu.c     | 2 ++
 arch/x86/kvm/mmu/tdp_mmu.c | 5 +++++
 include/linux/kvm_host.h   | 2 ++
 3 files changed, 9 insertions(+)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 1ccee4d17481..e4e283a38570 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5939,6 +5939,8 @@ void kvm_mmu_zap_all(struct kvm *kvm)
 	LIST_HEAD(invalid_list);
 	int ign;

+	atomic_set(&kvm->mm_released, 1);
+
 	write_lock(&kvm->mmu_lock);
 restart:
 	list_for_each_entry_safe(sp, node, &kvm->arch.active_mmu_pages, link) {
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index b69e47e68307..432ccf05f446 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -504,6 +504,9 @@ static inline bool tdp_mmu_set_spte_atomic(struct kvm *kvm,
 {
 	lockdep_assert_held_read(&kvm->mmu_lock);

+	WARN_ON(atomic_read(&kvm->mm_released) &&
+		new_spte && !is_removed_spte(new_spte));
+
 	/*
 	 * Do not change removed SPTEs. Only the thread that froze the SPTE
 	 * may modify it.
@@ -577,6 +580,8 @@ static inline void __tdp_mmu_set_spte(struct kvm *kvm, struct tdp_iter *iter,
 {
 	lockdep_assert_held_write(&kvm->mmu_lock);

+	WARN_ON(atomic_read(&kvm->mm_released) && new_spte);
+
 	/*
 	 * No thread should be using this function to set SPTEs to the
 	 * temporary removed SPTE value.
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index e7bfcc3b6b0b..8e76e2f6c3be 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -569,6 +569,8 @@ struct kvm {

 	struct mutex slots_lock;

+	atomic_t mm_released;
+
 	/*
 	 * Protects the arch-specific fields of struct kvm_memory_slots in
 	 * use by the VM. To be used under the slots_lock (above) or in a

base-commit: 1c10f4b4877ffaed602d12ff8cbbd5009e82c970
--
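
The idea behind the debug patch can be sketched in userspace as follows.
This is a hedged model, not the patch itself: struct model_vm,
model_vm_init(), model_zap_all() and model_set_spte() are hypothetical
names, C11 atomics stand in for the kernel's atomic_t, and a counter stands
in for WARN_ON(). The flag is set before zapping, and any later write of a
nonzero SPTE trips the check, while writes of zero (the zap itself) do not.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define MODEL_NR_SPTES 4

/* Hypothetical stand-in for struct kvm with a tiny flat "page table". */
struct model_vm {
	atomic_int mm_released;          /* set once zap-all has started */
	int warnings;                    /* counts would-be WARN_ON() hits */
	uint64_t spte[MODEL_NR_SPTES];
};

static void model_vm_init(struct model_vm *vm)
{
	atomic_init(&vm->mm_released, 0);
	vm->warnings = 0;
	for (int i = 0; i < MODEL_NR_SPTES; i++)
		vm->spte[i] = 0;
}

/* Every SPTE write goes through here, like __tdp_mmu_set_spte(). */
static void model_set_spte(struct model_vm *vm, int idx, uint64_t new_spte)
{
	assert(idx >= 0 && idx < MODEL_NR_SPTES);
	/* Tripwire: a nonzero SPTE after release is the suspicious case. */
	if (atomic_load(&vm->mm_released) && new_spte)
		vm->warnings++;
	vm->spte[idx] = new_spte;
}

/* Models kvm_mmu_zap_all(): mark the mm as released, then clear all SPTEs. */
static void model_zap_all(struct model_vm *vm)
{
	atomic_store(&vm->mm_released, 1);
	for (int i = 0; i < MODEL_NR_SPTES; i++)
		model_set_spte(vm, i, 0);
}
```

If nothing installs a nonzero SPTE after model_zap_all(), the warning count
stays at zero, which corresponds to ruling out scenario (c).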

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: Potential bug in TDP MMU
  2021-12-11  2:39           ` Sean Christopherson
@ 2021-12-11  9:38             ` Ignat Korchagin
  2021-12-11 20:49             ` Paolo Bonzini
  1 sibling, 0 replies; 20+ messages in thread
From: Ignat Korchagin @ 2021-12-11  9:38 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: kvm, Paolo Bonzini, stevensd, kernel-team

On Sat, Dec 11, 2021 at 2:39 AM Sean Christopherson <seanjc@google.com> wrote:
>
> On Fri, Dec 10, 2021, Ignat Korchagin wrote:
> > I've been trying to figure out the difference between "good" runs and
> > "bad" runs of gvisor. So, if I've been running the following bpftrace
> > one-liner:
>
> ...
>
> > That is, I never get a stack with
> > kvm_tdp_mmu_put_root->..->kvm_set_pfn_dirty with a "good" run.
> > Perhaps, this may shed some light onto what is going on.
>
> Hmm, a little?
>
> Based on the WARN backtrace, KVM encounters an entire chain of valid, present TDP
> MMU paging structures _after_ exit_mm() in the do_exit() path, as the call to
> task_work_run() in do_exit() occurs after exit_mm().
>
> That means that kvm_mmu_zap_all() is guaranteed to have been called before the
> fatal kvm_arch_destroy_vm(), as either:
>
>   a) exit_mm() put the last reference to mm_users and thus called __mmput ->
>      exit_mmap() -> mmu_notifier_release() -> ... -> kvm_mmu_zap_all().
>
>   b) Something else had a reference to mm_users, and so KVM's ->release hook was
>      invoked by kvm_destroy_vm() -> mmu_notifier_unregister().
>
> It's probably fairly safe to assume this is a TDP MMU bug, which rules out races
> or bad refcounts in other areas.

Most likely. Currently we're using kvm.tdp_mmu=0 kernel cmdline as a
workaround and haven't encountered any issues.

> That means that KVM (a) is somehow losing track of a root, (b) isn't zapping all
> SPTEs in kvm_mmu_zap_all(), or (c) is installing a SPTE after the mm has been released.
>
> (a) is unlikely because kvm_tdp_mmu_get_vcpu_root_hpa() is the only way for a
> vCPU to get a reference, and it holds mmu_lock for write, doesn't yield, and
> either gets a root from the list or adds a root to the list.
>
> (b) is unlikely because I would expect the fallout to be much larger and not
> unique to your setup.
>
> That leaves (c), which isn't all that likely either.  I can think of a variety of
> ways KVM might write a defunct SPTE, but I can't concoct a scenario where an
> entire tree of present paging structures is written.
>
> Can you run with the below debug patch and see if you get a hit in the failure
> scenario?  Or possibly even a non-failure scenario?  This should either confirm
> or rule out (c).
>
>
> ---
>  arch/x86/kvm/mmu/mmu.c     | 2 ++
>  arch/x86/kvm/mmu/tdp_mmu.c | 5 +++++
>  include/linux/kvm_host.h   | 2 ++
>  3 files changed, 9 insertions(+)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 1ccee4d17481..e4e283a38570 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -5939,6 +5939,8 @@ void kvm_mmu_zap_all(struct kvm *kvm)
>         LIST_HEAD(invalid_list);
>         int ign;
>
> +       atomic_set(&kvm->mm_released, 1);
> +
>         write_lock(&kvm->mmu_lock);
>  restart:
>         list_for_each_entry_safe(sp, node, &kvm->arch.active_mmu_pages, link) {
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index b69e47e68307..432ccf05f446 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -504,6 +504,9 @@ static inline bool tdp_mmu_set_spte_atomic(struct kvm *kvm,
>  {
>         lockdep_assert_held_read(&kvm->mmu_lock);
>
> +       WARN_ON(atomic_read(&kvm->mm_released) &&
> +               new_spte && !is_removed_spte(new_spte));
> +
>         /*
>          * Do not change removed SPTEs. Only the thread that froze the SPTE
>          * may modify it.
> @@ -577,6 +580,8 @@ static inline void __tdp_mmu_set_spte(struct kvm *kvm, struct tdp_iter *iter,
>  {
>         lockdep_assert_held_write(&kvm->mmu_lock);
>
> +       WARN_ON(atomic_read(&kvm->mm_released) && new_spte);
> +
>         /*
>          * No thread should be using this function to set SPTEs to the
>          * temporary removed SPTE value.
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index e7bfcc3b6b0b..8e76e2f6c3be 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -569,6 +569,8 @@ struct kvm {
>
>         struct mutex slots_lock;
>
> +       atomic_t mm_released;
> +
>         /*
>          * Protects the arch-specific fields of struct kvm_memory_slots in
>          * use by the VM. To be used under the slots_lock (above) or in a
>
> base-commit: 1c10f4b4877ffaed602d12ff8cbbd5009e82c970
> --

Thanks. I applied the patch, but no warnings are triggered in either the
"good" case or the "bad" case.

Ignat

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Potential bug in TDP MMU
  2021-12-11  1:49             ` Paolo Bonzini
@ 2021-12-11 17:46               ` David Matlack
  0 siblings, 0 replies; 20+ messages in thread
From: David Matlack @ 2021-12-11 17:46 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Ignat Korchagin, kvm, Sean Christopherson, stevensd, kernel-team

On Fri, Dec 10, 2021 at 5:49 PM Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> On 12/11/21 02:34, David Matlack wrote:
> > The stacks help, thanks for including them. It seems like a race
> > during do_exit teardown. One thing I notice is that
> > do_exit->mmput->kvm_mmu_zap_all can interleave with
> > kvm_vcpu_release->kvm_tdp_mmu_put_root (full call chains omitted),
> > since the former path allows yielding. But I don't yet see how that could
> > lead to any issues, let alone cause us to encounter a PFN in the EPT
> > with a zero refcount.
>
> Can it? The call chains are
>
>      zap_gfn_range+2229
>      kvm_tdp_mmu_put_root+465
>      kvm_mmu_free_roots+629
>      kvm_mmu_unload+28
>      kvm_arch_destroy_vm+510
>      kvm_put_kvm+1017
>      kvm_vcpu_release+78
>      __fput+516
>      task_work_run+206
>      do_exit+2615
>      do_group_exit+236
>
> and
>
>      zap_gfn_range+2229
>      __kvm_tdp_mmu_zap_gfn_range+162
>      kvm_tdp_mmu_zap_all+34
>      kvm_mmu_zap_all+518
>      kvm_mmu_notifier_release+83
>      __mmu_notifier_release+420
>      exit_mmap+965
>      mmput+167
>      do_exit+2482
>      do_group_exit+236
>
> but there can be no parallelism or interleaving here, because the call
> to kvm_vcpu_release() is scheduled in exit_files() (and performed in
> exit_task_work()).  That comes after exit_mm(), where mmput() is called.

Ah, I was thinking each thread in the process would run do_exit()
concurrently (first thread enters mmput() but the refcount is not at
zero and proceeds to task_work_run, second enters mmput() and the
refcount is at zero and invokes notifier->release()).
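
A minimal sketch of that refcount scenario, assuming a hypothetical
struct model_mm and model_mmput() in place of the real mm_struct and
mmput(): only the put that drops the user count to zero runs the release
step, so an earlier thread can move on to its task work before release has
happened.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for mm_struct: a user count and a release marker. */
struct model_mm {
	int mm_users;   /* stands in for mm->mm_users */
	bool released;  /* set when the notifier ->release() would run */
};

/* Models mmput(): only the final put triggers the release step. */
static void model_mmput(struct model_mm *mm)
{
	assert(mm->mm_users > 0);
	if (--mm->mm_users == 0)
		mm->released = true;
}
```

With two users, the first model_mmput() leaves released false and the
second one sets it.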

>
> Even if the two could interleave, they go through the same zap_gfn_range
> path.  That path takes the lock for write and only yields on the 512
> top-level page structures.  Anything below is handled by
> tdp_mmu_set_spte's (with mutual recursion between handle_changed_spte
> and handle_removed_tdp_mmu_page), and there are no yields on that path.
>
> Paolo

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Potential bug in TDP MMU
  2021-12-11  2:39           ` Sean Christopherson
  2021-12-11  9:38             ` Ignat Korchagin
@ 2021-12-11 20:49             ` Paolo Bonzini
  2021-12-13 16:14               ` Sean Christopherson
  1 sibling, 1 reply; 20+ messages in thread
From: Paolo Bonzini @ 2021-12-11 20:49 UTC (permalink / raw)
  To: Sean Christopherson, Ignat Korchagin, David Matlack, Ben Gardon
  Cc: kvm, stevensd, kernel-team

On 12/11/21 03:39, Sean Christopherson wrote:
> That means that KVM (a) is somehow losing track of a root, (b) isn't zapping all
> SPTEs in kvm_mmu_zap_all(), or (c) is installing a SPTE after the mm has been released.
> 
> (a) is unlikely because kvm_tdp_mmu_get_vcpu_root_hpa() is the only way for a
> vCPU to get a reference, and it holds mmu_lock for write, doesn't yield, and
> either gets a root from the list or adds a root to the list.
> 
> (b) is unlikely because I would expect the fallout to be much larger and not
> unique to your setup.

Hmm, I think it's kvm_mmu_zap_all() skipping invalidated roots.  One fix
could be the following - untested and uncompiled, after all it's Saturday.

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 7c5dd83e52de..2e05b6a815b6 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -781,18 +781,6 @@ bool __kvm_tdp_mmu_zap_gfn_range(struct kvm *kvm, int as_id, gfn_t start,
  	return flush;
  }
  
-void kvm_tdp_mmu_zap_all(struct kvm *kvm)
-{
-	bool flush = false;
-	int i;
-
-	for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++)
-		flush = kvm_tdp_mmu_zap_gfn_range(kvm, i, 0, -1ull, flush);
-
-	if (flush)
-		kvm_flush_remote_tlbs(kvm);
-}
-
  static struct kvm_mmu_page *next_invalidated_root(struct kvm *kvm,
  						  struct kvm_mmu_page *prev_root)
  {
@@ -859,6 +847,33 @@ void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm)
  		kvm_flush_remote_tlbs(kvm);
  }
  
+void kvm_tdp_mmu_zap_all(struct kvm *kvm)
+{
+	struct kvm_mmu_page *root, *next_root;
+	bool flush = false;
+
+	/*
+	 * We need to zap all roots, including already-invalid ones.  The
+	 * easiest way is to ensure there's only invalid roots which then,
+	 * for efficiency, we zap with shared==false unlike
+	 * kvm_tdp_mmu_zap_invalidated_roots.
+	 */
+	kvm_tdp_mmu_invalidate_all_roots(kvm);
+
+	root = next_invalidated_root(kvm, NULL);
+	while (root) {
+		flush = zap_gfn_range(kvm, root, 0, -1ull, true, flush, false);
+		next_root = next_invalidated_root(kvm, root);
+
+		/* Put the reference acquired in kvm_tdp_mmu_invalidate_all_roots.  */
+		kvm_tdp_mmu_put_root(kvm, root, false);
+		root = next_root;
+	}
+
+	if (flush)
+		kvm_flush_remote_tlbs(kvm);
+}
+
  /*
   * Mark each TDP MMU root as invalid so that other threads
   * will drop their references and allow the root count to
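
The hypothesis behind this patch can be illustrated with a small userspace
model. It is only a sketch under stated assumptions: struct model_root and
the three helpers are hypothetical names, and whether the real
kvm_mmu_zap_all() can actually encounter an invalid root at this point is
exactly what is in question in this thread. The model only shows why a
zap-all that skips invalid roots would leave SPTEs behind, and why
invalidating every root first and then zapping all invalid roots would not.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a TDP MMU root. */
struct model_root {
	bool invalid;       /* stands in for root->role.invalid */
	int present_sptes;  /* number of SPTEs still present under the root */
};

/* Zap that walks only valid roots, as the hypothesis assumes. */
static void zap_valid_roots_only(struct model_root *roots, size_t n)
{
	assert(roots != NULL || n == 0);
	for (size_t i = 0; i < n; i++)
		if (!roots[i].invalid)
			roots[i].present_sptes = 0;
}

/* Shape of the proposed fix: invalidate everything, then zap invalid roots. */
static void invalidate_then_zap_all(struct model_root *roots, size_t n)
{
	assert(roots != NULL || n == 0);
	for (size_t i = 0; i < n; i++)
		roots[i].invalid = true;
	for (size_t i = 0; i < n; i++)
		if (roots[i].invalid)
			roots[i].present_sptes = 0;
}

/* Total SPTEs that survived the zap. */
static int leftover_sptes(const struct model_root *roots, size_t n)
{
	int total = 0;

	for (size_t i = 0; i < n; i++)
		total += roots[i].present_sptes;
	return total;
}
```

Starting from one valid and one already-invalid root, the first variant
leaves the invalid root's SPTEs in place and the second leaves none.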


Paolo

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: Potential bug in TDP MMU
  2021-12-11 20:49             ` Paolo Bonzini
@ 2021-12-13 16:14               ` Sean Christopherson
  0 siblings, 0 replies; 20+ messages in thread
From: Sean Christopherson @ 2021-12-13 16:14 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Ignat Korchagin, David Matlack, Ben Gardon, kvm, stevensd, kernel-team

On Sat, Dec 11, 2021, Paolo Bonzini wrote:
> On 12/11/21 03:39, Sean Christopherson wrote:
> > That means that KVM (a) is somehow losing track of a root, (b) isn't zapping all
> > SPTEs in kvm_mmu_zap_all(), or (c) is installing a SPTE after the mm has been released.
> > 
> > (a) is unlikely because kvm_tdp_mmu_get_vcpu_root_hpa() is the only way for a
> > vCPU to get a reference, and it holds mmu_lock for write, doesn't yield, and
> > either gets a root from the list or adds a root to the list.
> > 
> > (b) is unlikely because I would expect the fallout to be much larger and not
> > unique to your setup.
> 
> Hmm, I think it's kvm_mmu_zap_all() skipping invalidated roots.

That should be impossible.  kvm_mmu_zap_all_fast() invalidates those roots before
it completes, and all paths that lead to kvm_mmu_zap_all_fast() prevent
kvm_destroy_vm() from getting to mmu_notifier_unregister().

kvm_mmu_invalidate_mmio_sptes() and kvm_mmu_invalidate_zap_pages_in_memslot()
are reachable only via memslot update, which requires a reference to KVM and thus
prevents putting the last reference to KVM.

set_nx_huge_pages() runs with kvm_lock held, which prevents kvm_destroy_vm() from
proceeding to mmu_notifier_unregister().

If your patch does make the problem go away, we have a bug somewhere else.

One other experiment that's probably worth trying at this point is running with
my zap and flush overhaul[*], which is based on commit 81d7c6659da0 ("KVM: VMX:
Remove vCPU from PI wakeup list before updating PID.NV").  I highly doubt it will
fix the issue, but I'm out of other ideas until one of us can reproduce the bug.

[*] https://lore.kernel.org/all/20211120045046.3940942-1-seanjc@google.com/

^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2021-12-13 16:14 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-11-29 21:44 Potential bug in TDP MMU Ignat Korchagin
2021-11-30  9:29 ` Paolo Bonzini
2021-11-30 10:58   ` Ignat Korchagin
2021-11-30 10:59     ` Ignat Korchagin
2021-11-30 11:11     ` Paolo Bonzini
2021-11-30 11:19       ` Ignat Korchagin
2021-11-30 11:43         ` Ignat Korchagin
2021-11-30 11:49           ` Paolo Bonzini
2021-11-30 12:13           ` Paolo Bonzini
2021-11-30 20:23     ` Sean Christopherson
2021-12-01 23:44       ` Ignat Korchagin
2021-12-10 23:04         ` Ignat Korchagin
2021-12-11  1:34           ` David Matlack
2021-12-11  1:49             ` Paolo Bonzini
2021-12-11 17:46               ` David Matlack
2021-12-11  1:37           ` Paolo Bonzini
2021-12-11  2:39           ` Sean Christopherson
2021-12-11  9:38             ` Ignat Korchagin
2021-12-11 20:49             ` Paolo Bonzini
2021-12-13 16:14               ` Sean Christopherson
