On Tue, Jul 20, 2021 at 12:13 AM Peter Xu wrote: > On Mon, Jul 19, 2021 at 12:11:21PM -0700, Hugh Dickins wrote: > > Hi Peter, > > Hi, Hugh, > > > > > I believe you have already fixed this, but the fix needs to go to stable. > > Sorry, the messages below are a muddle of top and middle posting, > > I'll resume at the bottom. > > > > On Fri, 16 Jul 2021, Hugh Dickins wrote: > > > On Thu, 15 Jul 2021, Igor Raits wrote: > > > > > > > Hi everyone again, > > > > > > > > I've been trying to reproduce this issue but still can't find a > consistent > > > > pattern. > > > > > > > > However, it did happen once more and this time on 5.13.1: > > > > > > Thanks for the updates, Igor. > > > > > > I have to admit that what you have reported confirms the suspicion > > > that it's a bug introduced by one of my "stable" patches in 5.12.14 > > > (which are also in 5.13): nothing else between 5.12.12 and 5.12.14 > > > seems likely to be relevant. > > > > > > But I've gone back and forth and not been able to spot the problem. > > > > > > Please would you send (either privately to me, or to the list) your > > > 5.13.1 kernel's .config, and disassembly of pmd_migration_entry_wait() > > > from its vmlinux (with line numbers if available; or just send the > > > whole vmlinux if that's easier, and I'll disassemble). > > > > > > I am hoping that the disassembly, together with the register contents > > > that you've shown, will help guide towards an answer. > > > > > > Thanks, > > > Hugh > > > > > > > > > > > [ 222.068216] ------------[ cut here ]------------ > > > > [ 222.072884] kernel BUG at include/linux/swapops.h:204! 
> > > > [ 222.078062] invalid opcode: 0000 [#1] SMP NOPTI > > > > [ 222.082618] CPU: 38 PID: 9828 Comm: rpc-worker Tainted: G > E > > > > 5.13.1-1.gdc.el8.x86_64 #1 > > > > [ 222.091894] Hardware name: HPE ProLiant DL380 Gen10/ProLiant DL380 > > > > Gen10, BIOS U30 05/24/2021 > > > > [ 222.100468] RIP: 0010:pmd_migration_entry_wait+0x132/0x140 > > > > [ 222.105994] Code: 02 00 00 00 5b 4c 89 c7 5d e9 ca c5 f6 ff 48 81 > e2 00 > > > > f0 ff ff 48 f7 d2 48 21 c2 89 d1 f7 c2 81 01 00 00 75 80 e9 44 ff ff > ff > > > > <0f> 0b 48 8b 2d 65 48 30 01 e9 ef fe ff ff 0f 1f 44 00 00 41 55 48 > > > > [ 222.124878] RSP: 0000:ffffbcfe9eb7bdd8 EFLAGS: 00010246 > > > > [ 222.130134] RAX: 0057ffffc0000000 RBX: ffff9eec4b3cdbf8 RCX: > > > > ffffffffffffffff > > > > [ 222.137309] RDX: 0000000000000000 RSI: ffff9eec4b3cdbf8 RDI: > > > > ffffdf55c52cf368 > > > > [ 222.144485] RBP: ffffdf55c52cf368 R08: ffffdf57428d8080 R09: > > > > 0000000000000000 > > > > [ 222.151661] R10: 0000000000000000 R11: 0000000000000000 R12: > > > > 0000000000000bf8 > > > > [ 222.158837] R13: 0400000000000000 R14: 0400000000000080 R15: > > > > ffff9eec2825b1f8 > > > > [ 222.166015] FS: 00007f6754aeb700(0000) GS:ffff9f49bfd00000(0000) > > > > knlGS:0000000000000000 > > > > [ 222.174153] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > > > > [ 222.179932] CR2: 00007f676ffffd98 CR3: 000000012bf6a002 CR4: > > > > 00000000007726e0 > > > > [ 222.187109] DR0: 0000000000000000 DR1: 0000000000000000 DR2: > > > > 0000000000000000 > > > > [ 222.194283] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: > > > > 0000000000000400 > > > > [ 222.201457] PKRU: 55555554 > > > > [ 222.204178] Call Trace: > > > > [ 222.206638] __handle_mm_fault+0x5ad/0x6e0 > > > > [ 222.210760] ? sysvec_call_function_single+0xb/0x90 > > > > [ 222.215672] handle_mm_fault+0xc5/0x290 > > > > [ 222.219529] do_user_addr_fault+0x1a9/0x660 > > > > [ 222.223740] ? 
sched_clock_cpu+0xc/0xa0 > > > > [ 222.227602] exc_page_fault+0x68/0x130 > > > > [ 222.231373] ? asm_exc_page_fault+0x8/0x30 > > > > [ 222.235495] asm_exc_page_fault+0x1e/0x30 > > > > [ 222.239526] RIP: 0033:0x7f67baaed734 > > > > [ 222.243120] Code: 89 08 48 8b 35 dd 3b 21 00 4c 8d 0d d6 3b 21 00 > 31 c0 > > > > 4c 39 ce 74 73 0f 1f 80 00 00 00 00 48 8d 96 40 fd ff ff 49 39 d2 74 > 22 > > > > <48> 8b 96 d8 03 00 00 48 01 15 4e 7c 21 00 80 be 50 03 00 00 00 c7 > > > > [ 222.262002] RSP: 002b:00007f6754aea298 EFLAGS: 00010287 > > > > [ 222.267257] RAX: 0000000000000000 RBX: 0000000000000000 RCX: > > > > 0000000000000000 > > > > [ 222.274432] RDX: 00007f676ffff700 RSI: 00007f676ffff9c0 RDI: > > > > 00007f676f7fec10 > > > > [ 222.281609] RBP: 0000000000000001 R08: 00007f676f7fed10 R09: > > > > 00007f67bad012f0 > > > > [ 222.288785] R10: 00007f6754aeb700 R11: 0000000000000202 R12: > > > > 0000000000000001 > > > > [ 222.295961] R13: 0000000000000006 R14: 0000000000000e28 R15: > > > > 00007f674006e1f0 > > > > [ 222.303137] Modules linked in: vhost_net(E) vhost(E) > vhost_iotlb(E) > > > > tap(E) tun(E) tcp_diag(E) udp_diag(E) inet_diag(E) netconsole(E) > > > > nf_tables(E) vxlan(E) ip6_udp_tunnel(E) udp_tunnel(E) nfnetlink(E) > > > > binfmt_misc(E) iscsi_tcp(E) libiscsi_tcp(E) 8021q(E) garp(E) mrp(E) > > > > bonding(E) tls(E) vfat(E) fat(E) dm_service_time(E) dm_multipath(E) > > > > rpcrdma(E) sunrpc(E) rdma_ucm(E) ib_srpt(E) ib_isert(E) > iscsi_target_mod(E) > > > > target_core_mod(E) ib_iser(E) rdma_cm(E) iw_cm(E) ib_cm(E) > libiscsi(E) > > > > intel_rapl_msr(E) intel_rapl_common(E) scsi_transport_iscsi(E) > > > > isst_if_common(E) ipmi_ssif(E) nfit(E) libnvdimm(E) > x86_pkg_temp_thermal(E) > > > > intel_powerclamp(E) coretemp(E) kvm_intel(E) kvm(E) irqbypass(E) > > > > crct10dif_pclmul(E) crc32_pclmul(E) ghash_clmulni_intel(E) rapl(E) > qedr(E) > > > > mei_me(E) acpi_ipmi(E) ib_uverbs(E) intel_cstate(E) ipmi_si(E) > ib_core(E) > > > > ipmi_devintf(E) dm_mod(E) 
ioatdma(E) ses(E) intel_uncore(E) pcspkr(E) > > > > enclosure(E) mei(E) hpwdt(E) hpilo(E) lpc_ich(E) intel_pch_thermal(E) > > > > dca(E) ipmi_msghandler(E) > > > > [ 222.303181] acpi_power_meter(E) ext4(E) mbcache(E) jbd2(E) > sd_mod(E) > > > > t10_pi(E) sg(E) qedf(E) qede(E) libfcoe(E) qed(E) libfc(E) > smartpqi(E) > > > > scsi_transport_fc(E) tg3(E) scsi_transport_sas(E) crc8(E) wmi(E) > > > > nf_conntrack(E) libcrc32c(E) crc32c_intel(E) nf_defrag_ipv6(E) > > > > nf_defrag_ipv4(E) br_netfilter(E) bridge(E) stp(E) llc(E) > > > > [ 222.420050] ---[ end trace bcf7b6d1610cc21f ]--- > > > > [ 222.572925] RIP: 0010:pmd_migration_entry_wait+0x132/0x140 > > > > [ 222.578469] Code: 02 00 00 00 5b 4c 89 c7 5d e9 ca c5 f6 ff 48 81 > e2 00 > > > > f0 ff ff 48 f7 d2 48 21 c2 89 d1 f7 c2 81 01 00 00 75 80 e9 44 ff ff > ff > > > > <0f> 0b 48 8b 2d 65 48 30 01 e9 ef fe ff ff 0f 1f 44 00 00 41 55 48 > > > > [ 222.597359] RSP: 0000:ffffbcfe9eb7bdd8 EFLAGS: 00010246 > > > > [ 222.602620] RAX: 0057ffffc0000000 RBX: ffff9eec4b3cdbf8 RCX: > > > > ffffffffffffffff > > > > [ 222.609807] RDX: 0000000000000000 RSI: ffff9eec4b3cdbf8 RDI: > > > > ffffdf55c52cf368 > > > > [ 222.616990] RBP: ffffdf55c52cf368 R08: ffffdf57428d8080 R09: > > > > 0000000000000000 > > > > [ 222.624177] R10: 0000000000000000 R11: 0000000000000000 R12: > > > > 0000000000000bf8 > > > > [ 222.631361] R13: 0400000000000000 R14: 0400000000000080 R15: > > > > ffff9eec2825b1f8 > > > > [ 222.638548] FS: 00007f6754aeb700(0000) GS:ffff9f49bfd00000(0000) > > > > knlGS:0000000000000000 > > > > [ 222.646694] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > > > > [ 222.652481] CR2: 00007f676ffffd98 CR3: 000000012bf6a002 CR4: > > > > 00000000007726e0 > > > > [ 222.659665] DR0: 0000000000000000 DR1: 0000000000000000 DR2: > > > > 0000000000000000 > > > > [ 222.666850] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: > > > > 0000000000000400 > > > > [ 222.674031] PKRU: 55555554 > > > > [ 222.676758] Kernel panic - not syncing: 
Fatal exception > > > > [ 222.817538] Kernel Offset: 0x16000000 from 0xffffffff81000000 > > > > (relocation range: 0xffffffff80000000-0xffffffffbfffffff) > > > > [ 222.965540] ---[ end Kernel panic - not syncing: Fatal exception > ]--- > > > > > > > > On Sun, Jul 11, 2021 at 8:06 AM Igor Raits > wrote: > > > > > > > > > Hi Hugh, > > > > > > > > > > On Sun, Jul 11, 2021 at 6:17 AM Hugh Dickins > wrote: > > > > > > > > > >> On Sat, 10 Jul 2021, Igor Raits wrote: > > > > >> > > > > >> > Hello, > > > > >> > > > > > >> > I've seen one weird bug on 5.12.14 that happened a couple of > times when > > > > >> I > > > > >> > started a bunch of VMs on a server. > > > > >> > > > > >> Would it be possible for you to try the same on a 5.12.13 kernel? > > > > >> Perhaps by reverting the diff between 5.12.13 and 5.12.14 > temporarily. > > > > >> Enough to form an impression of whether the issue is new in > 5.12.14. > > > > >> > > > > > > > > > > We've been using 5.12.12 for quite some time (~ a month) and I > never saw > > > > > it there. > > > > > > > > > > But I have to admit that I don't really have a reproducer. For > example, on > > > > > servers where it happened, > > > > > I just rebooted them and panic did not happen anymore (so I saw it > only > > > > > only once, > > > > > only on 2 servers out of 32 that we have on 5.12.14). > > > > > > > > > > > > > > >> I ask because 5.12.14 did include several fixes and cleanups from > me > > > > >> to page_vma_mapped_walk(), and that is involved in inserting and > > > > >> removing pmd migration entries. I am not aware of introducing any > > > > >> bug there, but your report has got me worried. If it's happening > in > > > > >> 5.12.14 but not in 5.12.13, then I must look again at my changes. > > > > >> > > > > >> I don't expect Hillf's patch to help at at all: the pmd_lock() > > > > >> is supposed to be taken by page_vma_mapped_walk(), before > > > > >> set_pmd_migration_entry() and remove_migration_pmd() are called. 
> > > > >> > > > > >> Thanks, > > > > >> Hugh > > > > >> > > > > >> > > > > > >> > I've briefly googled this problem but could not find any > relevant commit > > > > >> > that would fix this issue. > > > > >> > > > > > >> > Do you have any hint how to debug this further or know the fix > by any > > > > >> > chance? > > > > >> > > > > > >> > Thanks in advance. Stack trace following: > > > > >> > > > > > >> > [ 376.876610] ------------[ cut here ]------------ > > > > >> > [ 376.881274] kernel BUG at include/linux/swapops.h:204! > > > > >> > [ 376.886455] invalid opcode: 0000 [#1] SMP NOPTI > > > > >> > [ 376.891014] CPU: 40 PID: 11775 Comm: rpc-worker Tainted: G > > > > >> E > > > > >> > 5.12.14-1.gdc.el8.x86_64 #1 > > > > >> > [ 376.900464] Hardware name: HPE ProLiant DL380 Gen10/ProLiant > DL380 > > > > >> > Gen10, BIOS U30 05/24/2021 > > > > >> > [ 376.909038] RIP: 0010:pmd_migration_entry_wait+0x132/0x140 > > > > >> > [ 376.914562] Code: 02 00 00 00 5b 4c 89 c7 5d e9 8a e4 f6 ff > 48 81 e2 > > > > >> 00 > > > > >> > f0 ff ff 48 f7 d2 48 21 c2 89 d1 f7 c2 81 01 00 00 75 80 e9 44 > ff ff ff > > > > >> > <0f> 0b 48 8b 2d 75 bd 30 01 e9 ef fe ff ff 0f 1f 44 00 00 41 > 55 48 > > > > >> > [ 376.933443] RSP: 0000:ffffb65a5e1cfdc8 EFLAGS: 00010246 > > > > >> > [ 376.938701] RAX: 0017ffffc0000000 RBX: ffff908b8ecabaf8 RCX: > > > > >> > ffffffffffffffff > > > > >> > [ 376.945878] RDX: 0000000000000000 RSI: ffff908b8ecabaf8 RDI: > > > > >> > fffff497473b2ae8 > > > > >> > [ 376.953055] RBP: fffff497473b2ae8 R08: fffff49747fa8080 R09: > > > > >> > 0000000000000000 > > > > >> > [ 376.960230] R10: 0000000000000000 R11: 0000000000000000 R12: > > > > >> > 0000000000000af8 > > > > >> > [ 376.967407] R13: 0400000000000000 R14: 0400000000000080 R15: > > > > >> > ffff908bbef7b6a8 > > > > >> > [ 376.974582] FS: 00007f5bb1f81700(0000) > GS:ffff90e87fd80000(0000) > > > > >> > knlGS:0000000000000000 > > > > >> > [ 376.982718] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > > > > >> > 
[ 376.988497] CR2: 00007f5b2bfffd98 CR3: 00000001f793e006 CR4: > > > > >> > 00000000007726e0 > > > > >> > [ 376.995673] DR0: 0000000000000000 DR1: 0000000000000000 DR2: > > > > >> > 0000000000000000 > > > > >> > [ 377.002849] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: > > > > >> > 0000000000000400 > > > > >> > [ 377.010026] PKRU: 55555554 > > > > >> > [ 377.012745] Call Trace: > > > > >> > [ 377.015207] __handle_mm_fault+0x5ad/0x6e0 > > > > >> > [ 377.019335] handle_mm_fault+0xc5/0x290 > > > > >> > [ 377.023194] do_user_addr_fault+0x1cd/0x740 > > > > >> > [ 377.027406] exc_page_fault+0x54/0x110 > > > > >> > [ 377.031182] ? asm_exc_page_fault+0x8/0x30 > > > > >> > [ 377.035307] asm_exc_page_fault+0x1e/0x30 > > > > >> > [ 377.039340] RIP: 0033:0x7f5bb91d6734 > > > > >> > [ 377.042937] Code: 89 08 48 8b 35 dd 3b 21 00 4c 8d 0d d6 3b > 21 00 31 > > > > >> c0 > > > > >> > 4c 39 ce 74 73 0f 1f 80 00 00 00 00 48 8d 96 40 fd ff ff 49 39 > d2 74 22 > > > > >> > <48> 8b 96 d8 03 00 00 48 01 15 4e 7c 21 00 80 be 50 03 00 00 > 00 c7 > > > > >> > [ 377.061820] RSP: 002b:00007f5bb1f7ff58 EFLAGS: 00010206 > > > > >> > [ 377.067076] RAX: 0000000000000000 RBX: 0000000000000000 RCX: > > > > >> > 00007f5ba0000020 > > > > >> > [ 377.074255] RDX: 00007f5b2bfff700 RSI: 00007f5b2bfff9c0 RDI: > > > > >> > 0000000000000001 > > > > >> > [ 377.081429] RBP: 0000000000000001 R08: 0000000000000000 R09: > > > > >> > 00007f5bb93ea2f0 > > > > >> > [ 377.088606] R10: 00007f5bb1f81700 R11: 0000000000000202 R12: > > > > >> > 0000000000000001 > > > > >> > [ 377.095782] R13: 0000000000000006 R14: 0000000000000cb4 R15: > > > > >> > 00007f5bb1f801f0 > > > > >> > [ 377.102958] Modules linked in: ebt_arp(E) nft_meta_bridge(E) > > > > >> > ip6_tables(E) xt_CT(E) nf_log_ipv4(E) nf_log_common(E) > nft_limit(E) > > > > >> > nft_counter(E) xt_LOG(E) xt_limit(E) xt_mac(E) xt_set(E) > xt_multiport(E) > > > > >> > xt_state(E) xt_conntrack(E) xt_comment(E) xt_physdev(E) > nft_compat(E) > > > > >> > 
ip_set_hash_net(E) ip_set(E) vhost_net(E) vhost(E) > vhost_iotlb(E) tap(E) > > > > >> > tun(E) tcp_diag(E) udp_diag(E) inet_diag(E) netconsole(E) > nf_tables(E) > > > > >> > vxlan(E) ip6_udp_tunnel(E) udp_tunnel(E) nfnetlink(E) > binfmt_misc(E) > > > > >> > iscsi_tcp(E) libiscsi_tcp(E) 8021q(E) garp(E) mrp(E) bonding(E) > tls(E) > > > > >> > vfat(E) fat(E) dm_service_time(E) dm_multipath(E) rpcrdma(E) > sunrpc(E) > > > > >> > rdma_ucm(E) ib_srpt(E) ib_isert(E) iscsi_target_mod(E) > > > > >> target_core_mod(E) > > > > >> > ib_iser(E) rdma_cm(E) iw_cm(E) ib_cm(E) libiscsi(E) > > > > >> scsi_transport_iscsi(E) > > > > >> > intel_rapl_msr(E) qedr(E) intel_rapl_common(E) ib_uverbs(E) > > > > >> > isst_if_common(E) ib_core(E) nfit(E) libnvdimm(E) > > > > >> x86_pkg_temp_thermal(E) > > > > >> > intel_powerclamp(E) coretemp(E) kvm_intel(E) kvm(E) irqbypass(E) > > > > >> > crct10dif_pclmul(E) > > > > >> > [ 377.102999] crc32_pclmul(E) ghash_clmulni_intel(E) rapl(E) > > > > >> > intel_cstate(E) ipmi_ssif(E) acpi_ipmi(E) ipmi_si(E) mei_me(E) > > > > >> ioatdma(E) > > > > >> > ipmi_devintf(E) dm_mod(E) ses(E) intel_uncore(E) pcspkr(E) > qede(E) > > > > >> > enclosure(E) tg3(E) mei(E) lpc_ich(E) hpilo(E) hpwdt(E) > > > > >> > intel_pch_thermal(E) dca(E) ipmi_msghandler(E) > acpi_power_meter(E) > > > > >> ext4(E) > > > > >> > mbcache(E) jbd2(E) sd_mod(E) t10_pi(E) sg(E) qedf(E) qed(E) > crc8(E) > > > > >> > libfcoe(E) libfc(E) smartpqi(E) scsi_transport_fc(E) > > > > >> scsi_transport_sas(E) > > > > >> > wmi(E) nf_conntrack(E) nf_defrag_ipv6(E) libcrc32c(E) > crc32c_intel(E) > > > > >> > nf_defrag_ipv4(E) br_netfilter(E) bridge(E) stp(E) llc(E) > > > > >> > [ 377.243468] ---[ end trace 04bce3bb051f7620 ]--- > > > > >> > [ 377.385645] RIP: 0010:pmd_migration_entry_wait+0x132/0x140 > > > > >> > [ 377.391194] Code: 02 00 00 00 5b 4c 89 c7 5d e9 8a e4 f6 ff > 48 81 e2 > > > > >> 00 > > > > >> > f0 ff ff 48 f7 d2 48 21 c2 89 d1 f7 c2 81 01 00 00 75 80 e9 44 > ff ff ff > > > > >> > 
<0f> 0b 48 8b 2d 75 bd 30 01 e9 ef fe ff ff 0f 1f 44 00 00 41 55 48
> > > >> > [ 377.410091] RSP: 0000:ffffb65a5e1cfdc8 EFLAGS: 00010246
> > > >> > [ 377.415355] RAX: 0017ffffc0000000 RBX: ffff908b8ecabaf8 RCX: ffffffffffffffff
> > > >> > [ 377.422540] RDX: 0000000000000000 RSI: ffff908b8ecabaf8 RDI: fffff497473b2ae8
> > > >> > [ 377.429721] RBP: fffff497473b2ae8 R08: fffff49747fa8080 R09: 0000000000000000
> > > >> > [ 377.436902] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000af8
> > > >> > [ 377.444086] R13: 0400000000000000 R14: 0400000000000080 R15: ffff908bbef7b6a8
> > > >> > [ 377.451272] FS: 00007f5bb1f81700(0000) GS:ffff90e87fd80000(0000) knlGS:0000000000000000
> > > >> > [ 377.459415] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > >> > [ 377.465196] CR2: 00007f5b2bfffd98 CR3: 00000001f793e006 CR4: 00000000007726e0
> > > >> > [ 377.472377] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > > >> > [ 377.479556] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> > > >> > [ 377.486738] PKRU: 55555554
> > > >> > [ 377.489465] Kernel panic - not syncing: Fatal exception
> > > >> > [ 377.573911] Kernel Offset: 0xa000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
> > > >> > [ 377.716482] ---[ end Kernel panic - not syncing: Fatal exception ]---
> >
> > Disassembly of the vmlinux Igor sent (along with other info) confirmed
> > something I suspected, that R08: fffff49747fa8080 in one of the dumps,
> > R08: ffffdf57428d8080 in the other, is the relevant struct page pointer
> > (and RAX the page->flags, which look like it was pointing at a good
> > page).
> >
> > A page pointer ....8080 in pmd_migration_entry_wait() is interesting:
> > normally I'd expect that to be ....0000 or ....8000, pointing to the
> > head of a huge page. But instead it's pointing to the second tail
> > (though by now that compound page has been freed, and head pointers in
> > the tails reset to 0): as if the pfn has been incremented by 2 somehow.
> >
> > And if the pfn (swp_offset) in the migration entry has got corrupted,
> > then it's no surprise that when removing migration entries,
> > page_vma_mapped_walk() would see migration_entry_to_page(entry) != page,
> > so be unable to replace that migration entry, leaving it behind for the
> > user to hit BUG_ON(!PageLocked) in pmd_migration_entry_wait() when
> > faulting on it later.
> >
> > So, what might increment the swp_offset by 2? Hunt around the encodings.
> > Hmm, _PAGE_BIT_UFFD_WP is _PAGE_BIT_SOFTW2 which is bit 10, whereas
> > _PAGE_BIT_PROTNONE (top bit to be avoided in pte encoding of swap)
> > is _PAGE_BIT_GLOBAL is bit 8. After overcoming off-by-one confusions,
> > it looks like if something somewhere were to set _PAGE_BIT_UFFD_WP
> > in a migration pmd (whereas it's only suitable for a present pmd),
> > it would indeed increment the swp_offset by 2.
> >
> > Hunt for uffd_wps, and run across copy_huge_pmd() in mm/huge_memory.c:
> > in Igor's 5.13.1 and 5.12.14 and many others, that says
> >
> >         if (!(vma->vm_flags & VM_UFFD_WP))
> >                 pmd = pmd_clear_uffd_wp(pmd);
> >
> > just *before* checking is_swap_pmd(). Fixed in 5.14-rc1 in commit
> > 8f34f1eac382 ("mm/userfaultfd: fix uffd-wp special cases for fork()").
> >
> > But clearing the bit would be harmless, wouldn't it? Because it wouldn't
> > be set anyway.
Waste a day before remembering what I never forgot but
> > somehow blanked out: the L1TF "feature" forced us to invert the offset
> > bits in the pte encoding of a swap entry, so there really is a bit set
> > there in the pmd entry, and clearing it has the effect of setting it in
> > the corresponding swap entry, so incrementing the migration pfn by 2.
> >
> > I cannot explain why Igor never saw this crash on 5.12.12: maybe
> > something else in the environment changed around that time. And it
> > will take several days for it to be confirmed as the fix in practice.
> >
> > But I'm confident that 8f34f1eac382 will prove to be the fix, so Peter
> > please prepare some backports of that for the various stable/longterm
> > kernels that need it - I've not looked into whether it applies cleanly,
> > or depends on other commits too. You fixed several related but different
> > things in that commit: but this one is the worst, because it can corrupt
> > even those who are not using UFFD_WP at all.
>
> Looks right to me, b569a1760782 ("userfaultfd: wp: drop _PAGE_UFFD_WP
> properly when fork", 2020-04-07) seems to be the culprit. I didn't notice
> the side effect in the bug or in the fix, or it should have already landed
> in stable. I am very sorry for such an elementary bug that caused this
> fallout - I really can't tell why I completely overlooked is_swap_pte(),
> which is so obvious in hindsight.
>
> I checked: 5.6.y doesn't have the offending commit yet, as it's not marked
> with "Fixes". It started to show up in 5.7.y~5.13.y. 5.14-rc1 has
> 8f34f1eac382, which is the fix. So I think we need the fix, or an
> equivalent fix, for 5.7.y~5.13.y.
>
> 5.12.y & 5.13.y can pick up the fix 8f34f1eac382 cleanly. For the older
> branches (5.7.y~5.11.y) they can't, so I plan to revert b569a1760782
> instead.

FTR, even though 8f34f1eac382 applies cleanly it does not compile.
The 1st patch of that series (5fc7a5f6fd04) is also required: it removes a
use of *vma, and *vma itself is later removed by the patch that fixes the
actual problem.

> >
> > Many thanks for reporting and helping, Igor.
> > Hugh
> >
> > p.s. Peter, unrelated to this particular bug, and should not divert from
> > fixing it: but looking again at those swap encodings, and particularly
> > the soft_dirty manipulations: they look very fragile. I think uffd_wp
> > was wrong to follow that bad example, and your upcoming new encoding
> > (that I have previously called elegant) takes it a worse step further.
> >
> > I think we should change to a rule where the architecture-independent
> > swp_entry_t contains *all* the info, including bits for soft_dirty and
> > uffd_wp, so that swap entry cases can move immediately to decoding from
> > arch-dependent pte to arch-independent swp_entry_t, and do all the
> > manipulations on that. But I don't have time to make that change, and
> > probably neither do you, and making the change is liable to introduce
> > errors itself. So, no immediate plans, but please keep in mind.
>
> Curious: did we encounter a similar issue previously, where the soft-dirty
> bit was applied wrongly and caused hard-to-debug issues?
>
> If this is destined to be the best solution, I can work on both of them.
> I am just worried that it's too big a change, as you said, so we don't
> know what's most efficient considering the total time we spend to develop,
> review and debug them.
>
> The other alternative is that we fix bugs; I know that's so cheap a word
> when I say it, but we still can't deny it as an option yet.
>
> We can definitely discuss this outside of this thread, and I'll prepare
> the backports first. In any case, this bug definitely raises an alert,
> and I'll keep that in mind.
>
> Please let me know if there's any comment on the backport plan above, or
> I'll prepare the patches for all the branches before tomorrow.
>
> Thanks,
>
> --
> Peter Xu

--
Igor Raits
Sr. SW Engineer
igor@gooddata.com
+420 775 117 817
Moravske namesti 1007/14
602 00 Brno-Veveri, Czech Republic