On Tue, Jul 20, 2021 at 12:13 AM Peter Xu wrote: > On Mon, Jul 19, 2021 at 12:11:21PM -0700, Hugh Dickins wrote: > > Hi Peter, > > Hi, Hugh, > > > > > I believe you have already fixed this, but the fix needs to go to stable. > > Sorry, the messages below are a muddle of top and middle posting, > > I'll resume at the bottom. > > > > On Fri, 16 Jul 2021, Hugh Dickins wrote: > > > On Thu, 15 Jul 2021, Igor Raits wrote: > > > > > > > Hi everyone again, > > > > > > > > I've been trying to reproduce this issue but still can't find a > consistent > > > > pattern. > > > > > > > > However, it did happen once more and this time on 5.13.1: > > > > > > Thanks for the updates, Igor. > > > > > > I have to admit that what you have reported confirms the suspicion > > > that it's a bug introduced by one of my "stable" patches in 5.12.14 > > > (which are also in 5.13): nothing else between 5.12.12 and 5.12.14 > > > seems likely to be relevant. > > > > > > But I've gone back and forth and not been able to spot the problem. > > > > > > Please would you send (either privately to me, or to the list) your > > > 5.13.1 kernel's .config, and disassembly of pmd_migration_entry_wait() > > > from its vmlinux (with line numbers if available; or just send the > > > whole vmlinux if that's easier, and I'll disassemble). > > > > > > I am hoping that the disassembly, together with the register contents > > > that you've shown, will help guide towards an answer. > > > > > > Thanks, > > > Hugh > > > > > > > > > > > [ 222.068216] ------------[ cut here ]------------ > > > > [ 222.072884] kernel BUG at include/linux/swapops.h:204! 
> > > > [ 222.078062] invalid opcode: 0000 [#1] SMP NOPTI > > > > [ 222.082618] CPU: 38 PID: 9828 Comm: rpc-worker Tainted: G > E > > > > 5.13.1-1.gdc.el8.x86_64 #1 > > > > [ 222.091894] Hardware name: HPE ProLiant DL380 Gen10/ProLiant DL380 > > > > Gen10, BIOS U30 05/24/2021 > > > > [ 222.100468] RIP: 0010:pmd_migration_entry_wait+0x132/0x140 > > > > [ 222.105994] Code: 02 00 00 00 5b 4c 89 c7 5d e9 ca c5 f6 ff 48 81 > e2 00 > > > > f0 ff ff 48 f7 d2 48 21 c2 89 d1 f7 c2 81 01 00 00 75 80 e9 44 ff ff > ff > > > > <0f> 0b 48 8b 2d 65 48 30 01 e9 ef fe ff ff 0f 1f 44 00 00 41 55 48 > > > > [ 222.124878] RSP: 0000:ffffbcfe9eb7bdd8 EFLAGS: 00010246 > > > > [ 222.130134] RAX: 0057ffffc0000000 RBX: ffff9eec4b3cdbf8 RCX: > > > > ffffffffffffffff > > > > [ 222.137309] RDX: 0000000000000000 RSI: ffff9eec4b3cdbf8 RDI: > > > > ffffdf55c52cf368 > > > > [ 222.144485] RBP: ffffdf55c52cf368 R08: ffffdf57428d8080 R09: > > > > 0000000000000000 > > > > [ 222.151661] R10: 0000000000000000 R11: 0000000000000000 R12: > > > > 0000000000000bf8 > > > > [ 222.158837] R13: 0400000000000000 R14: 0400000000000080 R15: > > > > ffff9eec2825b1f8 > > > > [ 222.166015] FS: 00007f6754aeb700(0000) GS:ffff9f49bfd00000(0000) > > > > knlGS:0000000000000000 > > > > [ 222.174153] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > > > > [ 222.179932] CR2: 00007f676ffffd98 CR3: 000000012bf6a002 CR4: > > > > 00000000007726e0 > > > > [ 222.187109] DR0: 0000000000000000 DR1: 0000000000000000 DR2: > > > > 0000000000000000 > > > > [ 222.194283] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: > > > > 0000000000000400 > > > > [ 222.201457] PKRU: 55555554 > > > > [ 222.204178] Call Trace: > > > > [ 222.206638] __handle_mm_fault+0x5ad/0x6e0 > > > > [ 222.210760] ? sysvec_call_function_single+0xb/0x90 > > > > [ 222.215672] handle_mm_fault+0xc5/0x290 > > > > [ 222.219529] do_user_addr_fault+0x1a9/0x660 > > > > [ 222.223740] ? 
sched_clock_cpu+0xc/0xa0 > > > > [ 222.227602] exc_page_fault+0x68/0x130 > > > > [ 222.231373] ? asm_exc_page_fault+0x8/0x30 > > > > [ 222.235495] asm_exc_page_fault+0x1e/0x30 > > > > [ 222.239526] RIP: 0033:0x7f67baaed734 > > > > [ 222.243120] Code: 89 08 48 8b 35 dd 3b 21 00 4c 8d 0d d6 3b 21 00 > 31 c0 > > > > 4c 39 ce 74 73 0f 1f 80 00 00 00 00 48 8d 96 40 fd ff ff 49 39 d2 74 > 22 > > > > <48> 8b 96 d8 03 00 00 48 01 15 4e 7c 21 00 80 be 50 03 00 00 00 c7 > > > > [ 222.262002] RSP: 002b:00007f6754aea298 EFLAGS: 00010287 > > > > [ 222.267257] RAX: 0000000000000000 RBX: 0000000000000000 RCX: > > > > 0000000000000000 > > > > [ 222.274432] RDX: 00007f676ffff700 RSI: 00007f676ffff9c0 RDI: > > > > 00007f676f7fec10 > > > > [ 222.281609] RBP: 0000000000000001 R08: 00007f676f7fed10 R09: > > > > 00007f67bad012f0 > > > > [ 222.288785] R10: 00007f6754aeb700 R11: 0000000000000202 R12: > > > > 0000000000000001 > > > > [ 222.295961] R13: 0000000000000006 R14: 0000000000000e28 R15: > > > > 00007f674006e1f0 > > > > [ 222.303137] Modules linked in: vhost_net(E) vhost(E) > vhost_iotlb(E) > > > > tap(E) tun(E) tcp_diag(E) udp_diag(E) inet_diag(E) netconsole(E) > > > > nf_tables(E) vxlan(E) ip6_udp_tunnel(E) udp_tunnel(E) nfnetlink(E) > > > > binfmt_misc(E) iscsi_tcp(E) libiscsi_tcp(E) 8021q(E) garp(E) mrp(E) > > > > bonding(E) tls(E) vfat(E) fat(E) dm_service_time(E) dm_multipath(E) > > > > rpcrdma(E) sunrpc(E) rdma_ucm(E) ib_srpt(E) ib_isert(E) > iscsi_target_mod(E) > > > > target_core_mod(E) ib_iser(E) rdma_cm(E) iw_cm(E) ib_cm(E) > libiscsi(E) > > > > intel_rapl_msr(E) intel_rapl_common(E) scsi_transport_iscsi(E) > > > > isst_if_common(E) ipmi_ssif(E) nfit(E) libnvdimm(E) > x86_pkg_temp_thermal(E) > > > > intel_powerclamp(E) coretemp(E) kvm_intel(E) kvm(E) irqbypass(E) > > > > crct10dif_pclmul(E) crc32_pclmul(E) ghash_clmulni_intel(E) rapl(E) > qedr(E) > > > > mei_me(E) acpi_ipmi(E) ib_uverbs(E) intel_cstate(E) ipmi_si(E) > ib_core(E) > > > > ipmi_devintf(E) dm_mod(E) 
ioatdma(E) ses(E) intel_uncore(E) pcspkr(E) > > > > enclosure(E) mei(E) hpwdt(E) hpilo(E) lpc_ich(E) intel_pch_thermal(E) > > > > dca(E) ipmi_msghandler(E) > > > > [ 222.303181] acpi_power_meter(E) ext4(E) mbcache(E) jbd2(E) > sd_mod(E) > > > > t10_pi(E) sg(E) qedf(E) qede(E) libfcoe(E) qed(E) libfc(E) > smartpqi(E) > > > > scsi_transport_fc(E) tg3(E) scsi_transport_sas(E) crc8(E) wmi(E) > > > > nf_conntrack(E) libcrc32c(E) crc32c_intel(E) nf_defrag_ipv6(E) > > > > nf_defrag_ipv4(E) br_netfilter(E) bridge(E) stp(E) llc(E) > > > > [ 222.420050] ---[ end trace bcf7b6d1610cc21f ]--- > > > > [ 222.572925] RIP: 0010:pmd_migration_entry_wait+0x132/0x140 > > > > [ 222.578469] Code: 02 00 00 00 5b 4c 89 c7 5d e9 ca c5 f6 ff 48 81 > e2 00 > > > > f0 ff ff 48 f7 d2 48 21 c2 89 d1 f7 c2 81 01 00 00 75 80 e9 44 ff ff > ff > > > > <0f> 0b 48 8b 2d 65 48 30 01 e9 ef fe ff ff 0f 1f 44 00 00 41 55 48 > > > > [ 222.597359] RSP: 0000:ffffbcfe9eb7bdd8 EFLAGS: 00010246 > > > > [ 222.602620] RAX: 0057ffffc0000000 RBX: ffff9eec4b3cdbf8 RCX: > > > > ffffffffffffffff > > > > [ 222.609807] RDX: 0000000000000000 RSI: ffff9eec4b3cdbf8 RDI: > > > > ffffdf55c52cf368 > > > > [ 222.616990] RBP: ffffdf55c52cf368 R08: ffffdf57428d8080 R09: > > > > 0000000000000000 > > > > [ 222.624177] R10: 0000000000000000 R11: 0000000000000000 R12: > > > > 0000000000000bf8 > > > > [ 222.631361] R13: 0400000000000000 R14: 0400000000000080 R15: > > > > ffff9eec2825b1f8 > > > > [ 222.638548] FS: 00007f6754aeb700(0000) GS:ffff9f49bfd00000(0000) > > > > knlGS:0000000000000000 > > > > [ 222.646694] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > > > > [ 222.652481] CR2: 00007f676ffffd98 CR3: 000000012bf6a002 CR4: > > > > 00000000007726e0 > > > > [ 222.659665] DR0: 0000000000000000 DR1: 0000000000000000 DR2: > > > > 0000000000000000 > > > > [ 222.666850] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: > > > > 0000000000000400 > > > > [ 222.674031] PKRU: 55555554 > > > > [ 222.676758] Kernel panic - not syncing: 
Fatal exception > > > > [ 222.817538] Kernel Offset: 0x16000000 from 0xffffffff81000000 > > > > (relocation range: 0xffffffff80000000-0xffffffffbfffffff) > > > > [ 222.965540] ---[ end Kernel panic - not syncing: Fatal exception > ]--- > > > > > > > > On Sun, Jul 11, 2021 at 8:06 AM Igor Raits > wrote: > > > > > > > > > Hi Hugh, > > > > > > > > > > On Sun, Jul 11, 2021 at 6:17 AM Hugh Dickins > wrote: > > > > > > > > > >> On Sat, 10 Jul 2021, Igor Raits wrote: > > > > >> > > > > >> > Hello, > > > > >> > > > > > >> > I've seen one weird bug on 5.12.14 that happened a couple of > times when > > > > >> I > > > > >> > started a bunch of VMs on a server. > > > > >> > > > > >> Would it be possible for you to try the same on a 5.12.13 kernel? > > > > >> Perhaps by reverting the diff between 5.12.13 and 5.12.14 > temporarily. > > > > >> Enough to form an impression of whether the issue is new in > 5.12.14. > > > > >> > > > > > > > > > > We've been using 5.12.12 for quite some time (~ a month) and I > never saw > > > > > it there. > > > > > > > > > > But I have to admit that I don't really have a reproducer. For > example, on > > > > > servers where it happened, > > > > > I just rebooted them and panic did not happen anymore (so I saw it > only > > > > > only once, > > > > > only on 2 servers out of 32 that we have on 5.12.14). > > > > > > > > > > > > > > >> I ask because 5.12.14 did include several fixes and cleanups from > me > > > > >> to page_vma_mapped_walk(), and that is involved in inserting and > > > > >> removing pmd migration entries. I am not aware of introducing any > > > > >> bug there, but your report has got me worried. If it's happening > in > > > > >> 5.12.14 but not in 5.12.13, then I must look again at my changes. > > > > >> > > > > >> I don't expect Hillf's patch to help at at all: the pmd_lock() > > > > >> is supposed to be taken by page_vma_mapped_walk(), before > > > > >> set_pmd_migration_entry() and remove_migration_pmd() are called. 
> > > > >> > > > > >> Thanks, > > > > >> Hugh > > > > >> > > > > >> > > > > > >> > I've briefly googled this problem but could not find any > relevant commit > > > > >> > that would fix this issue. > > > > >> > > > > > >> > Do you have any hint how to debug this further or know the fix > by any > > > > >> > chance? > > > > >> > > > > > >> > Thanks in advance. Stack trace following: > > > > >> > > > > > >> > [ 376.876610] ------------[ cut here ]------------ > > > > >> > [ 376.881274] kernel BUG at include/linux/swapops.h:204! > > > > >> > [ 376.886455] invalid opcode: 0000 [#1] SMP NOPTI > > > > >> > [ 376.891014] CPU: 40 PID: 11775 Comm: rpc-worker Tainted: G > > > > >> E > > > > >> > 5.12.14-1.gdc.el8.x86_64 #1 > > > > >> > [ 376.900464] Hardware name: HPE ProLiant DL380 Gen10/ProLiant > DL380 > > > > >> > Gen10, BIOS U30 05/24/2021 > > > > >> > [ 376.909038] RIP: 0010:pmd_migration_entry_wait+0x132/0x140 > > > > >> > [ 376.914562] Code: 02 00 00 00 5b 4c 89 c7 5d e9 8a e4 f6 ff > 48 81 e2 > > > > >> 00 > > > > >> > f0 ff ff 48 f7 d2 48 21 c2 89 d1 f7 c2 81 01 00 00 75 80 e9 44 > ff ff ff > > > > >> > <0f> 0b 48 8b 2d 75 bd 30 01 e9 ef fe ff ff 0f 1f 44 00 00 41 > 55 48 > > > > >> > [ 376.933443] RSP: 0000:ffffb65a5e1cfdc8 EFLAGS: 00010246 > > > > >> > [ 376.938701] RAX: 0017ffffc0000000 RBX: ffff908b8ecabaf8 RCX: > > > > >> > ffffffffffffffff > > > > >> > [ 376.945878] RDX: 0000000000000000 RSI: ffff908b8ecabaf8 RDI: > > > > >> > fffff497473b2ae8 > > > > >> > [ 376.953055] RBP: fffff497473b2ae8 R08: fffff49747fa8080 R09: > > > > >> > 0000000000000000 > > > > >> > [ 376.960230] R10: 0000000000000000 R11: 0000000000000000 R12: > > > > >> > 0000000000000af8 > > > > >> > [ 376.967407] R13: 0400000000000000 R14: 0400000000000080 R15: > > > > >> > ffff908bbef7b6a8 > > > > >> > [ 376.974582] FS: 00007f5bb1f81700(0000) > GS:ffff90e87fd80000(0000) > > > > >> > knlGS:0000000000000000 > > > > >> > [ 376.982718] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > > > > >> > 
[ 376.988497] CR2: 00007f5b2bfffd98 CR3: 00000001f793e006 CR4: > > > > >> > 00000000007726e0 > > > > >> > [ 376.995673] DR0: 0000000000000000 DR1: 0000000000000000 DR2: > > > > >> > 0000000000000000 > > > > >> > [ 377.002849] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: > > > > >> > 0000000000000400 > > > > >> > [ 377.010026] PKRU: 55555554 > > > > >> > [ 377.012745] Call Trace: > > > > >> > [ 377.015207] __handle_mm_fault+0x5ad/0x6e0 > > > > >> > [ 377.019335] handle_mm_fault+0xc5/0x290 > > > > >> > [ 377.023194] do_user_addr_fault+0x1cd/0x740 > > > > >> > [ 377.027406] exc_page_fault+0x54/0x110 > > > > >> > [ 377.031182] ? asm_exc_page_fault+0x8/0x30 > > > > >> > [ 377.035307] asm_exc_page_fault+0x1e/0x30 > > > > >> > [ 377.039340] RIP: 0033:0x7f5bb91d6734 > > > > >> > [ 377.042937] Code: 89 08 48 8b 35 dd 3b 21 00 4c 8d 0d d6 3b > 21 00 31 > > > > >> c0 > > > > >> > 4c 39 ce 74 73 0f 1f 80 00 00 00 00 48 8d 96 40 fd ff ff 49 39 > d2 74 22 > > > > >> > <48> 8b 96 d8 03 00 00 48 01 15 4e 7c 21 00 80 be 50 03 00 00 > 00 c7 > > > > >> > [ 377.061820] RSP: 002b:00007f5bb1f7ff58 EFLAGS: 00010206 > > > > >> > [ 377.067076] RAX: 0000000000000000 RBX: 0000000000000000 RCX: > > > > >> > 00007f5ba0000020 > > > > >> > [ 377.074255] RDX: 00007f5b2bfff700 RSI: 00007f5b2bfff9c0 RDI: > > > > >> > 0000000000000001 > > > > >> > [ 377.081429] RBP: 0000000000000001 R08: 0000000000000000 R09: > > > > >> > 00007f5bb93ea2f0 > > > > >> > [ 377.088606] R10: 00007f5bb1f81700 R11: 0000000000000202 R12: > > > > >> > 0000000000000001 > > > > >> > [ 377.095782] R13: 0000000000000006 R14: 0000000000000cb4 R15: > > > > >> > 00007f5bb1f801f0 > > > > >> > [ 377.102958] Modules linked in: ebt_arp(E) nft_meta_bridge(E) > > > > >> > ip6_tables(E) xt_CT(E) nf_log_ipv4(E) nf_log_common(E) > nft_limit(E) > > > > >> > nft_counter(E) xt_LOG(E) xt_limit(E) xt_mac(E) xt_set(E) > xt_multiport(E) > > > > >> > xt_state(E) xt_conntrack(E) xt_comment(E) xt_physdev(E) > nft_compat(E) > > > > >> > 
ip_set_hash_net(E) ip_set(E) vhost_net(E) vhost(E) > vhost_iotlb(E) tap(E) > > > > >> > tun(E) tcp_diag(E) udp_diag(E) inet_diag(E) netconsole(E) > nf_tables(E) > > > > >> > vxlan(E) ip6_udp_tunnel(E) udp_tunnel(E) nfnetlink(E) > binfmt_misc(E) > > > > >> > iscsi_tcp(E) libiscsi_tcp(E) 8021q(E) garp(E) mrp(E) bonding(E) > tls(E) > > > > >> > vfat(E) fat(E) dm_service_time(E) dm_multipath(E) rpcrdma(E) > sunrpc(E) > > > > >> > rdma_ucm(E) ib_srpt(E) ib_isert(E) iscsi_target_mod(E) > > > > >> target_core_mod(E) > > > > >> > ib_iser(E) rdma_cm(E) iw_cm(E) ib_cm(E) libiscsi(E) > > > > >> scsi_transport_iscsi(E) > > > > >> > intel_rapl_msr(E) qedr(E) intel_rapl_common(E) ib_uverbs(E) > > > > >> > isst_if_common(E) ib_core(E) nfit(E) libnvdimm(E) > > > > >> x86_pkg_temp_thermal(E) > > > > >> > intel_powerclamp(E) coretemp(E) kvm_intel(E) kvm(E) irqbypass(E) > > > > >> > crct10dif_pclmul(E) > > > > >> > [ 377.102999] crc32_pclmul(E) ghash_clmulni_intel(E) rapl(E) > > > > >> > intel_cstate(E) ipmi_ssif(E) acpi_ipmi(E) ipmi_si(E) mei_me(E) > > > > >> ioatdma(E) > > > > >> > ipmi_devintf(E) dm_mod(E) ses(E) intel_uncore(E) pcspkr(E) > qede(E) > > > > >> > enclosure(E) tg3(E) mei(E) lpc_ich(E) hpilo(E) hpwdt(E) > > > > >> > intel_pch_thermal(E) dca(E) ipmi_msghandler(E) > acpi_power_meter(E) > > > > >> ext4(E) > > > > >> > mbcache(E) jbd2(E) sd_mod(E) t10_pi(E) sg(E) qedf(E) qed(E) > crc8(E) > > > > >> > libfcoe(E) libfc(E) smartpqi(E) scsi_transport_fc(E) > > > > >> scsi_transport_sas(E) > > > > >> > wmi(E) nf_conntrack(E) nf_defrag_ipv6(E) libcrc32c(E) > crc32c_intel(E) > > > > >> > nf_defrag_ipv4(E) br_netfilter(E) bridge(E) stp(E) llc(E) > > > > >> > [ 377.243468] ---[ end trace 04bce3bb051f7620 ]--- > > > > >> > [ 377.385645] RIP: 0010:pmd_migration_entry_wait+0x132/0x140 > > > > >> > [ 377.391194] Code: 02 00 00 00 5b 4c 89 c7 5d e9 8a e4 f6 ff > 48 81 e2 > > > > >> 00 > > > > >> > f0 ff ff 48 f7 d2 48 21 c2 89 d1 f7 c2 81 01 00 00 75 80 e9 44 > ff ff ff > > > > >> > 
<0f> 0b 48 8b 2d 75 bd 30 01 e9 ef fe ff ff 0f 1f 44 00 00 41 55 48
> > > >> > [ 377.410091] RSP: 0000:ffffb65a5e1cfdc8 EFLAGS: 00010246
> > > >> > [ 377.415355] RAX: 0017ffffc0000000 RBX: ffff908b8ecabaf8 RCX: ffffffffffffffff
> > > >> > [ 377.422540] RDX: 0000000000000000 RSI: ffff908b8ecabaf8 RDI: fffff497473b2ae8
> > > >> > [ 377.429721] RBP: fffff497473b2ae8 R08: fffff49747fa8080 R09: 0000000000000000
> > > >> > [ 377.436902] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000af8
> > > >> > [ 377.444086] R13: 0400000000000000 R14: 0400000000000080 R15: ffff908bbef7b6a8
> > > >> > [ 377.451272] FS: 00007f5bb1f81700(0000) GS:ffff90e87fd80000(0000) knlGS:0000000000000000
> > > >> > [ 377.459415] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > >> > [ 377.465196] CR2: 00007f5b2bfffd98 CR3: 00000001f793e006 CR4: 00000000007726e0
> > > >> > [ 377.472377] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > > >> > [ 377.479556] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> > > >> > [ 377.486738] PKRU: 55555554
> > > >> > [ 377.489465] Kernel panic - not syncing: Fatal exception
> > > >> > [ 377.573911] Kernel Offset: 0xa000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
> > > >> > [ 377.716482] ---[ end Kernel panic - not syncing: Fatal exception ]---
> >
> > Disassembly of the vmlinux Igor sent (along with other info) confirmed
> > something I suspected, that R08: fffff49747fa8080 in one of the dumps,
> > R08: ffffdf57428d8080 in the other, is the relevant struct page pointer
> > (and RAX the page->flags, which look like it was pointing at a good
> > page).
> >
> > A page pointer ....8080 in pmd_migration_entry_wait() is interesting:
> > normally I'd expect that to be ....0000 or ....8000, pointing to the
> > head of a huge page. But instead it's pointing to the second tail
> > (though by now that compound page has been freed, and head pointers in
> > the tails reset to 0): as if the pfn has been incremented by 2 somehow.
> >
> > And if the pfn (swp_offset) in the migration entry has got corrupted,
> > then it's no surprise that when removing migration entries,
> > page_vma_mapped_walk() would see migration_entry_to_page(entry) != page,
> > so be unable to replace that migration entry, leaving it behind for the
> > user to hit BUG_ON(!PageLocked) in pmd_migration_entry_wait() when
> > faulting on it later.
> >
> > So, what might increment the swp_offset by 2? Hunt around the encodings.
> > Hmm, _PAGE_BIT_UFFD_WP is _PAGE_BIT_SOFTW2 which is bit 10, whereas
> > _PAGE_BIT_PROTNONE (top bit to be avoided in pte encoding of swap)
> > is _PAGE_BIT_GLOBAL is bit 8. After overcoming off-by-one confusions,
> > it looks like if something somewhere were to set _PAGE_BIT_UFFD_WP
> > in a migration pmd (whereas it's only suitable for a present pmd),
> > it would indeed increment the swp_offset by 2.
> >
> > Hunt for uffd_wps, and run across copy_huge_pmd() in mm/huge_memory.c:
> > in Igor's 5.13.1 and 5.12.14 and many others, that says
> >
> >         if (!(vma->vm_flags & VM_UFFD_WP))
> >                 pmd = pmd_clear_uffd_wp(pmd);
> >
> > just *before* checking is_swap_pmd(). Fixed in 5.14-rc1 in commit
> > 8f34f1eac382 ("mm/userfaultfd: fix uffd-wp special cases for fork()").
> >
> > But clearing the bit would be harmless, wouldn't it? Because it wouldn't
> > be set anyway.
Waste a day before remembering what I never forgot but
> > somehow blanked out: the L1TF "feature" forced us to invert the offset
> > bits in the pte encoding of a swap entry, so there really is a bit set
> > there in the pmd entry, and clearing it has the effect of setting it in
> > the corresponding swap entry, so incrementing the migration pfn by 2.
> >
> > I cannot explain why Igor never saw this crash on 5.12.12: maybe
> > something else in the environment changed around that time. And it
> > will take several days for it to be confirmed as the fix in practice.
> >
> > But I'm confident that 8f34f1eac382 will prove to be the fix, so Peter
> > please prepare some backports of that for the various stable/longterm
> > kernels that need it - I've not looked into whether it applies cleanly,
> > or depends on other commits too. You fixed several related but different
> > things in that commit: but this one is the worst, because it can corrupt
> > even those who are not using UFFD_WP at all.
>
> Looks right to me, b569a1760782 ("userfaultfd: wp: drop _PAGE_UFFD_WP
> properly when fork", 2020-04-07) seems to be the culprit. I didn't notice
> the side effect in the bug or in the fix, or it should have already landed
> in stable. I am very sorry for such an elementary bug that caused this
> fallout - I really can't tell why I completely overlooked is_swap_pte(),
> which is so obvious in hindsight.
>
> I checked: 5.6.y doesn't have the offending commit yet, as it's not marked
> with "Fixes". It started to show up in 5.7.y~5.13.y. 5.14-rc1 has
> 8f34f1eac382, which is the fix. So I think we need the fix, or an
> equivalent fix, for 5.7.y~5.13.y.
>
> 5.12.y & 5.13.y can pick up the fix 8f34f1eac382 cleanly. For the older
> branches (5.7.y~5.11.y) they can't, so I plan to revert b569a1760782
> instead.

FTR, even though 8f34f1eac382 applies cleanly it does not compile.
The 1st patch of that series (5fc7a5f6fd04) is also required: it removes a
use of *vma, and *vma itself is later removed by the patch that fixes the
actual problem.

> >
> > Many thanks for reporting and helping, Igor.
> > Hugh
> >
> > p.s. Peter, unrelated to this particular bug, and should not divert from
> > fixing it: but looking again at those swap encodings, and particularly
> > the soft_dirty manipulations: they look very fragile. I think uffd_wp
> > was wrong to follow that bad example, and your upcoming new encoding
> > (that I have previously called elegant) takes it a worse step further.
> >
> > I think we should change to a rule where the architecture-independent
> > swp_entry_t contains *all* the info, including bits for soft_dirty and
> > uffd_wp, so that swap entry cases can move immediately to decoding from
> > arch-dependent pte to arch-independent swp_entry_t, and do all the
> > manipulations on that. But I don't have time to make that change, and
> > probably neither do you, and making the change is liable to introduce
> > errors itself. So, no immediate plans, but please keep in mind.
>
> Curious: did we encounter a similar issue previously, where the soft-dirty
> bit was applied wrongly and caused hard-to-debug issues?
>
> If this is destined to be the best solution, I can work on both of them.
> I am just worried that it's too big a change, as you said, so we don't
> know what's most efficient considering the total time we spend to develop,
> review and debug them.
>
> The other alternative is that we fix bugs; I know that's so cheap a word
> when I say it, but we still can't deny it as an option yet.
>
> We can definitely discuss this outside of this thread, and I'll prepare
> the backports first. In any case, this bug definitely raises an alert,
> and I'll keep that in mind.
>
> Please let me know if there's any comment on the backport plan above, or
> I'll prepare the patches for all the branches before tomorrow.
>
> Thanks,
>
> --
> Peter Xu

--
Igor Raits
Sr. SW Engineer
igor@gooddata.com
+420 775 117 817
Moravske namesti 1007/14
602 00 Brno-Veveri, Czech Republic