Re: kernel BUG at include/linux/swapops.h:204!

From: Peter Xu <peterx@redhat.com>
To: Hugh Dickins <hughd@google.com>
Cc: Igor Raits <igor@gooddata.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Hillf Danton <hdanton@sina.com>,
	Axel Rasmussen <axelrasmussen@google.com>,
	linux-mm@kvack.org
Subject: Re: kernel BUG at include/linux/swapops.h:204!
Date: Mon, 19 Jul 2021 18:12:59 -0400
Message-ID: <YPX46x/pet5Sn5gC@t490s>
In-Reply-To: <796cbb7-5a1c-1ba0-dde5-479aba8224f2@google.com>

On Mon, Jul 19, 2021 at 12:11:21PM -0700, Hugh Dickins wrote:
> Hi Peter,

Hi, Hugh,

> 
> I believe you have already fixed this, but the fix needs to go to stable.
> Sorry, the messages below are a muddle of top and middle posting,
> I'll resume at the bottom.
> 
> On Fri, 16 Jul 2021, Hugh Dickins wrote:
> > On Thu, 15 Jul 2021, Igor Raits wrote:
> > 
> > > Hi everyone again,
> > > 
> > > I've been trying to reproduce this issue but still can't find a consistent
> > > pattern.
> > > 
> > > However, it did happen once more and this time on 5.13.1:
> > 
> > Thanks for the updates, Igor.
> > 
> > I have to admit that what you have reported confirms the suspicion
> > that it's a bug introduced by one of my "stable" patches in 5.12.14
> > (which are also in 5.13): nothing else between 5.12.12 and 5.12.14
> > seems likely to be relevant.
> > 
> > But I've gone back and forth and not been able to spot the problem.
> > 
> > Please would you send (either privately to me, or to the list) your
> > 5.13.1 kernel's .config, and disassembly of pmd_migration_entry_wait()
> > from its vmlinux (with line numbers if available; or just send the
> > whole vmlinux if that's easier, and I'll disassemble).
> > 
> > I am hoping that the disassembly, together with the register contents
> > that you've shown, will help guide towards an answer.
> > 
> > Thanks,
> > Hugh
> > 
> > > 
> > > [  222.068216] ------------[ cut here ]------------
> > > [  222.072884] kernel BUG at include/linux/swapops.h:204!
> > > [  222.078062] invalid opcode: 0000 [#1] SMP NOPTI
> > > [  222.082618] CPU: 38 PID: 9828 Comm: rpc-worker Tainted: G            E
> > >   5.13.1-1.gdc.el8.x86_64 #1
> > > [  222.091894] Hardware name: HPE ProLiant DL380 Gen10/ProLiant DL380
> > > Gen10, BIOS U30 05/24/2021
> > > [  222.100468] RIP: 0010:pmd_migration_entry_wait+0x132/0x140
> > > [  222.105994] Code: 02 00 00 00 5b 4c 89 c7 5d e9 ca c5 f6 ff 48 81 e2 00
> > > f0 ff ff 48 f7 d2 48 21 c2 89 d1 f7 c2 81 01 00 00 75 80 e9 44 ff ff ff
> > > <0f> 0b 48 8b 2d 65 48 30 01 e9 ef fe ff ff 0f 1f 44 00 00 41 55 48
> > > [  222.124878] RSP: 0000:ffffbcfe9eb7bdd8 EFLAGS: 00010246
> > > [  222.130134] RAX: 0057ffffc0000000 RBX: ffff9eec4b3cdbf8 RCX:
> > > ffffffffffffffff
> > > [  222.137309] RDX: 0000000000000000 RSI: ffff9eec4b3cdbf8 RDI:
> > > ffffdf55c52cf368
> > > [  222.144485] RBP: ffffdf55c52cf368 R08: ffffdf57428d8080 R09:
> > > 0000000000000000
> > > [  222.151661] R10: 0000000000000000 R11: 0000000000000000 R12:
> > > 0000000000000bf8
> > > [  222.158837] R13: 0400000000000000 R14: 0400000000000080 R15:
> > > ffff9eec2825b1f8
> > > [  222.166015] FS:  00007f6754aeb700(0000) GS:ffff9f49bfd00000(0000)
> > > knlGS:0000000000000000
> > > [  222.174153] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > [  222.179932] CR2: 00007f676ffffd98 CR3: 000000012bf6a002 CR4:
> > > 00000000007726e0
> > > [  222.187109] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> > > 0000000000000000
> > > [  222.194283] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
> > > 0000000000000400
> > > [  222.201457] PKRU: 55555554
> > > [  222.204178] Call Trace:
> > > [  222.206638]  __handle_mm_fault+0x5ad/0x6e0
> > > [  222.210760]  ? sysvec_call_function_single+0xb/0x90
> > > [  222.215672]  handle_mm_fault+0xc5/0x290
> > > [  222.219529]  do_user_addr_fault+0x1a9/0x660
> > > [  222.223740]  ? sched_clock_cpu+0xc/0xa0
> > > [  222.227602]  exc_page_fault+0x68/0x130
> > > [  222.231373]  ? asm_exc_page_fault+0x8/0x30
> > > [  222.235495]  asm_exc_page_fault+0x1e/0x30
> > > [  222.239526] RIP: 0033:0x7f67baaed734
> > > [  222.243120] Code: 89 08 48 8b 35 dd 3b 21 00 4c 8d 0d d6 3b 21 00 31 c0
> > > 4c 39 ce 74 73 0f 1f 80 00 00 00 00 48 8d 96 40 fd ff ff 49 39 d2 74 22
> > > <48> 8b 96 d8 03 00 00 48 01 15 4e 7c 21 00 80 be 50 03 00 00 00 c7
> > > [  222.262002] RSP: 002b:00007f6754aea298 EFLAGS: 00010287
> > > [  222.267257] RAX: 0000000000000000 RBX: 0000000000000000 RCX:
> > > 0000000000000000
> > > [  222.274432] RDX: 00007f676ffff700 RSI: 00007f676ffff9c0 RDI:
> > > 00007f676f7fec10
> > > [  222.281609] RBP: 0000000000000001 R08: 00007f676f7fed10 R09:
> > > 00007f67bad012f0
> > > [  222.288785] R10: 00007f6754aeb700 R11: 0000000000000202 R12:
> > > 0000000000000001
> > > [  222.295961] R13: 0000000000000006 R14: 0000000000000e28 R15:
> > > 00007f674006e1f0
> > > [  222.303137] Modules linked in: vhost_net(E) vhost(E) vhost_iotlb(E)
> > > tap(E) tun(E) tcp_diag(E) udp_diag(E) inet_diag(E) netconsole(E)
> > > nf_tables(E) vxlan(E) ip6_udp_tunnel(E) udp_tunnel(E) nfnetlink(E)
> > > binfmt_misc(E) iscsi_tcp(E) libiscsi_tcp(E) 8021q(E) garp(E) mrp(E)
> > > bonding(E) tls(E) vfat(E) fat(E) dm_service_time(E) dm_multipath(E)
> > > rpcrdma(E) sunrpc(E) rdma_ucm(E) ib_srpt(E) ib_isert(E) iscsi_target_mod(E)
> > > target_core_mod(E) ib_iser(E) rdma_cm(E) iw_cm(E) ib_cm(E) libiscsi(E)
> > > intel_rapl_msr(E) intel_rapl_common(E) scsi_transport_iscsi(E)
> > > isst_if_common(E) ipmi_ssif(E) nfit(E) libnvdimm(E) x86_pkg_temp_thermal(E)
> > > intel_powerclamp(E) coretemp(E) kvm_intel(E) kvm(E) irqbypass(E)
> > > crct10dif_pclmul(E) crc32_pclmul(E) ghash_clmulni_intel(E) rapl(E) qedr(E)
> > > mei_me(E) acpi_ipmi(E) ib_uverbs(E) intel_cstate(E) ipmi_si(E) ib_core(E)
> > > ipmi_devintf(E) dm_mod(E) ioatdma(E) ses(E) intel_uncore(E) pcspkr(E)
> > > enclosure(E) mei(E) hpwdt(E) hpilo(E) lpc_ich(E) intel_pch_thermal(E)
> > > dca(E) ipmi_msghandler(E)
> > > [  222.303181]  acpi_power_meter(E) ext4(E) mbcache(E) jbd2(E) sd_mod(E)
> > > t10_pi(E) sg(E) qedf(E) qede(E) libfcoe(E) qed(E) libfc(E) smartpqi(E)
> > > scsi_transport_fc(E) tg3(E) scsi_transport_sas(E) crc8(E) wmi(E)
> > > nf_conntrack(E) libcrc32c(E) crc32c_intel(E) nf_defrag_ipv6(E)
> > > nf_defrag_ipv4(E) br_netfilter(E) bridge(E) stp(E) llc(E)
> > > [  222.420050] ---[ end trace bcf7b6d1610cc21f ]---
> > > [  222.572925] RIP: 0010:pmd_migration_entry_wait+0x132/0x140
> > > [  222.578469] Code: 02 00 00 00 5b 4c 89 c7 5d e9 ca c5 f6 ff 48 81 e2 00
> > > f0 ff ff 48 f7 d2 48 21 c2 89 d1 f7 c2 81 01 00 00 75 80 e9 44 ff ff ff
> > > <0f> 0b 48 8b 2d 65 48 30 01 e9 ef fe ff ff 0f 1f 44 00 00 41 55 48
> > > [  222.597359] RSP: 0000:ffffbcfe9eb7bdd8 EFLAGS: 00010246
> > > [  222.602620] RAX: 0057ffffc0000000 RBX: ffff9eec4b3cdbf8 RCX:
> > > ffffffffffffffff
> > > [  222.609807] RDX: 0000000000000000 RSI: ffff9eec4b3cdbf8 RDI:
> > > ffffdf55c52cf368
> > > [  222.616990] RBP: ffffdf55c52cf368 R08: ffffdf57428d8080 R09:
> > > 0000000000000000
> > > [  222.624177] R10: 0000000000000000 R11: 0000000000000000 R12:
> > > 0000000000000bf8
> > > [  222.631361] R13: 0400000000000000 R14: 0400000000000080 R15:
> > > ffff9eec2825b1f8
> > > [  222.638548] FS:  00007f6754aeb700(0000) GS:ffff9f49bfd00000(0000)
> > > knlGS:0000000000000000
> > > [  222.646694] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > [  222.652481] CR2: 00007f676ffffd98 CR3: 000000012bf6a002 CR4:
> > > 00000000007726e0
> > > [  222.659665] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> > > 0000000000000000
> > > [  222.666850] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
> > > 0000000000000400
> > > [  222.674031] PKRU: 55555554
> > > [  222.676758] Kernel panic - not syncing: Fatal exception
> > > [  222.817538] Kernel Offset: 0x16000000 from 0xffffffff81000000
> > > (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
> > > [  222.965540] ---[ end Kernel panic - not syncing: Fatal exception ]---
> > > 
> > > On Sun, Jul 11, 2021 at 8:06 AM Igor Raits <igor@gooddata.com> wrote:
> > > 
> > > > Hi Hugh,
> > > >
> > > > On Sun, Jul 11, 2021 at 6:17 AM Hugh Dickins <hughd@google.com> wrote:
> > > >
> > > >> On Sat, 10 Jul 2021, Igor Raits wrote:
> > > >>
> > > >> > Hello,
> > > >> >
> > > >> > I've seen one weird bug on 5.12.14 that happened a couple of times when
> > > >> I
> > > >> > started a bunch of VMs on a server.
> > > >>
> > > >> Would it be possible for you to try the same on a 5.12.13 kernel?
> > > >> Perhaps by reverting the diff between 5.12.13 and 5.12.14 temporarily.
> > > >> Enough to form an impression of whether the issue is new in 5.12.14.
> > > >>
> > > >
> > > > We've been using 5.12.12 for quite some time (~ a month) and I never saw
> > > > it there.
> > > >
> > > > But I have to admit that I don't really have a reproducer. For example, on
> > > > servers where it happened,
> > > > > I just rebooted them and the panic did not happen anymore (so I saw it
> > > > > only once, on only 2 servers out of 32 that we have on 5.12.14).
> > > >
> > > >
> > > >> I ask because 5.12.14 did include several fixes and cleanups from me
> > > >> to page_vma_mapped_walk(), and that is involved in inserting and
> > > >> removing pmd migration entries.  I am not aware of introducing any
> > > >> bug there, but your report has got me worried.  If it's happening in
> > > >> 5.12.14 but not in 5.12.13, then I must look again at my changes.
> > > >>
> > > > >> I don't expect Hillf's patch to help at all: the pmd_lock()
> > > >> is supposed to be taken by page_vma_mapped_walk(), before
> > > >> set_pmd_migration_entry() and remove_migration_pmd() are called.
> > > >>
> > > >> Thanks,
> > > >> Hugh
> > > >>
> > > >> >
> > > >> > I've briefly googled this problem but could not find any relevant commit
> > > >> > that would fix this issue.
> > > >> >
> > > >> > Do you have any hint how to debug this further or know the fix by any
> > > >> > chance?
> > > >> >
> > > >> > Thanks in advance. Stack trace following:
> > > >> >
> > > >> > [  376.876610] ------------[ cut here ]------------
> > > >> > [  376.881274] kernel BUG at include/linux/swapops.h:204!
> > > >> > [  376.886455] invalid opcode: 0000 [#1] SMP NOPTI
> > > >> > [  376.891014] CPU: 40 PID: 11775 Comm: rpc-worker Tainted: G
> > > >>   E
> > > >> >     5.12.14-1.gdc.el8.x86_64 #1
> > > >> > [  376.900464] Hardware name: HPE ProLiant DL380 Gen10/ProLiant DL380
> > > >> > Gen10, BIOS U30 05/24/2021
> > > >> > [  376.909038] RIP: 0010:pmd_migration_entry_wait+0x132/0x140
> > > >> > [  376.914562] Code: 02 00 00 00 5b 4c 89 c7 5d e9 8a e4 f6 ff 48 81 e2
> > > >> 00
> > > >> > f0 ff ff 48 f7 d2 48 21 c2 89 d1 f7 c2 81 01 00 00 75 80 e9 44 ff ff ff
> > > >> > <0f> 0b 48 8b 2d 75 bd 30 01 e9 ef fe ff ff 0f 1f 44 00 00 41 55 48
> > > >> > [  376.933443] RSP: 0000:ffffb65a5e1cfdc8 EFLAGS: 00010246
> > > >> > [  376.938701] RAX: 0017ffffc0000000 RBX: ffff908b8ecabaf8 RCX:
> > > >> > ffffffffffffffff
> > > >> > [  376.945878] RDX: 0000000000000000 RSI: ffff908b8ecabaf8 RDI:
> > > >> > fffff497473b2ae8
> > > >> > [  376.953055] RBP: fffff497473b2ae8 R08: fffff49747fa8080 R09:
> > > >> > 0000000000000000
> > > >> > [  376.960230] R10: 0000000000000000 R11: 0000000000000000 R12:
> > > >> > 0000000000000af8
> > > >> > [  376.967407] R13: 0400000000000000 R14: 0400000000000080 R15:
> > > >> > ffff908bbef7b6a8
> > > >> > [  376.974582] FS:  00007f5bb1f81700(0000) GS:ffff90e87fd80000(0000)
> > > >> > knlGS:0000000000000000
> > > >> > [  376.982718] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > >> > [  376.988497] CR2: 00007f5b2bfffd98 CR3: 00000001f793e006 CR4:
> > > >> > 00000000007726e0
> > > >> > [  376.995673] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> > > >> > 0000000000000000
> > > >> > [  377.002849] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
> > > >> > 0000000000000400
> > > >> > [  377.010026] PKRU: 55555554
> > > >> > [  377.012745] Call Trace:
> > > >> > [  377.015207]  __handle_mm_fault+0x5ad/0x6e0
> > > >> > [  377.019335]  handle_mm_fault+0xc5/0x290
> > > >> > [  377.023194]  do_user_addr_fault+0x1cd/0x740
> > > >> > [  377.027406]  exc_page_fault+0x54/0x110
> > > >> > [  377.031182]  ? asm_exc_page_fault+0x8/0x30
> > > >> > [  377.035307]  asm_exc_page_fault+0x1e/0x30
> > > >> > [  377.039340] RIP: 0033:0x7f5bb91d6734
> > > >> > [  377.042937] Code: 89 08 48 8b 35 dd 3b 21 00 4c 8d 0d d6 3b 21 00 31
> > > >> c0
> > > >> > 4c 39 ce 74 73 0f 1f 80 00 00 00 00 48 8d 96 40 fd ff ff 49 39 d2 74 22
> > > >> > <48> 8b 96 d8 03 00 00 48 01 15 4e 7c 21 00 80 be 50 03 00 00 00 c7
> > > >> > [  377.061820] RSP: 002b:00007f5bb1f7ff58 EFLAGS: 00010206
> > > >> > [  377.067076] RAX: 0000000000000000 RBX: 0000000000000000 RCX:
> > > >> > 00007f5ba0000020
> > > >> > [  377.074255] RDX: 00007f5b2bfff700 RSI: 00007f5b2bfff9c0 RDI:
> > > >> > 0000000000000001
> > > >> > [  377.081429] RBP: 0000000000000001 R08: 0000000000000000 R09:
> > > >> > 00007f5bb93ea2f0
> > > >> > [  377.088606] R10: 00007f5bb1f81700 R11: 0000000000000202 R12:
> > > >> > 0000000000000001
> > > >> > [  377.095782] R13: 0000000000000006 R14: 0000000000000cb4 R15:
> > > >> > 00007f5bb1f801f0
> > > >> > [  377.102958] Modules linked in: ebt_arp(E) nft_meta_bridge(E)
> > > >> > ip6_tables(E) xt_CT(E) nf_log_ipv4(E) nf_log_common(E) nft_limit(E)
> > > >> > nft_counter(E) xt_LOG(E) xt_limit(E) xt_mac(E) xt_set(E) xt_multiport(E)
> > > >> > xt_state(E) xt_conntrack(E) xt_comment(E) xt_physdev(E) nft_compat(E)
> > > >> > ip_set_hash_net(E) ip_set(E) vhost_net(E) vhost(E) vhost_iotlb(E) tap(E)
> > > >> > tun(E) tcp_diag(E) udp_diag(E) inet_diag(E) netconsole(E) nf_tables(E)
> > > >> > vxlan(E) ip6_udp_tunnel(E) udp_tunnel(E) nfnetlink(E) binfmt_misc(E)
> > > >> > iscsi_tcp(E) libiscsi_tcp(E) 8021q(E) garp(E) mrp(E) bonding(E) tls(E)
> > > >> > vfat(E) fat(E) dm_service_time(E) dm_multipath(E) rpcrdma(E) sunrpc(E)
> > > >> > rdma_ucm(E) ib_srpt(E) ib_isert(E) iscsi_target_mod(E)
> > > >> target_core_mod(E)
> > > >> > ib_iser(E) rdma_cm(E) iw_cm(E) ib_cm(E) libiscsi(E)
> > > >> scsi_transport_iscsi(E)
> > > >> > intel_rapl_msr(E) qedr(E) intel_rapl_common(E) ib_uverbs(E)
> > > >> > isst_if_common(E) ib_core(E) nfit(E) libnvdimm(E)
> > > >> x86_pkg_temp_thermal(E)
> > > >> > intel_powerclamp(E) coretemp(E) kvm_intel(E) kvm(E) irqbypass(E)
> > > >> > crct10dif_pclmul(E)
> > > >> > [  377.102999]  crc32_pclmul(E) ghash_clmulni_intel(E) rapl(E)
> > > >> > intel_cstate(E) ipmi_ssif(E) acpi_ipmi(E) ipmi_si(E) mei_me(E)
> > > >> ioatdma(E)
> > > >> > ipmi_devintf(E) dm_mod(E) ses(E) intel_uncore(E) pcspkr(E) qede(E)
> > > >> > enclosure(E) tg3(E) mei(E) lpc_ich(E) hpilo(E) hpwdt(E)
> > > >> > intel_pch_thermal(E) dca(E) ipmi_msghandler(E) acpi_power_meter(E)
> > > >> ext4(E)
> > > >> > mbcache(E) jbd2(E) sd_mod(E) t10_pi(E) sg(E) qedf(E) qed(E) crc8(E)
> > > >> > libfcoe(E) libfc(E) smartpqi(E) scsi_transport_fc(E)
> > > >> scsi_transport_sas(E)
> > > >> > wmi(E) nf_conntrack(E) nf_defrag_ipv6(E) libcrc32c(E) crc32c_intel(E)
> > > >> > nf_defrag_ipv4(E) br_netfilter(E) bridge(E) stp(E) llc(E)
> > > >> > [  377.243468] ---[ end trace 04bce3bb051f7620 ]---
> > > >> > [  377.385645] RIP: 0010:pmd_migration_entry_wait+0x132/0x140
> > > >> > [  377.391194] Code: 02 00 00 00 5b 4c 89 c7 5d e9 8a e4 f6 ff 48 81 e2
> > > >> 00
> > > >> > f0 ff ff 48 f7 d2 48 21 c2 89 d1 f7 c2 81 01 00 00 75 80 e9 44 ff ff ff
> > > >> > <0f> 0b 48 8b 2d 75 bd 30 01 e9 ef fe ff ff 0f 1f 44 00 00 41 55 48
> > > >> > [  377.410091] RSP: 0000:ffffb65a5e1cfdc8 EFLAGS: 00010246
> > > >> > [  377.415355] RAX: 0017ffffc0000000 RBX: ffff908b8ecabaf8 RCX:
> > > >> > ffffffffffffffff
> > > >> > [  377.422540] RDX: 0000000000000000 RSI: ffff908b8ecabaf8 RDI:
> > > >> > fffff497473b2ae8
> > > >> > [  377.429721] RBP: fffff497473b2ae8 R08: fffff49747fa8080 R09:
> > > >> > 0000000000000000
> > > >> > [  377.436902] R10: 0000000000000000 R11: 0000000000000000 R12:
> > > >> > 0000000000000af8
> > > >> > [  377.444086] R13: 0400000000000000 R14: 0400000000000080 R15:
> > > >> > ffff908bbef7b6a8
> > > >> > [  377.451272] FS:  00007f5bb1f81700(0000) GS:ffff90e87fd80000(0000)
> > > >> > knlGS:0000000000000000
> > > >> > [  377.459415] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > >> > [  377.465196] CR2: 00007f5b2bfffd98 CR3: 00000001f793e006 CR4:
> > > >> > 00000000007726e0
> > > >> > [  377.472377] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> > > >> > 0000000000000000
> > > >> > [  377.479556] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
> > > >> > 0000000000000400
> > > >> > [  377.486738] PKRU: 55555554
> > > >> > [  377.489465] Kernel panic - not syncing: Fatal exception
> > > >> > [  377.573911] Kernel Offset: 0xa000000 from 0xffffffff81000000
> > > >> (relocation
> > > >> > range: 0xffffffff80000000-0xffffffffbfffffff)
> > > >> > [  377.716482] ---[ end Kernel panic - not syncing: Fatal exception ]---
> 
> Disassembly of the vmlinux Igor sent (along with other info) confirmed
> something I suspected, that R08: fffff49747fa8080 in one of the dumps,
> R08: ffffdf57428d8080 in the other, is the relevant struct page pointer
> (and RAX the page->flags, which look like it was pointing at a good page).
> 
> A page pointer ....8080 in pmd_migration_entry_wait() is interesting:
> normally I'd expect that to be ....0000 or ....8000, pointing to the
> head of a huge page.  But instead it's pointing to the second tail
> (though by now that compound page has been freed, and head pointers in
> the tails reset to 0): as if the pfn has been incremented by 2 somehow.
> 
> And if the pfn (swp_offset) in the migration entry has got corrupted,
> then it's no surprise that when removing migration entries,
> page_vma_mapped_walk() would see migration_entry_to_page(entry) != page,
> so be unable to replace that migration entry, leaving it behind for the
> user to hit BUG_ON(!PageLocked) in pmd_migration_entry_wait() when
> faulting on it later.
> 
> So, what might increment the swp_offset by 2? Hunt around the encodings.
> Hmm, _PAGE_BIT_UFFD_WP is _PAGE_BIT_SOFTW2 which is bit 10, whereas
> _PAGE_BIT_PROTNONE (top bit to be avoided in pte encoding of swap)
> is _PAGE_BIT_GLOBAL is bit 8. After overcoming off-by-one confusions,
> it looks like if something somewhere were to set _PAGE_BIT_UFFD_WP
> in a migration pmd (whereas it's only suitable for a present pmd),
> it would indeed increment the swp_offset by 2.
> 
> Hunt for uffd_wps, and run across copy_huge_pmd() in mm/huge_memory.c:
> in Igor's 5.13.1 and 5.12.14 and many others, that says
> 	if (!(vma->vm_flags & VM_UFFD_WP))
> 		pmd = pmd_clear_uffd_wp(pmd);
> just *before* checking is_swap_pmd(). Fixed in 5.14-rc1 in commit
> 8f34f1eac382 ("mm/userfaultfd: fix uffd-wp special cases for fork()").
> 
> But clearing the bit would be harmless, wouldn't it? Because it wouldn't
> be set anyway. Waste a day before remembering what I never forgot but
> somehow blanked out: the L1TF "feature" forced us to invert the offset
> bits in the pte encoding of a swap entry, so there really is a bit set
> there in the pmd entry, and clearing it has the effect of setting it in
> the corresponding swap entry, so incrementing the migration pfn by 2.
> 
> I cannot explain why Igor never saw this crash on 5.12.12: maybe
> something else in the environment changed around that time.  And it
> will take several days for it to be confirmed as the fix in practice.
> 
> But I'm confident that 8f34f1eac382 will prove to be the fix, so Peter
> please prepare some backports of that for the various stable/longterm
> kernels that need it - I've not looked into whether it applies cleanly,
> or depends on other commits too.  You fixed several related but different
> things in that commit: but this one is the worst, because it can corrupt
> even those who are not using UFFD_WP at all.

Looks right to me; b569a1760782 ("userfaultfd: wp: drop _PAGE_UFFD_WP properly
when fork", 2020-04-07) seems to be the culprit.  I didn't notice this side
effect either in the bug or in the fix, otherwise the fix would already have
landed in stable.  I am very sorry for such a basic bug and the fallout it
caused - I really can't tell why I completely failed to look at is_swap_pte(),
which is so obvious in hindsight.
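
To make the mechanism Hugh describes above concrete, here is a minimal
userspace sketch of it, not kernel code.  It assumes the x86_64 swap entry
layout as I read it in arch/x86/include/asm/pgtable_64.h (5 type bits on top,
inverted offset starting one bit above _PAGE_BIT_PROTNONE) and a 64-byte
struct page; the helper names and the migration type value are hypothetical,
chosen only for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Bit positions taken from the discussion above (x86_64). */
#define PAGE_BIT_PROTNONE	8	/* same bit as _PAGE_BIT_GLOBAL */
#define PAGE_BIT_UFFD_WP	10	/* same bit as _PAGE_BIT_SOFTW2 */

/* Assumed layout: type in bits 59-63, ~offset from bit 9 upwards. */
#define SWP_TYPE_BITS		5
#define SWP_OFFSET_FIRST_BIT	(PAGE_BIT_PROTNONE + 1)
#define SWP_OFFSET_SHIFT	(SWP_OFFSET_FIRST_BIT + SWP_TYPE_BITS)

/* Hypothetical value: any type that fits in 5 bits works for the demo. */
#define HYPOTHETICAL_MIGRATION_TYPE	30

/* Assumptions: 64-byte struct page, 512 base pages per huge page. */
#define STRUCT_PAGE_SIZE	64
#define HPAGE_NR		512

/* Encode (type, offset) the way the arch pte/pmd stores a swap entry:
 * the offset bits are stored inverted (the L1TF mitigation). */
static inline uint64_t swp_encode(uint64_t type, uint64_t offset)
{
	assert(type < (1u << SWP_TYPE_BITS));
	return (~offset << SWP_OFFSET_SHIFT >> SWP_TYPE_BITS) |
	       (type << (64 - SWP_TYPE_BITS));
}

static inline uint64_t swp_type(uint64_t val)
{
	return val >> (64 - SWP_TYPE_BITS);
}

/* Undo the inversion and strip the type and low flag bits. */
static inline uint64_t swp_offset(uint64_t val)
{
	return ~val << SWP_TYPE_BITS >> SWP_OFFSET_SHIFT;
}

/* What pmd_clear_uffd_wp() effectively does to a non-present pmd. */
static inline uint64_t clear_uffd_wp(uint64_t val)
{
	return val & ~(UINT64_C(1) << PAGE_BIT_UFFD_WP);
}

/* Index of a struct page within its (assumed aligned) huge page. */
static inline uint64_t tail_index(uint64_t page_ptr)
{
	return (page_ptr & (HPAGE_NR * STRUCT_PAGE_SIZE - 1)) /
	       STRUCT_PAGE_SIZE;
}
```

With a 512-aligned offset such as 0x123400, swp_offset(clear_uffd_wp(e))
gives 0x123402 while swp_type() is unchanged: bit 10 of the entry holds the
inverted bit 1 of the offset, so clearing it sets that offset bit.  Likewise
tail_index() returns 2 for both R08 values in the dumps, i.e. the second
tail page.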

I checked: 5.6.y doesn't have the offending commit, since it was not marked as
a "fixes" commit.  It is present in 5.7.y~5.13.y, and 5.14-rc1 has 8f34f1eac382,
which is the fix.  So I think we need the fix, or an equivalent, for
5.7.y~5.13.y.

5.12.y and 5.13.y can pick up the fix 8f34f1eac382 cleanly.  The older branches
(5.7.y~5.11.y) can't, so for those I plan to revert b569a1760782 instead.

> 
> Many thanks for reporting and helping, Igor.
> Hugh
> 
> p.s. Peter, unrelated to this particular bug, and should not divert from
> fixing it: but looking again at those swap encodings, and particularly
> the soft_dirty manipulations: they look very fragile. I think uffd_wp
> was wrong to follow that bad example, and your upcoming new encoding
> (that I have previously called elegant) takes it a worse step further.
> 
> I think we should change to a rule where the architecture-independent
> swp_entry_t contains *all* the info, including bits for soft_dirty and
> uffd_wp, so that swap entry cases can move immediately to decoding from
> arch-dependent pte to arch-independent swp_entry_t, and do all the
> manipulations on that. But I don't have time to make that change, and
> probably neither do you, and making the change is liable to introduce
> errors itself. So, no immediate plans, but please keep in mind.

Curious: have we previously hit a similar issue, where the soft dirty bit was
applied wrongly and caused hard-to-debug problems?

If this turns out to be the best solution, I can work on both of them.  I am
just worried that, as you said, it's too big a change, so we don't know which
approach is the most efficient once we count the total time spent developing,
reviewing and debugging them.

The alternative is that we simply fix bugs as they appear; I know that's cheap
to say, but we can't rule it out as an option yet.

We can definitely discuss this outside this thread, and I'll prepare the
backport first.  In any case, this bug is certainly a warning sign, and I'll
keep that in mind.

Please let me know if you have any comments on the backport plan above;
otherwise I'll prepare the patches for all the branches before tomorrow.

Thanks,

-- 
Peter Xu