linux-mm.kvack.org archive mirror
From: Hugh Dickins <hughd@google.com>
To: Peter Xu <peterx@redhat.com>
Cc: Igor Raits <igor@gooddata.com>, Hugh Dickins <hughd@google.com>,
	 Andrew Morton <akpm@linux-foundation.org>,
	Hillf Danton <hdanton@sina.com>,
	 Axel Rasmussen <axelrasmussen@google.com>,
	linux-mm@kvack.org
Subject: Re: kernel BUG at include/linux/swapops.h:204!
Date: Mon, 19 Jul 2021 12:11:21 -0700 (PDT)
Message-ID: <796cbb7-5a1c-1ba0-dde5-479aba8224f2@google.com>
In-Reply-To: <e9baeaa-b25b-4d9b-de5e-bae678e5e089@google.com>

Hi Peter,

I believe you have already fixed this, but the fix needs to go to stable.
Sorry, the messages below are a muddle of top and middle posting,
I'll resume at the bottom.

On Fri, 16 Jul 2021, Hugh Dickins wrote:
> On Thu, 15 Jul 2021, Igor Raits wrote:
> 
> > Hi everyone again,
> > 
> > I've been trying to reproduce this issue but still can't find a consistent
> > pattern.
> > 
> > However, it did happen once more and this time on 5.13.1:
> 
> Thanks for the updates, Igor.
> 
> I have to admit that what you have reported confirms the suspicion
> that it's a bug introduced by one of my "stable" patches in 5.12.14
> (which are also in 5.13): nothing else between 5.12.12 and 5.12.14
> seems likely to be relevant.
> 
> But I've gone back and forth and not been able to spot the problem.
> 
> Please would you send (either privately to me, or to the list) your
> 5.13.1 kernel's .config, and disassembly of pmd_migration_entry_wait()
> from its vmlinux (with line numbers if available; or just send the
> whole vmlinux if that's easier, and I'll disassemble).
> 
> I am hoping that the disassembly, together with the register contents
> that you've shown, will help guide towards an answer.
> 
> Thanks,
> Hugh
> 
> > 
> > [  222.068216] ------------[ cut here ]------------
> > [  222.072884] kernel BUG at include/linux/swapops.h:204!
> > [  222.078062] invalid opcode: 0000 [#1] SMP NOPTI
> > [  222.082618] CPU: 38 PID: 9828 Comm: rpc-worker Tainted: G            E
> >   5.13.1-1.gdc.el8.x86_64 #1
> > [  222.091894] Hardware name: HPE ProLiant DL380 Gen10/ProLiant DL380
> > Gen10, BIOS U30 05/24/2021
> > [  222.100468] RIP: 0010:pmd_migration_entry_wait+0x132/0x140
> > [  222.105994] Code: 02 00 00 00 5b 4c 89 c7 5d e9 ca c5 f6 ff 48 81 e2 00
> > f0 ff ff 48 f7 d2 48 21 c2 89 d1 f7 c2 81 01 00 00 75 80 e9 44 ff ff ff
> > <0f> 0b 48 8b 2d 65 48 30 01 e9 ef fe ff ff 0f 1f 44 00 00 41 55 48
> > [  222.124878] RSP: 0000:ffffbcfe9eb7bdd8 EFLAGS: 00010246
> > [  222.130134] RAX: 0057ffffc0000000 RBX: ffff9eec4b3cdbf8 RCX:
> > ffffffffffffffff
> > [  222.137309] RDX: 0000000000000000 RSI: ffff9eec4b3cdbf8 RDI:
> > ffffdf55c52cf368
> > [  222.144485] RBP: ffffdf55c52cf368 R08: ffffdf57428d8080 R09:
> > 0000000000000000
> > [  222.151661] R10: 0000000000000000 R11: 0000000000000000 R12:
> > 0000000000000bf8
> > [  222.158837] R13: 0400000000000000 R14: 0400000000000080 R15:
> > ffff9eec2825b1f8
> > [  222.166015] FS:  00007f6754aeb700(0000) GS:ffff9f49bfd00000(0000)
> > knlGS:0000000000000000
> > [  222.174153] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [  222.179932] CR2: 00007f676ffffd98 CR3: 000000012bf6a002 CR4:
> > 00000000007726e0
> > [  222.187109] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> > 0000000000000000
> > [  222.194283] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
> > 0000000000000400
> > [  222.201457] PKRU: 55555554
> > [  222.204178] Call Trace:
> > [  222.206638]  __handle_mm_fault+0x5ad/0x6e0
> > [  222.210760]  ? sysvec_call_function_single+0xb/0x90
> > [  222.215672]  handle_mm_fault+0xc5/0x290
> > [  222.219529]  do_user_addr_fault+0x1a9/0x660
> > [  222.223740]  ? sched_clock_cpu+0xc/0xa0
> > [  222.227602]  exc_page_fault+0x68/0x130
> > [  222.231373]  ? asm_exc_page_fault+0x8/0x30
> > [  222.235495]  asm_exc_page_fault+0x1e/0x30
> > [  222.239526] RIP: 0033:0x7f67baaed734
> > [  222.243120] Code: 89 08 48 8b 35 dd 3b 21 00 4c 8d 0d d6 3b 21 00 31 c0
> > 4c 39 ce 74 73 0f 1f 80 00 00 00 00 48 8d 96 40 fd ff ff 49 39 d2 74 22
> > <48> 8b 96 d8 03 00 00 48 01 15 4e 7c 21 00 80 be 50 03 00 00 00 c7
> > [  222.262002] RSP: 002b:00007f6754aea298 EFLAGS: 00010287
> > [  222.267257] RAX: 0000000000000000 RBX: 0000000000000000 RCX:
> > 0000000000000000
> > [  222.274432] RDX: 00007f676ffff700 RSI: 00007f676ffff9c0 RDI:
> > 00007f676f7fec10
> > [  222.281609] RBP: 0000000000000001 R08: 00007f676f7fed10 R09:
> > 00007f67bad012f0
> > [  222.288785] R10: 00007f6754aeb700 R11: 0000000000000202 R12:
> > 0000000000000001
> > [  222.295961] R13: 0000000000000006 R14: 0000000000000e28 R15:
> > 00007f674006e1f0
> > [  222.303137] Modules linked in: vhost_net(E) vhost(E) vhost_iotlb(E)
> > tap(E) tun(E) tcp_diag(E) udp_diag(E) inet_diag(E) netconsole(E)
> > nf_tables(E) vxlan(E) ip6_udp_tunnel(E) udp_tunnel(E) nfnetlink(E)
> > binfmt_misc(E) iscsi_tcp(E) libiscsi_tcp(E) 8021q(E) garp(E) mrp(E)
> > bonding(E) tls(E) vfat(E) fat(E) dm_service_time(E) dm_multipath(E)
> > rpcrdma(E) sunrpc(E) rdma_ucm(E) ib_srpt(E) ib_isert(E) iscsi_target_mod(E)
> > target_core_mod(E) ib_iser(E) rdma_cm(E) iw_cm(E) ib_cm(E) libiscsi(E)
> > intel_rapl_msr(E) intel_rapl_common(E) scsi_transport_iscsi(E)
> > isst_if_common(E) ipmi_ssif(E) nfit(E) libnvdimm(E) x86_pkg_temp_thermal(E)
> > intel_powerclamp(E) coretemp(E) kvm_intel(E) kvm(E) irqbypass(E)
> > crct10dif_pclmul(E) crc32_pclmul(E) ghash_clmulni_intel(E) rapl(E) qedr(E)
> > mei_me(E) acpi_ipmi(E) ib_uverbs(E) intel_cstate(E) ipmi_si(E) ib_core(E)
> > ipmi_devintf(E) dm_mod(E) ioatdma(E) ses(E) intel_uncore(E) pcspkr(E)
> > enclosure(E) mei(E) hpwdt(E) hpilo(E) lpc_ich(E) intel_pch_thermal(E)
> > dca(E) ipmi_msghandler(E)
> > [  222.303181]  acpi_power_meter(E) ext4(E) mbcache(E) jbd2(E) sd_mod(E)
> > t10_pi(E) sg(E) qedf(E) qede(E) libfcoe(E) qed(E) libfc(E) smartpqi(E)
> > scsi_transport_fc(E) tg3(E) scsi_transport_sas(E) crc8(E) wmi(E)
> > nf_conntrack(E) libcrc32c(E) crc32c_intel(E) nf_defrag_ipv6(E)
> > nf_defrag_ipv4(E) br_netfilter(E) bridge(E) stp(E) llc(E)
> > [  222.420050] ---[ end trace bcf7b6d1610cc21f ]---
> > [  222.572925] RIP: 0010:pmd_migration_entry_wait+0x132/0x140
> > [  222.578469] Code: 02 00 00 00 5b 4c 89 c7 5d e9 ca c5 f6 ff 48 81 e2 00
> > f0 ff ff 48 f7 d2 48 21 c2 89 d1 f7 c2 81 01 00 00 75 80 e9 44 ff ff ff
> > <0f> 0b 48 8b 2d 65 48 30 01 e9 ef fe ff ff 0f 1f 44 00 00 41 55 48
> > [  222.597359] RSP: 0000:ffffbcfe9eb7bdd8 EFLAGS: 00010246
> > [  222.602620] RAX: 0057ffffc0000000 RBX: ffff9eec4b3cdbf8 RCX:
> > ffffffffffffffff
> > [  222.609807] RDX: 0000000000000000 RSI: ffff9eec4b3cdbf8 RDI:
> > ffffdf55c52cf368
> > [  222.616990] RBP: ffffdf55c52cf368 R08: ffffdf57428d8080 R09:
> > 0000000000000000
> > [  222.624177] R10: 0000000000000000 R11: 0000000000000000 R12:
> > 0000000000000bf8
> > [  222.631361] R13: 0400000000000000 R14: 0400000000000080 R15:
> > ffff9eec2825b1f8
> > [  222.638548] FS:  00007f6754aeb700(0000) GS:ffff9f49bfd00000(0000)
> > knlGS:0000000000000000
> > [  222.646694] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [  222.652481] CR2: 00007f676ffffd98 CR3: 000000012bf6a002 CR4:
> > 00000000007726e0
> > [  222.659665] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> > 0000000000000000
> > [  222.666850] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
> > 0000000000000400
> > [  222.674031] PKRU: 55555554
> > [  222.676758] Kernel panic - not syncing: Fatal exception
> > [  222.817538] Kernel Offset: 0x16000000 from 0xffffffff81000000
> > (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
> > [  222.965540] ---[ end Kernel panic - not syncing: Fatal exception ]---
> > 
> > On Sun, Jul 11, 2021 at 8:06 AM Igor Raits <igor@gooddata.com> wrote:
> > 
> > > Hi Hugh,
> > >
> > > On Sun, Jul 11, 2021 at 6:17 AM Hugh Dickins <hughd@google.com> wrote:
> > >
> > >> On Sat, 10 Jul 2021, Igor Raits wrote:
> > >>
> > >> > Hello,
> > >> >
> > >> > I've seen one weird bug on 5.12.14 that happened a couple of times when I
> > >> > started a bunch of VMs on a server.
> > >>
> > >> Would it be possible for you to try the same on a 5.12.13 kernel?
> > >> Perhaps by reverting the diff between 5.12.13 and 5.12.14 temporarily.
> > >> Enough to form an impression of whether the issue is new in 5.12.14.
> > >>
> > >
> > > We've been using 5.12.12 for quite some time (~ a month) and I never saw
> > > it there.
> > >
> > > But I have to admit that I don't really have a reproducer. For example, on
> > > servers where it happened,
> > > I just rebooted them and panic did not happen anymore (so I saw it only
> > > once,
> > > only on 2 servers out of 32 that we have on 5.12.14).
> > >
> > >
> > >> I ask because 5.12.14 did include several fixes and cleanups from me
> > >> to page_vma_mapped_walk(), and that is involved in inserting and
> > >> removing pmd migration entries.  I am not aware of introducing any
> > >> bug there, but your report has got me worried.  If it's happening in
> > >> 5.12.14 but not in 5.12.13, then I must look again at my changes.
> > >>
> > >> I don't expect Hillf's patch to help at all: the pmd_lock()
> > >> is supposed to be taken by page_vma_mapped_walk(), before
> > >> set_pmd_migration_entry() and remove_migration_pmd() are called.
> > >>
> > >> Thanks,
> > >> Hugh
> > >>
> > >> >
> > >> > I've briefly googled this problem but could not find any relevant commit
> > >> > that would fix this issue.
> > >> >
> > >> > Do you have any hint how to debug this further or know the fix by any
> > >> > chance?
> > >> >
> > >> > Thanks in advance. Stack trace following:
> > >> >
> > >> > [  376.876610] ------------[ cut here ]------------
> > >> > [  376.881274] kernel BUG at include/linux/swapops.h:204!
> > >> > [  376.886455] invalid opcode: 0000 [#1] SMP NOPTI
> > >> > [  376.891014] CPU: 40 PID: 11775 Comm: rpc-worker Tainted: G
> > >>   E
> > >> >     5.12.14-1.gdc.el8.x86_64 #1
> > >> > [  376.900464] Hardware name: HPE ProLiant DL380 Gen10/ProLiant DL380
> > >> > Gen10, BIOS U30 05/24/2021
> > >> > [  376.909038] RIP: 0010:pmd_migration_entry_wait+0x132/0x140
> > >> > [  376.914562] Code: 02 00 00 00 5b 4c 89 c7 5d e9 8a e4 f6 ff 48 81 e2
> > >> 00
> > >> > f0 ff ff 48 f7 d2 48 21 c2 89 d1 f7 c2 81 01 00 00 75 80 e9 44 ff ff ff
> > >> > <0f> 0b 48 8b 2d 75 bd 30 01 e9 ef fe ff ff 0f 1f 44 00 00 41 55 48
> > >> > [  376.933443] RSP: 0000:ffffb65a5e1cfdc8 EFLAGS: 00010246
> > >> > [  376.938701] RAX: 0017ffffc0000000 RBX: ffff908b8ecabaf8 RCX:
> > >> > ffffffffffffffff
> > >> > [  376.945878] RDX: 0000000000000000 RSI: ffff908b8ecabaf8 RDI:
> > >> > fffff497473b2ae8
> > >> > [  376.953055] RBP: fffff497473b2ae8 R08: fffff49747fa8080 R09:
> > >> > 0000000000000000
> > >> > [  376.960230] R10: 0000000000000000 R11: 0000000000000000 R12:
> > >> > 0000000000000af8
> > >> > [  376.967407] R13: 0400000000000000 R14: 0400000000000080 R15:
> > >> > ffff908bbef7b6a8
> > >> > [  376.974582] FS:  00007f5bb1f81700(0000) GS:ffff90e87fd80000(0000)
> > >> > knlGS:0000000000000000
> > >> > [  376.982718] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > >> > [  376.988497] CR2: 00007f5b2bfffd98 CR3: 00000001f793e006 CR4:
> > >> > 00000000007726e0
> > >> > [  376.995673] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> > >> > 0000000000000000
> > >> > [  377.002849] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
> > >> > 0000000000000400
> > >> > [  377.010026] PKRU: 55555554
> > >> > [  377.012745] Call Trace:
> > >> > [  377.015207]  __handle_mm_fault+0x5ad/0x6e0
> > >> > [  377.019335]  handle_mm_fault+0xc5/0x290
> > >> > [  377.023194]  do_user_addr_fault+0x1cd/0x740
> > >> > [  377.027406]  exc_page_fault+0x54/0x110
> > >> > [  377.031182]  ? asm_exc_page_fault+0x8/0x30
> > >> > [  377.035307]  asm_exc_page_fault+0x1e/0x30
> > >> > [  377.039340] RIP: 0033:0x7f5bb91d6734
> > >> > [  377.042937] Code: 89 08 48 8b 35 dd 3b 21 00 4c 8d 0d d6 3b 21 00 31
> > >> c0
> > >> > 4c 39 ce 74 73 0f 1f 80 00 00 00 00 48 8d 96 40 fd ff ff 49 39 d2 74 22
> > >> > <48> 8b 96 d8 03 00 00 48 01 15 4e 7c 21 00 80 be 50 03 00 00 00 c7
> > >> > [  377.061820] RSP: 002b:00007f5bb1f7ff58 EFLAGS: 00010206
> > >> > [  377.067076] RAX: 0000000000000000 RBX: 0000000000000000 RCX:
> > >> > 00007f5ba0000020
> > >> > [  377.074255] RDX: 00007f5b2bfff700 RSI: 00007f5b2bfff9c0 RDI:
> > >> > 0000000000000001
> > >> > [  377.081429] RBP: 0000000000000001 R08: 0000000000000000 R09:
> > >> > 00007f5bb93ea2f0
> > >> > [  377.088606] R10: 00007f5bb1f81700 R11: 0000000000000202 R12:
> > >> > 0000000000000001
> > >> > [  377.095782] R13: 0000000000000006 R14: 0000000000000cb4 R15:
> > >> > 00007f5bb1f801f0
> > >> > [  377.102958] Modules linked in: ebt_arp(E) nft_meta_bridge(E)
> > >> > ip6_tables(E) xt_CT(E) nf_log_ipv4(E) nf_log_common(E) nft_limit(E)
> > >> > nft_counter(E) xt_LOG(E) xt_limit(E) xt_mac(E) xt_set(E) xt_multiport(E)
> > >> > xt_state(E) xt_conntrack(E) xt_comment(E) xt_physdev(E) nft_compat(E)
> > >> > ip_set_hash_net(E) ip_set(E) vhost_net(E) vhost(E) vhost_iotlb(E) tap(E)
> > >> > tun(E) tcp_diag(E) udp_diag(E) inet_diag(E) netconsole(E) nf_tables(E)
> > >> > vxlan(E) ip6_udp_tunnel(E) udp_tunnel(E) nfnetlink(E) binfmt_misc(E)
> > >> > iscsi_tcp(E) libiscsi_tcp(E) 8021q(E) garp(E) mrp(E) bonding(E) tls(E)
> > >> > vfat(E) fat(E) dm_service_time(E) dm_multipath(E) rpcrdma(E) sunrpc(E)
> > >> > rdma_ucm(E) ib_srpt(E) ib_isert(E) iscsi_target_mod(E)
> > >> target_core_mod(E)
> > >> > ib_iser(E) rdma_cm(E) iw_cm(E) ib_cm(E) libiscsi(E)
> > >> scsi_transport_iscsi(E)
> > >> > intel_rapl_msr(E) qedr(E) intel_rapl_common(E) ib_uverbs(E)
> > >> > isst_if_common(E) ib_core(E) nfit(E) libnvdimm(E)
> > >> x86_pkg_temp_thermal(E)
> > >> > intel_powerclamp(E) coretemp(E) kvm_intel(E) kvm(E) irqbypass(E)
> > >> > crct10dif_pclmul(E)
> > >> > [  377.102999]  crc32_pclmul(E) ghash_clmulni_intel(E) rapl(E)
> > >> > intel_cstate(E) ipmi_ssif(E) acpi_ipmi(E) ipmi_si(E) mei_me(E)
> > >> ioatdma(E)
> > >> > ipmi_devintf(E) dm_mod(E) ses(E) intel_uncore(E) pcspkr(E) qede(E)
> > >> > enclosure(E) tg3(E) mei(E) lpc_ich(E) hpilo(E) hpwdt(E)
> > >> > intel_pch_thermal(E) dca(E) ipmi_msghandler(E) acpi_power_meter(E)
> > >> ext4(E)
> > >> > mbcache(E) jbd2(E) sd_mod(E) t10_pi(E) sg(E) qedf(E) qed(E) crc8(E)
> > >> > libfcoe(E) libfc(E) smartpqi(E) scsi_transport_fc(E)
> > >> scsi_transport_sas(E)
> > >> > wmi(E) nf_conntrack(E) nf_defrag_ipv6(E) libcrc32c(E) crc32c_intel(E)
> > >> > nf_defrag_ipv4(E) br_netfilter(E) bridge(E) stp(E) llc(E)
> > >> > [  377.243468] ---[ end trace 04bce3bb051f7620 ]---
> > >> > [  377.385645] RIP: 0010:pmd_migration_entry_wait+0x132/0x140
> > >> > [  377.391194] Code: 02 00 00 00 5b 4c 89 c7 5d e9 8a e4 f6 ff 48 81 e2
> > >> 00
> > >> > f0 ff ff 48 f7 d2 48 21 c2 89 d1 f7 c2 81 01 00 00 75 80 e9 44 ff ff ff
> > >> > <0f> 0b 48 8b 2d 75 bd 30 01 e9 ef fe ff ff 0f 1f 44 00 00 41 55 48
> > >> > [  377.410091] RSP: 0000:ffffb65a5e1cfdc8 EFLAGS: 00010246
> > >> > [  377.415355] RAX: 0017ffffc0000000 RBX: ffff908b8ecabaf8 RCX:
> > >> > ffffffffffffffff
> > >> > [  377.422540] RDX: 0000000000000000 RSI: ffff908b8ecabaf8 RDI:
> > >> > fffff497473b2ae8
> > >> > [  377.429721] RBP: fffff497473b2ae8 R08: fffff49747fa8080 R09:
> > >> > 0000000000000000
> > >> > [  377.436902] R10: 0000000000000000 R11: 0000000000000000 R12:
> > >> > 0000000000000af8
> > >> > [  377.444086] R13: 0400000000000000 R14: 0400000000000080 R15:
> > >> > ffff908bbef7b6a8
> > >> > [  377.451272] FS:  00007f5bb1f81700(0000) GS:ffff90e87fd80000(0000)
> > >> > knlGS:0000000000000000
> > >> > [  377.459415] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > >> > [  377.465196] CR2: 00007f5b2bfffd98 CR3: 00000001f793e006 CR4:
> > >> > 00000000007726e0
> > >> > [  377.472377] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> > >> > 0000000000000000
> > >> > [  377.479556] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
> > >> > 0000000000000400
> > >> > [  377.486738] PKRU: 55555554
> > >> > [  377.489465] Kernel panic - not syncing: Fatal exception
> > >> > [  377.573911] Kernel Offset: 0xa000000 from 0xffffffff81000000
> > >> (relocation
> > >> > range: 0xffffffff80000000-0xffffffffbfffffff)
> > >> > [  377.716482] ---[ end Kernel panic - not syncing: Fatal exception ]---

Disassembly of the vmlinux Igor sent (along with other info) confirmed
something I suspected: that R08 (fffff49747fa8080 in one of the dumps,
ffffdf57428d8080 in the other) is the relevant struct page pointer
(and RAX holds the page->flags, which look as if it was pointing at a
good page).

A page pointer ....8080 in pmd_migration_entry_wait() is interesting:
normally I'd expect that to be ....0000 or ....8000, pointing to the
head of a huge page.  But instead it's pointing to the second tail
(though by now that compound page has been freed, and head pointers in
the tails reset to 0): as if the pfn has been incremented by 2 somehow.
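
The pointer arithmetic behind "second tail" can be sketched as below.
This is only an illustration, assuming sizeof(struct page) is 64 bytes
and that the vmemmap base is aligned to at least 512 * 64 = 0x8000
bytes (typical for x86_64, but neither is stated in the oops).
subpage_index() is a hypothetical helper, not a kernel function.

```c
#include <assert.h>
#include <stdint.h>

/* Assumptions (typical x86_64, not taken from the oops itself): */
#define STRUCT_PAGE_SIZE 64u   /* sizeof(struct page) */
#define HPAGE_PMD_NR     512u  /* 4K pages per 2M huge page */

/*
 * Hypothetical helper: index of a struct page within its naturally
 * aligned 512-page block, derived from the pointer value alone.
 * Assumes the vmemmap base is aligned to 512 * 64 = 0x8000 bytes.
 */
unsigned int subpage_index(uint64_t page_ptr)
{
	uint64_t block_bytes = (uint64_t)HPAGE_PMD_NR * STRUCT_PAGE_SIZE;

	/* A struct page pointer must itself be 64-byte aligned. */
	assert(page_ptr % STRUCT_PAGE_SIZE == 0);
	return (unsigned int)((page_ptr % block_bytes) / STRUCT_PAGE_SIZE);
}
```

Under those assumptions, both R08 values from the dumps give index 2
(head + 2), while a pointer ending ....0000 or ....8000 gives index 0.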

And if the pfn (swp_offset) in the migration entry has got corrupted,
then it's no surprise that when removing migration entries,
page_vma_mapped_walk() would see migration_entry_to_page(entry) != page,
so be unable to replace that migration entry, leaving it behind for the
user to hit BUG_ON(!PageLocked) in pmd_migration_entry_wait() when
faulting on it later.

So, what might increment the swp_offset by 2? Hunt around the encodings.
Hmm, _PAGE_BIT_UFFD_WP is _PAGE_BIT_SOFTW2, which is bit 10, whereas
_PAGE_BIT_PROTNONE (the top bit to be avoided in the pte encoding of
swap) is _PAGE_BIT_GLOBAL, which is bit 8. After overcoming off-by-one
confusions (the offset field starts at bit 9, so bit 10 carries offset
bit 1), it looks like if something somewhere were to set
_PAGE_BIT_UFFD_WP in a migration pmd (whereas it's only suitable for a
present pmd), it would indeed increment the swp_offset by 2.
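
The off-by-one arithmetic can be written out as a small sketch. The
bit numbers are the x86_64 ones named above; the assumptions are that
the offset field begins one bit above _PAGE_BIT_PROTNONE and that the
top 5 bits hold the swap type. offset_weight_of_pte_bit() is a
hypothetical helper for illustration only.

```c
#include <assert.h>
#include <stdint.h>

/* x86_64 bit numbers named in the message above. */
#define PAGE_BIT_PROTNONE     8   /* _PAGE_BIT_GLOBAL */
#define PAGE_BIT_UFFD_WP      10  /* _PAGE_BIT_SOFTW2 */
/* Assumption: the swap offset field starts just above PROTNONE. */
#define SWP_OFFSET_FIRST_BIT  (PAGE_BIT_PROTNONE + 1)
/* Assumption: bits 59-63 hold the swap type, not the offset. */
#define SWP_TYPE_FIRST_BIT    59

/* Hypothetical helper: weight of one pte/pmd bit within swp_offset. */
uint64_t offset_weight_of_pte_bit(unsigned int pte_bit)
{
	assert(pte_bit >= SWP_OFFSET_FIRST_BIT && pte_bit < SWP_TYPE_FIRST_BIT);
	return UINT64_C(1) << (pte_bit - SWP_OFFSET_FIRST_BIT);
}
```

With these definitions, offset_weight_of_pte_bit(PAGE_BIT_UFFD_WP) is
1 << (10 - 9) = 2, matching the pfn being off by 2.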

Hunt for uffd_wps, and run across copy_huge_pmd() in mm/huge_memory.c:
in Igor's 5.13.1 and 5.12.14 and many others, that says
	if (!(vma->vm_flags & VM_UFFD_WP))
		pmd = pmd_clear_uffd_wp(pmd);
just *before* checking is_swap_pmd(). Fixed in 5.14-rc1 in commit
8f34f1eac382 ("mm/userfaultfd: fix uffd-wp special cases for fork()").
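
Why the ordering matters can be shown with a toy model. This is not
the kernel's code nor the literal content of 8f34f1eac382: pmds are
plain 64-bit values, all toy_* names are hypothetical, and the use of
bit 2 as the swap-format uffd-wp marker is my understanding of the
x86_64 layout, labelled here as an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* x86_64 bit positions; a toy model, not the kernel's pmd_t. */
#define TOY_PRESENT     (UINT64_C(1) << 0)
#define TOY_SWP_UFFD_WP (UINT64_C(1) << 2)  /* assumed swap-format marker */
#define TOY_PSE         (UINT64_C(1) << 7)
#define TOY_PROTNONE    (UINT64_C(1) << 8)
#define TOY_UFFD_WP     (UINT64_C(1) << 10) /* present-format marker */

bool toy_pmd_present(uint64_t pmd)
{
	return (pmd & (TOY_PRESENT | TOY_PROTNONE | TOY_PSE)) != 0;
}

/* Ordering described above: clear bit 10 before checking the format. */
uint64_t toy_copy_pmd_before_fix(uint64_t pmd, bool vma_uffd_wp)
{
	if (!vma_uffd_wp)
		pmd &= ~TOY_UFFD_WP;	/* also hits swap-format entries */
	return pmd;
}

/* Format-aware ordering: pick the bit that matches the entry's format. */
uint64_t toy_copy_pmd_format_aware(uint64_t pmd, bool vma_uffd_wp)
{
	if (vma_uffd_wp)
		return pmd;
	if (!toy_pmd_present(pmd))
		return pmd & ~TOY_SWP_UFFD_WP;
	return pmd & ~TOY_UFFD_WP;
}

/* 0xC00 stands in for a swap-format pmd with offset-field bits 10, 11 set. */
void toy_copy_selfcheck(void)
{
	assert(toy_copy_pmd_before_fix(UINT64_C(0xC00), false) == UINT64_C(0x800));
	assert(toy_copy_pmd_format_aware(UINT64_C(0xC00), false) == UINT64_C(0xC00));
}
```

In the model, only the format-aware variant leaves the offset field of
a non-present entry untouched when the vma has no VM_UFFD_WP.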

But clearing the bit would be harmless, wouldn't it? Because it wouldn't
be set anyway. Waste a day before remembering what I never forgot but
somehow blanked out: the L1TF "feature" forced us to invert the offset
bits in the pte encoding of a swap entry, so there really is a bit set
there in the pmd entry, and clearing it has the effect of setting it in
the corresponding swap entry, so incrementing the migration pfn by 2.
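
A minimal sketch of that inversion, modelled on my reading of the
x86_64 swap entry macros (5 type bits at the top, inverted offset
starting at bit 9). The function names are hypothetical and the swap
type value passed in is arbitrary, chosen only for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define SWP_TYPE_BITS        5
#define SWP_OFFSET_FIRST_BIT 9   /* one above _PAGE_BIT_PROTNONE */
#define SWP_OFFSET_SHIFT     (SWP_OFFSET_FIRST_BIT + SWP_TYPE_BITS)
#define PAGE_UFFD_WP         (UINT64_C(1) << 10)

/* Encode: the offset is stored inverted (L1TF), type in the top bits. */
uint64_t swp_encode(unsigned int type, uint64_t offset)
{
	assert(type < (1u << SWP_TYPE_BITS));
	return ((~offset << SWP_OFFSET_SHIFT) >> SWP_TYPE_BITS) |
	       ((uint64_t)type << (64 - SWP_TYPE_BITS));
}

/* Decode: invert again, strip the type bits, shift the offset down. */
uint64_t swp_decode_offset(uint64_t val)
{
	return (~val << SWP_TYPE_BITS) >> SWP_OFFSET_SHIFT;
}
```

For a 512-aligned pfn such as 0x200, the encoded value has bit 10 set;
decoding swp_encode(30, 0x200) & ~PAGE_UFFD_WP yields 0x202, i.e. the
pfn plus 2.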

I cannot explain why Igor never saw this crash on 5.12.12: maybe
something else in the environment changed around that time.  And it
will take several days for it to be confirmed as the fix in practice.

But I'm confident that 8f34f1eac382 will prove to be the fix, so Peter
please prepare some backports of that for the various stable/longterm
kernels that need it - I've not looked into whether it applies cleanly,
or depends on other commits too.  You fixed several related but different
things in that commit: but this one is the worst, because it can corrupt
even those who are not using UFFD_WP at all.

Many thanks for reporting and helping, Igor.
Hugh

p.s. Peter, unrelated to this particular bug, and should not divert from
fixing it: but looking again at those swap encodings, and particularly
the soft_dirty manipulations: they look very fragile. I think uffd_wp
was wrong to follow that bad example, and your upcoming new encoding
(that I have previously called elegant) takes it a worse step further.

I think we should change to a rule where the architecture-independent
swp_entry_t contains *all* the info, including bits for soft_dirty and
uffd_wp, so that swap entry cases can move immediately to decoding from
arch-dependent pte to arch-independent swp_entry_t, and do all the
manipulations on that. But I don't have time to make that change, and
probably neither do you, and making the change is liable to introduce
errors itself. So, no immediate plans, but please keep in mind.


Thread overview: 24+ messages
2021-07-10  7:33 Igor Raits
2021-07-10 12:46 ` Hillf Danton
2021-07-11  4:17 ` Hugh Dickins
2021-07-11  6:06   ` Igor Raits
2021-07-15 17:47     ` Igor Raits
2021-07-16 19:45       ` Hugh Dickins
2021-07-19 19:11         ` Hugh Dickins [this message]
2021-07-19 22:12           ` Peter Xu
2021-07-19 22:42             ` Hugh Dickins
2021-07-20  0:34               ` Peter Xu
2021-07-20  3:31                 ` Hugh Dickins
2021-07-20  7:47             ` Igor Raits
2021-07-20 16:01               ` Peter Xu
2021-07-20 16:05                 ` Igor Raits
2021-07-20 15:51           ` [PATCH stable 5.13.y/5.12.y 0/2] mm/thp: Fix uffd-wp with fork(); crash on pmd migration entry on fork Peter Xu
2021-07-20 15:51             ` [PATCH stable 5.13.y/5.12.y 1/2] mm/thp: simplify copying of huge zero page pmd when fork Peter Xu
2021-07-20 15:51             ` [PATCH stable 5.13.y/5.12.y 2/2] mm/userfaultfd: fix uffd-wp special cases for fork() Peter Xu
2021-07-20 20:32             ` [PATCH stable 5.13.y/5.12.y 0/2] mm/thp: Fix uffd-wp with fork(); crash on pmd migration entry on fork Hugh Dickins
2021-07-22 14:02               ` Greg KH
2021-07-20 15:56           ` [PATCH stable 5.10.y " Peter Xu
2021-07-20 15:56             ` [PATCH stable 5.10.y 1/2] mm/thp: simplify copying of huge zero page pmd when fork Peter Xu
2021-07-20 15:56             ` [PATCH stable 5.10.y 2/2] mm/userfaultfd: fix uffd-wp special cases for fork() Peter Xu
2021-07-20 20:38             ` [PATCH stable 5.10.y 0/2] mm/thp: Fix uffd-wp with fork(); crash on pmd migration entry on fork Hugh Dickins
2021-07-22 14:05               ` Greg KH
