From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-2.7 required=3.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS,HTML_MESSAGE, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,T_KAM_HTML_FONT_INVALID, URIBL_BLOCKED autolearn=no autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 90C4BC636C8 for ; Tue, 20 Jul 2021 07:47:17 +0000 (UTC) Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by mail.kernel.org (Postfix) with ESMTP id D712661165 for ; Tue, 20 Jul 2021 07:47:16 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org D712661165 Authentication-Results: mail.kernel.org; dmarc=fail (p=none dis=none) header.from=gooddata.com Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=owner-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix) id 802206B00A5; Tue, 20 Jul 2021 03:47:17 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 7893A6B00A6; Tue, 20 Jul 2021 03:47:17 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 58C2D8D0001; Tue, 20 Jul 2021 03:47:17 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0084.hostedemail.com [216.40.44.84]) by kanga.kvack.org (Postfix) with ESMTP id 0CB166B00A5 for ; Tue, 20 Jul 2021 03:47:17 -0400 (EDT) Received: from smtpin24.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay03.hostedemail.com (Postfix) with ESMTP id 56FC08248047 for ; Tue, 20 Jul 2021 07:47:15 +0000 (UTC) X-FDA: 78382185630.24.47AB3D0 Received: from mail-ed1-f54.google.com (mail-ed1-f54.google.com [209.85.208.54]) by imf27.hostedemail.com (Postfix) with ESMTP id C47B170009E1 for ; Tue, 20 Jul 2021 07:47:14 +0000 
(UTC) Received: by mail-ed1-f54.google.com with SMTP id t3so27290880edc.7 for ; Tue, 20 Jul 2021 00:47:14 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gooddata.com; s=google; h=mime-version:references:in-reply-to:from:date:message-id:subject:to :cc; bh=yAITm9wG6OnbEvxwA4te0BUTZURBLqIslcQUxDmDJxo=; b=mNp3PtZQhh4fZelAHQGx8zmUcpVhLjBav18N4YKqDfvle6C2M6paUX4gppEcxj89ty 7KhG7zgA5cQbfv9tPgOQC9xRMLSGgxvFBL+F6NblXAXkNynhb8MIKUAOalplsiValXVU hTgDvFqv/wdXE/qX4UtWOpWt2kxPsIN8jME4U= X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc; bh=yAITm9wG6OnbEvxwA4te0BUTZURBLqIslcQUxDmDJxo=; b=bWHl5iJQINMRDs8ejh0UshSgaaqLfUMUCHqZsDR/jw9U/DW0f+Biou/yBRtvHId5Rq kW6RmHDqUdyLM0DBrcZclXp8sq0cxcIvqmAEqrRxdBHWzFSeDUTP0l0c4wr424tdiD8O GEg+uHw25MqoawKctT3NpjZU41caSWEhCMk9Hm0aNAOkb1xt+5lKqmh5SiJRAtwOCq5j VPE4J6PSasTJL8tNm9KQyPUA5TUf4ThL+IOmNBq5C1W4yNoeWsCXNkzCTDtQzT3ucyCM HEFPDB8RV+x8eZejcKkWjK+bafFpoyR73oHEH51e15s2t6WkVkhaHb+uXp3E4/0dSYw1 Zi2A== X-Gm-Message-State: AOAM531PFXobr0gY+VCvPgeOqdRFiXZwVrOGtQuvaJpc73qd3saBjQc9 x5efpiIi8a3QgdO8Jje01JTaeaWmmJLSSunS0jJRyA== X-Google-Smtp-Source: ABdhPJyjwZGIL7Wl3CP0dEAJZeFG9S4DkELrYkg9THNlQUKBsfpbGqm1fFPPRso/ji6S+t3YRM8lNBPBwZ4XqCUG6M8= X-Received: by 2002:aa7:db54:: with SMTP id n20mr39214688edt.21.1626767233338; Tue, 20 Jul 2021 00:47:13 -0700 (PDT) MIME-Version: 1.0 References: <4c9e24db-29d5-5bbb-17ae-8dc32ceb66ed@google.com> <796cbb7-5a1c-1ba0-dde5-479aba8224f2@google.com> In-Reply-To: From: Igor Raits Date: Tue, 20 Jul 2021 09:47:02 +0200 Message-ID: Subject: Re: kernel BUG at include/linux/swapops.h:204! 
To: Peter Xu Cc: Hugh Dickins , Andrew Morton , Hillf Danton , Axel Rasmussen , linux-mm@kvack.org Content-Type: multipart/alternative; boundary="00000000000013642705c7894307" Authentication-Results: imf27.hostedemail.com; dkim=pass header.d=gooddata.com header.s=google header.b=mNp3PtZQ; spf=pass (imf27.hostedemail.com: domain of igor.raits@gooddata.com designates 209.85.208.54 as permitted sender) smtp.mailfrom=igor.raits@gooddata.com; dmarc=pass (policy=none) header.from=gooddata.com X-Rspamd-Server: rspam03 X-Rspamd-Queue-Id: C47B170009E1 X-Stat-Signature: xeweznaweq4afxz91rpbsof3fttkn4gw X-HE-Tag: 1626767234-176871 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: --00000000000013642705c7894307 Content-Type: text/plain; charset="UTF-8" On Tue, Jul 20, 2021 at 12:13 AM Peter Xu wrote: > On Mon, Jul 19, 2021 at 12:11:21PM -0700, Hugh Dickins wrote: > > Hi Peter, > > Hi, Hugh, > > > > > I believe you have already fixed this, but the fix needs to go to stable. > > Sorry, the messages below are a muddle of top and middle posting, > > I'll resume at the bottom. > > > > On Fri, 16 Jul 2021, Hugh Dickins wrote: > > > On Thu, 15 Jul 2021, Igor Raits wrote: > > > > > > > Hi everyone again, > > > > > > > > I've been trying to reproduce this issue but still can't find a > consistent > > > > pattern. > > > > > > > > However, it did happen once more and this time on 5.13.1: > > > > > > Thanks for the updates, Igor. > > > > > > I have to admit that what you have reported confirms the suspicion > > > that it's a bug introduced by one of my "stable" patches in 5.12.14 > > > (which are also in 5.13): nothing else between 5.12.12 and 5.12.14 > > > seems likely to be relevant. > > > > > > But I've gone back and forth and not been able to spot the problem. 
> > > > > > Please would you send (either privately to me, or to the list) your > > > 5.13.1 kernel's .config, and disassembly of pmd_migration_entry_wait() > > > from its vmlinux (with line numbers if available; or just send the > > > whole vmlinux if that's easier, and I'll disassemble). > > > > > > I am hoping that the disassembly, together with the register contents > > > that you've shown, will help guide towards an answer. > > > > > > Thanks, > > > Hugh > > > > > > > > > > > [ 222.068216] ------------[ cut here ]------------ > > > > [ 222.072884] kernel BUG at include/linux/swapops.h:204! > > > > [ 222.078062] invalid opcode: 0000 [#1] SMP NOPTI > > > > [ 222.082618] CPU: 38 PID: 9828 Comm: rpc-worker Tainted: G > E > > > > 5.13.1-1.gdc.el8.x86_64 #1 > > > > [ 222.091894] Hardware name: HPE ProLiant DL380 Gen10/ProLiant DL380 > > > > Gen10, BIOS U30 05/24/2021 > > > > [ 222.100468] RIP: 0010:pmd_migration_entry_wait+0x132/0x140 > > > > [ 222.105994] Code: 02 00 00 00 5b 4c 89 c7 5d e9 ca c5 f6 ff 48 81 > e2 00 > > > > f0 ff ff 48 f7 d2 48 21 c2 89 d1 f7 c2 81 01 00 00 75 80 e9 44 ff ff > ff > > > > <0f> 0b 48 8b 2d 65 48 30 01 e9 ef fe ff ff 0f 1f 44 00 00 41 55 48 > > > > [ 222.124878] RSP: 0000:ffffbcfe9eb7bdd8 EFLAGS: 00010246 > > > > [ 222.130134] RAX: 0057ffffc0000000 RBX: ffff9eec4b3cdbf8 RCX: > > > > ffffffffffffffff > > > > [ 222.137309] RDX: 0000000000000000 RSI: ffff9eec4b3cdbf8 RDI: > > > > ffffdf55c52cf368 > > > > [ 222.144485] RBP: ffffdf55c52cf368 R08: ffffdf57428d8080 R09: > > > > 0000000000000000 > > > > [ 222.151661] R10: 0000000000000000 R11: 0000000000000000 R12: > > > > 0000000000000bf8 > > > > [ 222.158837] R13: 0400000000000000 R14: 0400000000000080 R15: > > > > ffff9eec2825b1f8 > > > > [ 222.166015] FS: 00007f6754aeb700(0000) GS:ffff9f49bfd00000(0000) > > > > knlGS:0000000000000000 > > > > [ 222.174153] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > > > > [ 222.179932] CR2: 00007f676ffffd98 CR3: 000000012bf6a002 CR4: > > > > 
00000000007726e0 > > > > [ 222.187109] DR0: 0000000000000000 DR1: 0000000000000000 DR2: > > > > 0000000000000000 > > > > [ 222.194283] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: > > > > 0000000000000400 > > > > [ 222.201457] PKRU: 55555554 > > > > [ 222.204178] Call Trace: > > > > [ 222.206638] __handle_mm_fault+0x5ad/0x6e0 > > > > [ 222.210760] ? sysvec_call_function_single+0xb/0x90 > > > > [ 222.215672] handle_mm_fault+0xc5/0x290 > > > > [ 222.219529] do_user_addr_fault+0x1a9/0x660 > > > > [ 222.223740] ? sched_clock_cpu+0xc/0xa0 > > > > [ 222.227602] exc_page_fault+0x68/0x130 > > > > [ 222.231373] ? asm_exc_page_fault+0x8/0x30 > > > > [ 222.235495] asm_exc_page_fault+0x1e/0x30 > > > > [ 222.239526] RIP: 0033:0x7f67baaed734 > > > > [ 222.243120] Code: 89 08 48 8b 35 dd 3b 21 00 4c 8d 0d d6 3b 21 00 > 31 c0 > > > > 4c 39 ce 74 73 0f 1f 80 00 00 00 00 48 8d 96 40 fd ff ff 49 39 d2 74 > 22 > > > > <48> 8b 96 d8 03 00 00 48 01 15 4e 7c 21 00 80 be 50 03 00 00 00 c7 > > > > [ 222.262002] RSP: 002b:00007f6754aea298 EFLAGS: 00010287 > > > > [ 222.267257] RAX: 0000000000000000 RBX: 0000000000000000 RCX: > > > > 0000000000000000 > > > > [ 222.274432] RDX: 00007f676ffff700 RSI: 00007f676ffff9c0 RDI: > > > > 00007f676f7fec10 > > > > [ 222.281609] RBP: 0000000000000001 R08: 00007f676f7fed10 R09: > > > > 00007f67bad012f0 > > > > [ 222.288785] R10: 00007f6754aeb700 R11: 0000000000000202 R12: > > > > 0000000000000001 > > > > [ 222.295961] R13: 0000000000000006 R14: 0000000000000e28 R15: > > > > 00007f674006e1f0 > > > > [ 222.303137] Modules linked in: vhost_net(E) vhost(E) > vhost_iotlb(E) > > > > tap(E) tun(E) tcp_diag(E) udp_diag(E) inet_diag(E) netconsole(E) > > > > nf_tables(E) vxlan(E) ip6_udp_tunnel(E) udp_tunnel(E) nfnetlink(E) > > > > binfmt_misc(E) iscsi_tcp(E) libiscsi_tcp(E) 8021q(E) garp(E) mrp(E) > > > > bonding(E) tls(E) vfat(E) fat(E) dm_service_time(E) dm_multipath(E) > > > > rpcrdma(E) sunrpc(E) rdma_ucm(E) ib_srpt(E) ib_isert(E) > iscsi_target_mod(E) > > 
> > target_core_mod(E) ib_iser(E) rdma_cm(E) iw_cm(E) ib_cm(E) > libiscsi(E) > > > > intel_rapl_msr(E) intel_rapl_common(E) scsi_transport_iscsi(E) > > > > isst_if_common(E) ipmi_ssif(E) nfit(E) libnvdimm(E) > x86_pkg_temp_thermal(E) > > > > intel_powerclamp(E) coretemp(E) kvm_intel(E) kvm(E) irqbypass(E) > > > > crct10dif_pclmul(E) crc32_pclmul(E) ghash_clmulni_intel(E) rapl(E) > qedr(E) > > > > mei_me(E) acpi_ipmi(E) ib_uverbs(E) intel_cstate(E) ipmi_si(E) > ib_core(E) > > > > ipmi_devintf(E) dm_mod(E) ioatdma(E) ses(E) intel_uncore(E) pcspkr(E) > > > > enclosure(E) mei(E) hpwdt(E) hpilo(E) lpc_ich(E) intel_pch_thermal(E) > > > > dca(E) ipmi_msghandler(E) > > > > [ 222.303181] acpi_power_meter(E) ext4(E) mbcache(E) jbd2(E) > sd_mod(E) > > > > t10_pi(E) sg(E) qedf(E) qede(E) libfcoe(E) qed(E) libfc(E) > smartpqi(E) > > > > scsi_transport_fc(E) tg3(E) scsi_transport_sas(E) crc8(E) wmi(E) > > > > nf_conntrack(E) libcrc32c(E) crc32c_intel(E) nf_defrag_ipv6(E) > > > > nf_defrag_ipv4(E) br_netfilter(E) bridge(E) stp(E) llc(E) > > > > [ 222.420050] ---[ end trace bcf7b6d1610cc21f ]--- > > > > [ 222.572925] RIP: 0010:pmd_migration_entry_wait+0x132/0x140 > > > > [ 222.578469] Code: 02 00 00 00 5b 4c 89 c7 5d e9 ca c5 f6 ff 48 81 > e2 00 > > > > f0 ff ff 48 f7 d2 48 21 c2 89 d1 f7 c2 81 01 00 00 75 80 e9 44 ff ff > ff > > > > <0f> 0b 48 8b 2d 65 48 30 01 e9 ef fe ff ff 0f 1f 44 00 00 41 55 48 > > > > [ 222.597359] RSP: 0000:ffffbcfe9eb7bdd8 EFLAGS: 00010246 > > > > [ 222.602620] RAX: 0057ffffc0000000 RBX: ffff9eec4b3cdbf8 RCX: > > > > ffffffffffffffff > > > > [ 222.609807] RDX: 0000000000000000 RSI: ffff9eec4b3cdbf8 RDI: > > > > ffffdf55c52cf368 > > > > [ 222.616990] RBP: ffffdf55c52cf368 R08: ffffdf57428d8080 R09: > > > > 0000000000000000 > > > > [ 222.624177] R10: 0000000000000000 R11: 0000000000000000 R12: > > > > 0000000000000bf8 > > > > [ 222.631361] R13: 0400000000000000 R14: 0400000000000080 R15: > > > > ffff9eec2825b1f8 > > > > [ 222.638548] FS: 
00007f6754aeb700(0000) GS:ffff9f49bfd00000(0000) > > > > knlGS:0000000000000000 > > > > [ 222.646694] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > > > > [ 222.652481] CR2: 00007f676ffffd98 CR3: 000000012bf6a002 CR4: > > > > 00000000007726e0 > > > > [ 222.659665] DR0: 0000000000000000 DR1: 0000000000000000 DR2: > > > > 0000000000000000 > > > > [ 222.666850] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: > > > > 0000000000000400 > > > > [ 222.674031] PKRU: 55555554 > > > > [ 222.676758] Kernel panic - not syncing: Fatal exception > > > > [ 222.817538] Kernel Offset: 0x16000000 from 0xffffffff81000000 > > > > (relocation range: 0xffffffff80000000-0xffffffffbfffffff) > > > > [ 222.965540] ---[ end Kernel panic - not syncing: Fatal exception > ]--- > > > > > > > > On Sun, Jul 11, 2021 at 8:06 AM Igor Raits > wrote: > > > > > > > > > Hi Hugh, > > > > > > > > > > On Sun, Jul 11, 2021 at 6:17 AM Hugh Dickins > wrote: > > > > > > > > > >> On Sat, 10 Jul 2021, Igor Raits wrote: > > > > >> > > > > >> > Hello, > > > > >> > > > > > >> > I've seen one weird bug on 5.12.14 that happened a couple of > times when > > > > >> I > > > > >> > started a bunch of VMs on a server. > > > > >> > > > > >> Would it be possible for you to try the same on a 5.12.13 kernel? > > > > >> Perhaps by reverting the diff between 5.12.13 and 5.12.14 > temporarily. > > > > >> Enough to form an impression of whether the issue is new in > 5.12.14. > > > > >> > > > > > > > > > > We've been using 5.12.12 for quite some time (~ a month) and I > never saw > > > > > it there. > > > > > > > > > > But I have to admit that I don't really have a reproducer. For > example, on > > > > > servers where it happened, > > > > > I just rebooted them and panic did not happen anymore (so I saw it > only > > > > > only once, > > > > > only on 2 servers out of 32 that we have on 5.12.14). 
> > > > > > > > > > > > > > >> I ask because 5.12.14 did include several fixes and cleanups from > me > > > > >> to page_vma_mapped_walk(), and that is involved in inserting and > > > > >> removing pmd migration entries. I am not aware of introducing any > > > > >> bug there, but your report has got me worried. If it's happening > in > > > > >> 5.12.14 but not in 5.12.13, then I must look again at my changes. > > > > >> > > > > >> I don't expect Hillf's patch to help at at all: the pmd_lock() > > > > >> is supposed to be taken by page_vma_mapped_walk(), before > > > > >> set_pmd_migration_entry() and remove_migration_pmd() are called. > > > > >> > > > > >> Thanks, > > > > >> Hugh > > > > >> > > > > >> > > > > > >> > I've briefly googled this problem but could not find any > relevant commit > > > > >> > that would fix this issue. > > > > >> > > > > > >> > Do you have any hint how to debug this further or know the fix > by any > > > > >> > chance? > > > > >> > > > > > >> > Thanks in advance. Stack trace following: > > > > >> > > > > > >> > [ 376.876610] ------------[ cut here ]------------ > > > > >> > [ 376.881274] kernel BUG at include/linux/swapops.h:204! 
> > > > >> > [ 376.886455] invalid opcode: 0000 [#1] SMP NOPTI > > > > >> > [ 376.891014] CPU: 40 PID: 11775 Comm: rpc-worker Tainted: G > > > > >> E > > > > >> > 5.12.14-1.gdc.el8.x86_64 #1 > > > > >> > [ 376.900464] Hardware name: HPE ProLiant DL380 Gen10/ProLiant > DL380 > > > > >> > Gen10, BIOS U30 05/24/2021 > > > > >> > [ 376.909038] RIP: 0010:pmd_migration_entry_wait+0x132/0x140 > > > > >> > [ 376.914562] Code: 02 00 00 00 5b 4c 89 c7 5d e9 8a e4 f6 ff > 48 81 e2 > > > > >> 00 > > > > >> > f0 ff ff 48 f7 d2 48 21 c2 89 d1 f7 c2 81 01 00 00 75 80 e9 44 > ff ff ff > > > > >> > <0f> 0b 48 8b 2d 75 bd 30 01 e9 ef fe ff ff 0f 1f 44 00 00 41 > 55 48 > > > > >> > [ 376.933443] RSP: 0000:ffffb65a5e1cfdc8 EFLAGS: 00010246 > > > > >> > [ 376.938701] RAX: 0017ffffc0000000 RBX: ffff908b8ecabaf8 RCX: > > > > >> > ffffffffffffffff > > > > >> > [ 376.945878] RDX: 0000000000000000 RSI: ffff908b8ecabaf8 RDI: > > > > >> > fffff497473b2ae8 > > > > >> > [ 376.953055] RBP: fffff497473b2ae8 R08: fffff49747fa8080 R09: > > > > >> > 0000000000000000 > > > > >> > [ 376.960230] R10: 0000000000000000 R11: 0000000000000000 R12: > > > > >> > 0000000000000af8 > > > > >> > [ 376.967407] R13: 0400000000000000 R14: 0400000000000080 R15: > > > > >> > ffff908bbef7b6a8 > > > > >> > [ 376.974582] FS: 00007f5bb1f81700(0000) > GS:ffff90e87fd80000(0000) > > > > >> > knlGS:0000000000000000 > > > > >> > [ 376.982718] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > > > > >> > [ 376.988497] CR2: 00007f5b2bfffd98 CR3: 00000001f793e006 CR4: > > > > >> > 00000000007726e0 > > > > >> > [ 376.995673] DR0: 0000000000000000 DR1: 0000000000000000 DR2: > > > > >> > 0000000000000000 > > > > >> > [ 377.002849] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: > > > > >> > 0000000000000400 > > > > >> > [ 377.010026] PKRU: 55555554 > > > > >> > [ 377.012745] Call Trace: > > > > >> > [ 377.015207] __handle_mm_fault+0x5ad/0x6e0 > > > > >> > [ 377.019335] handle_mm_fault+0xc5/0x290 > > > > >> > [ 377.023194] 
do_user_addr_fault+0x1cd/0x740 > > > > >> > [ 377.027406] exc_page_fault+0x54/0x110 > > > > >> > [ 377.031182] ? asm_exc_page_fault+0x8/0x30 > > > > >> > [ 377.035307] asm_exc_page_fault+0x1e/0x30 > > > > >> > [ 377.039340] RIP: 0033:0x7f5bb91d6734 > > > > >> > [ 377.042937] Code: 89 08 48 8b 35 dd 3b 21 00 4c 8d 0d d6 3b > 21 00 31 > > > > >> c0 > > > > >> > 4c 39 ce 74 73 0f 1f 80 00 00 00 00 48 8d 96 40 fd ff ff 49 39 > d2 74 22 > > > > >> > <48> 8b 96 d8 03 00 00 48 01 15 4e 7c 21 00 80 be 50 03 00 00 > 00 c7 > > > > >> > [ 377.061820] RSP: 002b:00007f5bb1f7ff58 EFLAGS: 00010206 > > > > >> > [ 377.067076] RAX: 0000000000000000 RBX: 0000000000000000 RCX: > > > > >> > 00007f5ba0000020 > > > > >> > [ 377.074255] RDX: 00007f5b2bfff700 RSI: 00007f5b2bfff9c0 RDI: > > > > >> > 0000000000000001 > > > > >> > [ 377.081429] RBP: 0000000000000001 R08: 0000000000000000 R09: > > > > >> > 00007f5bb93ea2f0 > > > > >> > [ 377.088606] R10: 00007f5bb1f81700 R11: 0000000000000202 R12: > > > > >> > 0000000000000001 > > > > >> > [ 377.095782] R13: 0000000000000006 R14: 0000000000000cb4 R15: > > > > >> > 00007f5bb1f801f0 > > > > >> > [ 377.102958] Modules linked in: ebt_arp(E) nft_meta_bridge(E) > > > > >> > ip6_tables(E) xt_CT(E) nf_log_ipv4(E) nf_log_common(E) > nft_limit(E) > > > > >> > nft_counter(E) xt_LOG(E) xt_limit(E) xt_mac(E) xt_set(E) > xt_multiport(E) > > > > >> > xt_state(E) xt_conntrack(E) xt_comment(E) xt_physdev(E) > nft_compat(E) > > > > >> > ip_set_hash_net(E) ip_set(E) vhost_net(E) vhost(E) > vhost_iotlb(E) tap(E) > > > > >> > tun(E) tcp_diag(E) udp_diag(E) inet_diag(E) netconsole(E) > nf_tables(E) > > > > >> > vxlan(E) ip6_udp_tunnel(E) udp_tunnel(E) nfnetlink(E) > binfmt_misc(E) > > > > >> > iscsi_tcp(E) libiscsi_tcp(E) 8021q(E) garp(E) mrp(E) bonding(E) > tls(E) > > > > >> > vfat(E) fat(E) dm_service_time(E) dm_multipath(E) rpcrdma(E) > sunrpc(E) > > > > >> > rdma_ucm(E) ib_srpt(E) ib_isert(E) iscsi_target_mod(E) > > > > >> target_core_mod(E) > > > > >> > 
ib_iser(E) rdma_cm(E) iw_cm(E) ib_cm(E) libiscsi(E) > > > > >> scsi_transport_iscsi(E) > > > > >> > intel_rapl_msr(E) qedr(E) intel_rapl_common(E) ib_uverbs(E) > > > > >> > isst_if_common(E) ib_core(E) nfit(E) libnvdimm(E) > > > > >> x86_pkg_temp_thermal(E) > > > > >> > intel_powerclamp(E) coretemp(E) kvm_intel(E) kvm(E) irqbypass(E) > > > > >> > crct10dif_pclmul(E) > > > > >> > [ 377.102999] crc32_pclmul(E) ghash_clmulni_intel(E) rapl(E) > > > > >> > intel_cstate(E) ipmi_ssif(E) acpi_ipmi(E) ipmi_si(E) mei_me(E) > > > > >> ioatdma(E) > > > > >> > ipmi_devintf(E) dm_mod(E) ses(E) intel_uncore(E) pcspkr(E) > qede(E) > > > > >> > enclosure(E) tg3(E) mei(E) lpc_ich(E) hpilo(E) hpwdt(E) > > > > >> > intel_pch_thermal(E) dca(E) ipmi_msghandler(E) > acpi_power_meter(E) > > > > >> ext4(E) > > > > >> > mbcache(E) jbd2(E) sd_mod(E) t10_pi(E) sg(E) qedf(E) qed(E) > crc8(E) > > > > >> > libfcoe(E) libfc(E) smartpqi(E) scsi_transport_fc(E) > > > > >> scsi_transport_sas(E) > > > > >> > wmi(E) nf_conntrack(E) nf_defrag_ipv6(E) libcrc32c(E) > crc32c_intel(E) > > > > >> > nf_defrag_ipv4(E) br_netfilter(E) bridge(E) stp(E) llc(E) > > > > >> > [ 377.243468] ---[ end trace 04bce3bb051f7620 ]--- > > > > >> > [ 377.385645] RIP: 0010:pmd_migration_entry_wait+0x132/0x140 > > > > >> > [ 377.391194] Code: 02 00 00 00 5b 4c 89 c7 5d e9 8a e4 f6 ff > 48 81 e2 > > > > >> 00 > > > > >> > f0 ff ff 48 f7 d2 48 21 c2 89 d1 f7 c2 81 01 00 00 75 80 e9 44 > ff ff ff > > > > >> > <0f> 0b 48 8b 2d 75 bd 30 01 e9 ef fe ff ff 0f 1f 44 00 00 41 > 55 48 > > > > >> > [ 377.410091] RSP: 0000:ffffb65a5e1cfdc8 EFLAGS: 00010246 > > > > >> > [ 377.415355] RAX: 0017ffffc0000000 RBX: ffff908b8ecabaf8 RCX: > > > > >> > ffffffffffffffff > > > > >> > [ 377.422540] RDX: 0000000000000000 RSI: ffff908b8ecabaf8 RDI: > > > > >> > fffff497473b2ae8 > > > > >> > [ 377.429721] RBP: fffff497473b2ae8 R08: fffff49747fa8080 R09: > > > > >> > 0000000000000000 > > > > >> > [ 377.436902] R10: 0000000000000000 R11: 0000000000000000 
R12: > > > > >> > 0000000000000af8 > > > > >> > [ 377.444086] R13: 0400000000000000 R14: 0400000000000080 R15: > > > > >> > ffff908bbef7b6a8 > > > > >> > [ 377.451272] FS: 00007f5bb1f81700(0000) > GS:ffff90e87fd80000(0000) > > > > >> > knlGS:0000000000000000 > > > > >> > [ 377.459415] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > > > > >> > [ 377.465196] CR2: 00007f5b2bfffd98 CR3: 00000001f793e006 CR4: > > > > >> > 00000000007726e0 > > > > >> > [ 377.472377] DR0: 0000000000000000 DR1: 0000000000000000 DR2: > > > > >> > 0000000000000000 > > > > >> > [ 377.479556] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: > > > > >> > 0000000000000400 > > > > >> > [ 377.486738] PKRU: 55555554 > > > > >> > [ 377.489465] Kernel panic - not syncing: Fatal exception > > > > >> > [ 377.573911] Kernel Offset: 0xa000000 from 0xffffffff81000000 > > > > >> (relocation > > > > >> > range: 0xffffffff80000000-0xffffffffbfffffff) > > > > >> > [ 377.716482] ---[ end Kernel panic - not syncing: Fatal > exception ]--- > > > > Disassembly of the vmlinux Igor sent (along with other info) confirmed > > something I suspected, that R08: fffff49747fa8080 in one of the dumps, > > R08: ffffdf57428d8080 in the other, is the relevant struct page pointer > > (and RAX the page->flags, which look like it was pointing at a good > page). > > > > A page pointer ....8080 in pmd_migration_entry_wait() is interesting: > > normally I'd expect that to be ....0000 or ....8000, pointing to the > > head of a huge page. But instead it's pointing to the second tail > > (though by now that compound page has been freed, and head pointers in > > the tails reset to 0): as if the pfn has been incremented by 2 somehow. 
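The "second tail" reading of the R08 values can be checked with a little pointer arithmetic. This is a minimal sketch, assuming that sizeof(struct page) is 64 bytes and that the vmemmap base is aligned so that the head struct page of a 2MB compound page sits on a 0x8000 boundary (512 * 64). Neither figure is stated in the thread, and tail_index() is a hypothetical helper, not a kernel function.

```c
#include <stdint.h>

/* Assumption: struct page is 64 bytes on this kernel config. */
#define STRUCT_PAGE_SIZE   64ULL
/* A 2MB huge page covers 512 base pages of 4KB each. */
#define PAGES_PER_HUGEPAGE 512ULL

/*
 * Hypothetical helper: given a struct page pointer, return its index
 * within the 2MB compound page it would belong to (0 is the head).
 * Relies on the assumed alignment of the vmemmap base.
 */
static inline uint64_t tail_index(uint64_t page_ptr)
{
    uint64_t span = STRUCT_PAGE_SIZE * PAGES_PER_HUGEPAGE; /* 0x8000 */

    return (page_ptr % span) / STRUCT_PAGE_SIZE;
}
```

Under those assumptions, both R08 values from the dumps (fffff49747fa8080 and ffffdf57428d8080) give index 2, which is consistent with a pfn two beyond the head of the huge page.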
> >
> > And if the pfn (swp_offset) in the migration entry has got corrupted,
> > then it's no surprise that when removing migration entries,
> > page_vma_mapped_walk() would see migration_entry_to_page(entry) != page,
> > so be unable to replace that migration entry, leaving it behind for the
> > user to hit BUG_ON(!PageLocked) in pmd_migration_entry_wait() when
> > faulting on it later.
> >
> > So, what might increment the swp_offset by 2? Hunt around the encodings.
> > Hmm, _PAGE_BIT_UFFD_WP is _PAGE_BIT_SOFTW2 which is bit 10, whereas
> > _PAGE_BIT_PROTNONE (top bit to be avoided in pte encoding of swap)
> > is _PAGE_BIT_GLOBAL is bit 8. After overcoming off-by-one confusions,
> > it looks like if something somewhere were to set _PAGE_BIT_UFFD_WP
> > in a migration pmd (whereas it's only suitable for a present pmd),
> > it would indeed increment the swp_offset by 2.
> >
> > Hunt for uffd_wps, and run across copy_huge_pmd() in mm/huge_memory.c:
> > in Igor's 5.13.1 and 5.12.14 and many others, that says
> >         if (!(vma->vm_flags & VM_UFFD_WP))
> >                 pmd = pmd_clear_uffd_wp(pmd);
> > just *before* checking is_swap_pmd(). Fixed in 5.14-rc1 in commit
> > 8f34f1eac382 ("mm/userfaultfd: fix uffd-wp special cases for fork()").
> >
> > But clearing the bit would be harmless, wouldn't it? Because it wouldn't
> > be set anyway. Waste a day before remembering what I never forgot but
> > somehow blanked out: the L1TF "feature" forced us to invert the offset
> > bits in the pte encoding of a swap entry, so there really is a bit set
> > there in the pmd entry, and clearing it has the effect of setting it in
> > the corresponding swap entry, so incrementing the migration pfn by 2.
> >
> > I cannot explain why Igor never saw this crash on 5.12.12: maybe
> > something else in the environment changed around that time. And it
> > will take several days for it to be confirmed as the fix in practice.
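The effect of the inverted offset bits can be reproduced in a few lines. This is a minimal sketch, assuming the x86-64 layout implied by the description above: swap type in bits 59-63, inverted offset starting at bit 9 (one above _PAGE_BIT_PROTNONE), and _PAGE_BIT_UFFD_WP at bit 10. The function names are hypothetical stand-ins, not the kernel's own macros.

```c
#include <stdint.h>

/* Assumed layout, modelled on the description in this thread. */
#define SWP_TYPE_BITS        5
#define SWP_OFFSET_FIRST_BIT 9   /* _PAGE_BIT_PROTNONE (8) + 1 */
#define SWP_OFFSET_SHIFT     (SWP_OFFSET_FIRST_BIT + SWP_TYPE_BITS)
#define PAGE_BIT_UFFD_WP     10  /* same bit as _PAGE_BIT_SOFTW2 */

/* Encode a swap entry: the offset is stored inverted (L1TF mitigation). */
static inline uint64_t swp_encode(uint64_t type, uint64_t offset)
{
    return ((~offset << SWP_OFFSET_SHIFT) >> SWP_TYPE_BITS) |
           (type << (64 - SWP_TYPE_BITS));
}

/* Decode the offset: drop the type bits, undo the inversion. */
static inline uint64_t swp_offset(uint64_t val)
{
    return (~val << SWP_TYPE_BITS) >> SWP_OFFSET_SHIFT;
}

/* What clearing uffd-wp does to the raw entry, present or not. */
static inline uint64_t clear_uffd_wp(uint64_t val)
{
    return val & ~(1ULL << PAGE_BIT_UFFD_WP);
}
```

For a hypothetical 2MB-aligned pfn such as 0x123400 and an arbitrary type value of 2, swp_offset(swp_encode(2, 0x123400)) gives back 0x123400, while swp_offset(clear_uffd_wp(swp_encode(2, 0x123400))) gives 0x123402. Offset bit 1 is zero for an aligned head pfn, so it is stored as a set bit 10, and clearing that bit turns the stored zero into a one on decode.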
> >
> > But I'm confident that 8f34f1eac382 will prove to be the fix, so Peter
> > please prepare some backports of that for the various stable/longterm
> > kernels that need it - I've not looked into whether it applies cleanly,
> > or depends on other commits too. You fixed several related but different
> > things in that commit: but this one is the worst, because it can corrupt
> > even those who are not using UFFD_WP at all.
>
> Looks right to me, b569a1760782 ("userfaultfd: wp: drop _PAGE_UFFD_WP properly
> when fork", 2020-04-07) seems to be the culprit. I didn't notice the side
> effect in the bug or in the fix, or it should have already landed in stable.
> I am very sorry for such a preliminary bug that caused this fallout - I really
> can't tell why I completely didn't look at is_swap_pte() that's so obvious
> indeed.
>
> I checked it up, 5.6.y doesn't have the issue commit yet as it's not marked as
> "fixes". It started to show up in 5.7.y~5.13.y. 5.14-rc1 has 8f34f1eac382 which
> is the fix. So I think we need the fix or equivalent fix for 5.7.y~5.13.y.
>
> 5.12.y & 5.13.y can pick up the fix 8f34f1eac382 cleanly. For the older ones
> (5.7.y~5.11.y) they can't. I plan to revert b569a1760782 instead.
>

FTR, even though 8f34f1eac382 applies cleanly, it does not compile. The 1st
patch of that series is also required (5fc7a5f6fd04) - it removes the uses
of *vma, which is later removed by the patch that fixes the actual problem.

>
> >
> > Many thanks for reporting and helping, Igor.
> > Hugh
> >
> > p.s. Peter, unrelated to this particular bug, and should not divert from
> > fixing it: but looking again at those swap encodings, and particularly
> > the soft_dirty manipulations: they look very fragile. I think uffd_wp
> > was wrong to follow that bad example, and your upcoming new encoding
> > (that I have previously called elegant) takes it a worse step further.
> >
> > I think we should change to a rule where the architecture-independent
> > swp_entry_t contains *all* the info, including bits for soft_dirty and
> > uffd_wp, so that swap entry cases can move immediately to decoding from
> > arch-dependent pte to arch-independent swp_entry_t, and do all the
> > manipulations on that. But I don't have time to make that change, and
> > probably neither do you, and making the change is liable to introduce
> > errors itself. So, no immediate plans, but please keep in mind.
>
> Curious: did we encounter a similar issue previously where the soft dirty bit
> is applied wrongly so causing hard-to-debug issues?
>
> If this is destined to be the best solution, I can work on both of them. I am
> just worried that's too big a change as you said so we don't know what's the
> most efficient considering total time we use to develop, review and debug them.
>
> The other alternative is we fix bugs; I know that's so cheap a word when I said
> it, however we still can't deny it as an option yet.
>
> We can definitely discuss this out of this thread and I'll prepare the backport
> first. For all the cases, this bug definitely brings some alert, and I'll keep
> that in mind.
>
> Please let me know if there's any comment on the backport plan above, or I'll
> prepare the patches for all the branches before tomorrow.
>
> Thanks,
>
> --
> Peter Xu
>
>

--
Igor Raits
Sr. SW Engineer
igor@gooddata.com
+420 775 117 817
Moravske namesti 1007/14
602 00 Brno-Veveri, Czech Republic
Twitter | Facebook | LinkedIn | Blog

--00000000000013642705c7894307
Content-Type: text/html; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

On Tue, Jul 20, 2021 at 12:13 AM Peter Xu <peterx@redhat.com> wrote:
On Mon, Jul 19, 2021 at 12:11:21PM -070= 0, Hugh Dickins wrote:
> Hi Peter,

Hi, Hugh,

>
> I believe you have already fixed this, but the fix needs to go to stab= le.
> Sorry, the messages below are a muddle of top and middle posting,
> I'll resume at the bottom.
>
> On Fri, 16 Jul 2021, Hugh Dickins wrote:
> > On Thu, 15 Jul 2021, Igor Raits wrote:
> >
> > > Hi everyone again,
> > >
> > > I've been trying to reproduce this issue but still can&#= 39;t find a consistent
> > > pattern.
> > >
> > > However, it did happen once more and this time on 5.13.1: > >
> > Thanks for the updates, Igor.
> >
> > I have to admit that what you have reported confirms the suspicio= n
> > that it's a bug introduced by one of my "stable" pa= tches in 5.12.14
> > (which are also in 5.13): nothing else between 5.12.12 and 5.12.1= 4
> > seems likely to be relevant.
> >
> > But I've gone back and forth and not been able to spot the pr= oblem.
> >
> > Please would you send (either privately to me, or to the list) yo= ur
> > 5.13.1 kernel's .config, and disassembly of pmd_migration_ent= ry_wait()
> > from its vmlinux (with line numbers if available; or just send th= e
> > whole vmlinux if that's easier, and I'll disassemble). > >
> > I am hoping that the disassembly, together with the register cont= ents
> > that you've shown, will help guide towards an answer.
> >
> > Thanks,
> > Hugh
> >
> > >
> > > [=C2=A0 222.068216] ------------[ cut here ]------------
> > > [=C2=A0 222.072884] kernel BUG at include/linux/swapops.h:20= 4!
> > > [=C2=A0 222.078062] invalid opcode: 0000 [#1] SMP NOPTI
> > > [=C2=A0 222.082618] CPU: 38 PID: 9828 Comm: rpc-worker Taint= ed: G=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 E
> > >=C2=A0 =C2=A05.13.1-1.gdc.el8.x86_64 #1
> > > [=C2=A0 222.091894] Hardware name: HPE ProLiant DL380 Gen10/= ProLiant DL380
> > > Gen10, BIOS U30 05/24/2021
> > > [=C2=A0 222.100468] RIP: 0010:pmd_migration_entry_wait+0x132= /0x140
> > > [=C2=A0 222.105994] Code: 02 00 00 00 5b 4c 89 c7 5d e9 ca c= 5 f6 ff 48 81 e2 00
> > > f0 ff ff 48 f7 d2 48 21 c2 89 d1 f7 c2 81 01 00 00 75 80 e9 = 44 ff ff ff
> > > <0f> 0b 48 8b 2d 65 48 30 01 e9 ef fe ff ff 0f 1f 44 0= 0 00 41 55 48
> > > [=C2=A0 222.124878] RSP: 0000:ffffbcfe9eb7bdd8 EFLAGS: 00010= 246
> > > [=C2=A0 222.130134] RAX: 0057ffffc0000000 RBX: ffff9eec4b3cd= bf8 RCX:
> > > ffffffffffffffff
> > > [=C2=A0 222.137309] RDX: 0000000000000000 RSI: ffff9eec4b3cd= bf8 RDI:
> > > ffffdf55c52cf368
> > > [=C2=A0 222.144485] RBP: ffffdf55c52cf368 R08: ffffdf57428d8= 080 R09:
> > > 0000000000000000
> > > [=C2=A0 222.151661] R10: 0000000000000000 R11: 0000000000000= 000 R12:
> > > 0000000000000bf8
> > > [=C2=A0 222.158837] R13: 0400000000000000 R14: 0400000000000= 080 R15:
> > > ffff9eec2825b1f8
> > > [=C2=A0 222.166015] FS:=C2=A0 00007f6754aeb700(0000) GS:ffff= 9f49bfd00000(0000)
> > > knlGS:0000000000000000
> > > [=C2=A0 222.174153] CS:=C2=A0 0010 DS: 0000 ES: 0000 CR0: 00= 00000080050033
> > > [=C2=A0 222.179932] CR2: 00007f676ffffd98 CR3: 000000012bf6a= 002 CR4:
> > > 00000000007726e0
> > > [=C2=A0 222.187109] DR0: 0000000000000000 DR1: 0000000000000= 000 DR2:
> > > 0000000000000000
> > > [=C2=A0 222.194283] DR3: 0000000000000000 DR6: 00000000fffe0= ff0 DR7:
> > > 0000000000000400
> > > [=C2=A0 222.201457] PKRU: 55555554
> > > [  222.204178] Call Trace:
> > > [  222.206638]  __handle_mm_fault+0x5ad/0x6e0
> > > [  222.210760]  ? sysvec_call_function_single+0xb/0x90
> > > [  222.215672]  handle_mm_fault+0xc5/0x290
> > > [  222.219529]  do_user_addr_fault+0x1a9/0x660
> > > [  222.223740]  ? sched_clock_cpu+0xc/0xa0
> > > [  222.227602]  exc_page_fault+0x68/0x130
> > > [  222.231373]  ? asm_exc_page_fault+0x8/0x30
> > > [  222.235495]  asm_exc_page_fault+0x1e/0x30
> > > [  222.239526] RIP: 0033:0x7f67baaed734
> > > [  222.243120] Code: 89 08 48 8b 35 dd 3b 21 00 4c 8d 0d d6 3b 21 00 31 c0
> > > 4c 39 ce 74 73 0f 1f 80 00 00 00 00 48 8d 96 40 fd ff ff 49 39 d2 74 22
> > > <48> 8b 96 d8 03 00 00 48 01 15 4e 7c 21 00 80 be 50 03 00 00 00 c7
> > > [  222.262002] RSP: 002b:00007f6754aea298 EFLAGS: 00010287
> > > [  222.267257] RAX: 0000000000000000 RBX: 0000000000000000 RCX:
> > > 0000000000000000
> > > [  222.274432] RDX: 00007f676ffff700 RSI: 00007f676ffff9c0 RDI:
> > > 00007f676f7fec10
> > > [  222.281609] RBP: 0000000000000001 R08: 00007f676f7fed10 R09:
> > > 00007f67bad012f0
> > > [  222.288785] R10: 00007f6754aeb700 R11: 0000000000000202 R12:
> > > 0000000000000001
> > > [  222.295961] R13: 0000000000000006 R14: 0000000000000e28 R15:
> > > 00007f674006e1f0
> > > [  222.303137] Modules linked in: vhost_net(E) vhost(E) vhost_iotlb(E)
> > > tap(E) tun(E) tcp_diag(E) udp_diag(E) inet_diag(E) netconsole(E)
> > > nf_tables(E) vxlan(E) ip6_udp_tunnel(E) udp_tunnel(E) nfnetlink(E)
> > > binfmt_misc(E) iscsi_tcp(E) libiscsi_tcp(E) 8021q(E) garp(E) mrp(E)
> > > bonding(E) tls(E) vfat(E) fat(E) dm_service_time(E) dm_multipath(E)
> > > rpcrdma(E) sunrpc(E) rdma_ucm(E) ib_srpt(E) ib_isert(E) iscsi_target_mod(E)
> > > target_core_mod(E) ib_iser(E) rdma_cm(E) iw_cm(E) ib_cm(E) libiscsi(E)
> > > intel_rapl_msr(E) intel_rapl_common(E) scsi_transport_iscsi(E)
> > > isst_if_common(E) ipmi_ssif(E) nfit(E) libnvdimm(E) x86_pkg_temp_thermal(E)
> > > intel_powerclamp(E) coretemp(E) kvm_intel(E) kvm(E) irqbypass(E)
> > > crct10dif_pclmul(E) crc32_pclmul(E) ghash_clmulni_intel(E) rapl(E) qedr(E)
> > > mei_me(E) acpi_ipmi(E) ib_uverbs(E) intel_cstate(E) ipmi_si(E) ib_core(E)
> > > ipmi_devintf(E) dm_mod(E) ioatdma(E) ses(E) intel_uncore(E) pcspkr(E)
> > > enclosure(E) mei(E) hpwdt(E) hpilo(E) lpc_ich(E) intel_pch_thermal(E)
> > > dca(E) ipmi_msghandler(E)
> > > [  222.303181]  acpi_power_meter(E) ext4(E) mbcache(E) jbd2(E) sd_mod(E)
> > > t10_pi(E) sg(E) qedf(E) qede(E) libfcoe(E) qed(E) libfc(E) smartpqi(E)
> > > scsi_transport_fc(E) tg3(E) scsi_transport_sas(E) crc8(E) wmi(E)
> > > nf_conntrack(E) libcrc32c(E) crc32c_intel(E) nf_defrag_ipv6(E)
> > > nf_defrag_ipv4(E) br_netfilter(E) bridge(E) stp(E) llc(E)
> > > [  222.420050] ---[ end trace bcf7b6d1610cc21f ]---
> > > [  222.572925] RIP: 0010:pmd_migration_entry_wait+0x132/0x140
> > > [  222.578469] Code: 02 00 00 00 5b 4c 89 c7 5d e9 ca c5 f6 ff 48 81 e2 00
> > > f0 ff ff 48 f7 d2 48 21 c2 89 d1 f7 c2 81 01 00 00 75 80 e9 44 ff ff ff
> > > <0f> 0b 48 8b 2d 65 48 30 01 e9 ef fe ff ff 0f 1f 44 00 00 41 55 48
> > > [  222.597359] RSP: 0000:ffffbcfe9eb7bdd8 EFLAGS: 00010246
> > > [  222.602620] RAX: 0057ffffc0000000 RBX: ffff9eec4b3cdbf8 RCX:
> > > ffffffffffffffff
> > > [  222.609807] RDX: 0000000000000000 RSI: ffff9eec4b3cdbf8 RDI:
> > > ffffdf55c52cf368
> > > [  222.616990] RBP: ffffdf55c52cf368 R08: ffffdf57428d8080 R09:
> > > 0000000000000000
> > > [  222.624177] R10: 0000000000000000 R11: 0000000000000000 R12:
> > > 0000000000000bf8
> > > [  222.631361] R13: 0400000000000000 R14: 0400000000000080 R15:
> > > ffff9eec2825b1f8
> > > [  222.638548] FS:  00007f6754aeb700(0000) GS:ffff9f49bfd00000(0000)
> > > knlGS:0000000000000000
> > > [  222.646694] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > [  222.652481] CR2: 00007f676ffffd98 CR3: 000000012bf6a002 CR4:
> > > 00000000007726e0
> > > [  222.659665] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> > > 0000000000000000
> > > [  222.666850] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
> > > 0000000000000400
> > > [  222.674031] PKRU: 55555554
> > > [  222.676758] Kernel panic - not syncing: Fatal exception
> > > [  222.817538] Kernel Offset: 0x16000000 from 0xffffffff81000000
> > > (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
> > > [  222.965540] ---[ end Kernel panic - not syncing: Fatal exception ]---
> > >
> > > On Sun, Jul 11, 2021 at 8:06 AM Igor Raits <igor@gooddata.com> wrote:
> > >
> > > > Hi Hugh,
> > > >
> > > > On Sun, Jul 11, 2021 at 6:17 AM Hugh Dickins <hughd@google.com> wrote:
> > > >
> > > >> On Sat, 10 Jul 2021, Igor Raits wrote:
> > > >>
> > > >> > Hello,
> > > >> >
> > > >> > I've seen one weird bug on 5.12.14 that happened a couple of times when I
> > > >> > started a bunch of VMs on a server.
> > > >>
> > > >> Would it be possible for you to try the same on a 5.12.13 kernel?
> > > >> Perhaps by reverting the diff between 5.12.13 and 5.12.14 temporarily.
> > > >> Enough to form an impression of whether the issue is new in 5.12.14.
> > > >>
> > > >
> > > > We've been using 5.12.12 for quite some time (~ a month) and I never saw
> > > > it there.
> > > >
> > > > But I have to admit that I don't really have a reproducer. For example, on
> > > > the servers where it happened, I just rebooted them and the panic did not
> > > > happen anymore (so I saw it only once, and only on 2 servers out of the 32
> > > > that we have on 5.12.14).
> > > >
> > > >
> > > >> I ask because 5.12.14 did include several fixes and cleanups from me
> > > >> to page_vma_mapped_walk(), and that is involved in inserting and
> > > >> removing pmd migration entries.  I am not aware of introducing any
> > > >> bug there, but your report has got me worried.  If it's happening in
> > > >> 5.12.14 but not in 5.12.13, then I must look again at my changes.
> > > >>
> > > >> I don't expect Hillf's patch to help at all: the pmd_lock()
> > > >> is supposed to be taken by page_vma_mapped_walk(), before
> > > >> set_pmd_migration_entry() and remove_migration_pmd() are called.
> > > >>
> > > >> Thanks,
> > > >> Hugh
> > > >>
> > > >> >
> > > >> > I've briefly googled this problem but could not find any relevant commit
> > > >> > that would fix this issue.
> > > >> >
> > > >> > Do you have any hint how to debug this further or know the fix by any
> > > >> > chance?
> > > >> >
> > > >> > Thanks in advance. Stack trace following:
> > > >> >
> > > >> > [  376.876610] ------------[ cut here ]------------
> > > >> > [  376.881274] kernel BUG at include/linux/swapops.h:204!
> > > >> > [  376.886455] invalid opcode: 0000 [#1] SMP NOPTI
> > > >> > [  376.891014] CPU: 40 PID: 11775 Comm: rpc-worker Tainted: G
> > > >>    E
> > > >> >      5.12.14-1.gdc.el8.x86_64 #1
> > > >> > [  376.900464] Hardware name: HPE ProLiant DL380 Gen10/ProLiant DL380
> > > >> > Gen10, BIOS U30 05/24/2021
> > > >> > [  376.909038] RIP: 0010:pmd_migration_entry_wait+0x132/0x140
> > > >> > [  376.914562] Code: 02 00 00 00 5b 4c 89 c7 5d e9 8a e4 f6 ff 48 81 e2
> > > >> 00
> > > >> > f0 ff ff 48 f7 d2 48 21 c2 89 d1 f7 c2 81 01 00 00 75 80 e9 44 ff ff ff
> > > >> > <0f> 0b 48 8b 2d 75 bd 30 01 e9 ef fe ff ff 0f 1f 44 00 00 41 55 48
> > > >> > [  376.933443] RSP: 0000:ffffb65a5e1cfdc8 EFLAGS: 00010246
> > > >> > [  376.938701] RAX: 0017ffffc0000000 RBX: ffff908b8ecabaf8 RCX:
> > > >> > ffffffffffffffff
> > > >> > [  376.945878] RDX: 0000000000000000 RSI: ffff908b8ecabaf8 RDI:
> > > >> > fffff497473b2ae8
> > > >> > [  376.953055] RBP: fffff497473b2ae8 R08: fffff49747fa8080 R09:
> > > >> > 0000000000000000
> > > >> > [  376.960230] R10: 0000000000000000 R11: 0000000000000000 R12:
> > > >> > 0000000000000af8
> > > >> > [  376.967407] R13: 0400000000000000 R14: 0400000000000080 R15:
> > > >> > ffff908bbef7b6a8
> > > >> > [  376.974582] FS:  00007f5bb1f81700(0000) GS:ffff90e87fd80000(0000)
> > > >> > knlGS:0000000000000000
> > > >> > [  376.982718] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > >> > [  376.988497] CR2: 00007f5b2bfffd98 CR3: 00000001f793e006 CR4:
> > > >> > 00000000007726e0
> > > >> > [  376.995673] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> > > >> > 0000000000000000
> > > >> > [  377.002849] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
> > > >> > 0000000000000400
> > > >> > [  377.010026] PKRU: 55555554
> > > >> > [  377.012745] Call Trace:
> > > >> > [  377.015207]  __handle_mm_fault+0x5ad/0x6e0
> > > >> > [  377.019335]  handle_mm_fault+0xc5/0x290
> > > >> > [  377.023194]  do_user_addr_fault+0x1cd/0x740
> > > >> > [  377.027406]  exc_page_fault+0x54/0x110
> > > >> > [  377.031182]  ? asm_exc_page_fault+0x8/0x30
> > > >> > [  377.035307]  asm_exc_page_fault+0x1e/0x30
> > > >> > [  377.039340] RIP: 0033:0x7f5bb91d6734
> > > >> > [  377.042937] Code: 89 08 48 8b 35 dd 3b 21 00 4c 8d 0d d6 3b 21 00 31
> > > >> c0
> > > >> > 4c 39 ce 74 73 0f 1f 80 00 00 00 00 48 8d 96 40 fd ff ff 49 39 d2 74 22
> > > >> > <48> 8b 96 d8 03 00 00 48 01 15 4e 7c 21 00 80 be 50 03 00 00 00 c7
> > > >> > [  377.061820] RSP: 002b:00007f5bb1f7ff58 EFLAGS: 00010206
> > > >> > [  377.067076] RAX: 0000000000000000 RBX: 0000000000000000 RCX:
> > > >> > 00007f5ba0000020
> > > >> > [  377.074255] RDX: 00007f5b2bfff700 RSI: 00007f5b2bfff9c0 RDI:
> > > >> > 0000000000000001
> > > >> > [  377.081429] RBP: 0000000000000001 R08: 0000000000000000 R09:
> > > >> > 00007f5bb93ea2f0
> > > >> > [  377.088606] R10: 00007f5bb1f81700 R11: 0000000000000202 R12:
> > > >> > 0000000000000001
> > > >> > [  377.095782] R13: 0000000000000006 R14: 0000000000000cb4 R15:
> > > >> > 00007f5bb1f801f0
> > > >> > [  377.102958] Modules linked in: ebt_arp(E) nft_meta_bridge(E)
> > > >> > ip6_tables(E) xt_CT(E) nf_log_ipv4(E) nf_log_common(E) nft_limit(E)
> > > >> > nft_counter(E) xt_LOG(E) xt_limit(E) xt_mac(E) xt_set(E) xt_multiport(E)
> > > >> > xt_state(E) xt_conntrack(E) xt_comment(E) xt_physdev(E) nft_compat(E)
> > > >> > ip_set_hash_net(E) ip_set(E) vhost_net(E) vhost(E) vhost_iotlb(E) tap(E)
> > > >> > tun(E) tcp_diag(E) udp_diag(E) inet_diag(E) netconsole(E) nf_tables(E)
> > > >> > vxlan(E) ip6_udp_tunnel(E) udp_tunnel(E) nfnetlink(E) binfmt_misc(E)
> > > >> > iscsi_tcp(E) libiscsi_tcp(E) 8021q(E) garp(E) mrp(E) bonding(E) tls(E)
> > > >> > vfat(E) fat(E) dm_service_time(E) dm_multipath(E) rpcrdma(E) sunrpc(E)
> > > >> > rdma_ucm(E) ib_srpt(E) ib_isert(E) iscsi_target_mod(E)
> > > >> target_core_mod(E)
> > > >> > ib_iser(E) rdma_cm(E) iw_cm(E) ib_cm(E) libiscsi(E)
> > > >> scsi_transport_iscsi(E)
> > > >> > intel_rapl_msr(E) qedr(E) intel_rapl_common(E) ib_uverbs(E)
> > > >> > isst_if_common(E) ib_core(E) nfit(E) libnvdimm(E)
> > > >> x86_pkg_temp_thermal(E)
> > > >> > intel_powerclamp(E) coretemp(E) kvm_intel(E) kvm(E) irqbypass(E)
> > > >> > crct10dif_pclmul(E)
> > > >> > [  377.102999]  crc32_pclmul(E) ghash_clmulni_intel(E) rapl(E)
> > > >> > intel_cstate(E) ipmi_ssif(E) acpi_ipmi(E) ipmi_si(E) mei_me(E)
> > > >> ioatdma(E)
> > > >> > ipmi_devintf(E) dm_mod(E) ses(E) intel_uncore(E) pcspkr(E) qede(E)
> > > >> > enclosure(E) tg3(E) mei(E) lpc_ich(E) hpilo(E) hpwdt(E)
> > > >> > intel_pch_thermal(E) dca(E) ipmi_msghandler(E) acpi_power_meter(E)
> > > >> ext4(E)
> > > >> > mbcache(E) jbd2(E) sd_mod(E) t10_pi(E) sg(E) qedf(E) qed(E) crc8(E)
> > > >> > libfcoe(E) libfc(E) smartpqi(E) scsi_transport_fc(E)
> > > >> scsi_transport_sas(E)
> > > >> > wmi(E) nf_conntrack(E) nf_defrag_ipv6(E) libcrc32c(E) crc32c_intel(E)
> > > >> > nf_defrag_ipv4(E) br_netfilter(E) bridge(E) stp(E) llc(E)
> > > >> > [  377.243468] ---[ end trace 04bce3bb051f7620 ]---
> > > >> > [  377.385645] RIP: 0010:pmd_migration_entry_wait+0x132/0x140
> > > >> > [  377.391194] Code: 02 00 00 00 5b 4c 89 c7 5d e9 8a e4 f6 ff 48 81 e2
> > > >> 00
> > > >> > f0 ff ff 48 f7 d2 48 21 c2 89 d1 f7 c2 81 01 00 00 75 80 e9 44 ff ff ff
> > > >> > <0f> 0b 48 8b 2d 75 bd 30 01 e9 ef fe ff ff 0f 1f 44 00 00 41 55 48
> > > >> > [  377.410091] RSP: 0000:ffffb65a5e1cfdc8 EFLAGS: 00010246
> > > >> > [  377.415355] RAX: 0017ffffc0000000 RBX: ffff908b8ecabaf8 RCX:
> > > >> > ffffffffffffffff
> > > >> > [  377.422540] RDX: 0000000000000000 RSI: ffff908b8ecabaf8 RDI:
> > > >> > fffff497473b2ae8
> > > >> > [  377.429721] RBP: fffff497473b2ae8 R08: fffff49747fa8080 R09:
> > > >> > 0000000000000000
> > > >> > [  377.436902] R10: 0000000000000000 R11: 0000000000000000 R12:
> > > >> > 0000000000000af8
> > > >> > [  377.444086] R13: 0400000000000000 R14: 0400000000000080 R15:
> > > >> > ffff908bbef7b6a8
> > > >> > [  377.451272] FS:  00007f5bb1f81700(0000) GS:ffff90e87fd80000(0000)
> > > >> > knlGS:0000000000000000
> > > >> > [  377.459415] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > >> > [  377.465196] CR2: 00007f5b2bfffd98 CR3: 00000001f793e006 CR4:
> > > >> > 00000000007726e0
> > > >> > [  377.472377] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> > > >> > 0000000000000000
> > > >> > [  377.479556] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
> > > >> > 0000000000000400
> > > >> > [  377.486738] PKRU: 55555554
> > > >> > [  377.489465] Kernel panic - not syncing: Fatal exception
> > > >> > [  377.573911] Kernel Offset: 0xa000000 from 0xffffffff81000000
> > > >> (relocation
> > > >> > range: 0xffffffff80000000-0xffffffffbfffffff)
> > > >> > [  377.716482] ---[ end Kernel panic - not syncing: Fatal exception ]---
>
> Disassembly of the vmlinux Igor sent (along with other info) confirmed
> something I suspected, that R08: fffff49747fa8080 in one of the dumps,
> R08: ffffdf57428d8080 in the other, is the relevant struct page pointer
> (and RAX the page->flags, which looks like it was pointing at a good page).
>
> A page pointer ....8080 in pmd_migration_entry_wait() is interesting:
> normally I'd expect that to be ....0000 or ....8000, pointing to the
> head of a huge page.  But instead it's pointing to the second tail
> (though by now that compound page has been freed, and head pointers in
> the tails reset to 0): as if the pfn has been incremented by 2 somehow.
>
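
That reading of the pointer can be checked with a little arithmetic. The sketch below is only an illustration: it assumes a 64-byte struct page, 512 base pages per PMD-sized huge page, and a vmemmap base aligned to at least 512 * 64 = 0x8000 bytes (typical x86_64 values, none of them stated in this thread), and `subpage_index()` is a made-up helper name.

```c
#include <stdint.h>

/* Assumptions (not from the thread): typical x86_64 values. */
#define STRUCT_PAGE_SIZE 64u   /* sizeof(struct page) */
#define PAGES_PER_HUGE   512u  /* 2MB huge page / 4kB base page */

/*
 * Index of a struct page within its huge-page-aligned group of 512,
 * derived from the struct page address alone.  Relies on the vmemmap
 * base being aligned to 512 * 64 = 0x8000 bytes.
 */
unsigned int subpage_index(uint64_t page_ptr)
{
    return (unsigned int)((page_ptr / STRUCT_PAGE_SIZE) % PAGES_PER_HUGE);
}
```

Under those assumptions, both R08 values from the dumps (fffff49747fa8080 and ffffdf57428d8080) give index 2, whereas a pointer ending in ....0000 or ....8000 gives index 0, the head.
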
> And if the pfn (swp_offset) in the migration entry has got corrupted,
> then it's no surprise that when removing migration entries,
> page_vma_mapped_walk() would see migration_entry_to_page(entry) != page,
> so be unable to replace that migration entry, leaving it behind for the
> user to hit BUG_ON(!PageLocked) in pmd_migration_entry_wait() when
> faulting on it later.
>
> So, what might increment the swp_offset by 2? Hunt around the encodings.
> Hmm, _PAGE_BIT_UFFD_WP is _PAGE_BIT_SOFTW2 which is bit 10, whereas
> _PAGE_BIT_PROTNONE (top bit to be avoided in pte encoding of swap)
> is _PAGE_BIT_GLOBAL, which is bit 8. After overcoming off-by-one confusions,
> it looks like if something somewhere were to set _PAGE_BIT_UFFD_WP
> in a migration pmd (whereas it's only suitable for a present pmd),
> it would indeed increment the swp_offset by 2.
>
> Hunt for uffd_wps, and run across copy_huge_pmd() in mm/huge_memory.c:
> in Igor's 5.13.1 and 5.12.14 and many others, that says
>       if (!(vma->vm_flags & VM_UFFD_WP))
>               pmd = pmd_clear_uffd_wp(pmd);
> just *before* checking is_swap_pmd(). Fixed in 5.14-rc1 in commit
> 8f34f1eac382 ("mm/userfaultfd: fix uffd-wp special cases for fork()").
>
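
To make the ordering problem concrete, here is a user-space model, not the kernel code. It assumes bit 10 is the uffd-wp flag of a present entry (as stated above) and that a non-present entry keeps its uffd-wp flag in bit 2 (_PAGE_USER); the latter is an assumption about the x86 headers of that era and should be checked against the tree. All function and macro names are invented for the illustration.

```c
#include <stdbool.h>
#include <stdint.h>

#define BIT64(n)          (UINT64_C(1) << (n))
#define PAGE_PRESENT      BIT64(0)
#define PAGE_PSE          BIT64(7)
#define PAGE_PROTNONE     BIT64(8)
#define PAGE_UFFD_WP      BIT64(10) /* uffd-wp flag of a present entry */
#define PAGE_SWP_UFFD_WP  BIT64(2)  /* assumed: uffd-wp flag of a swap entry */

/* Rough stand-in for pmd_present(). */
bool entry_present(uint64_t pmd)
{
    return (pmd & (PAGE_PRESENT | PAGE_PROTNONE | PAGE_PSE)) != 0;
}

/* Old order: bit 10 is cleared before the entry type is known. */
uint64_t fork_copy_old(uint64_t pmd, bool vma_has_uffd_wp)
{
    if (!vma_has_uffd_wp)
        pmd &= ~PAGE_UFFD_WP;
    return pmd;
}

/* Fixed order: decide present vs. swap first, then clear the matching flag. */
uint64_t fork_copy_fixed(uint64_t pmd, bool vma_has_uffd_wp)
{
    if (vma_has_uffd_wp)
        return pmd;
    if (entry_present(pmd))
        return pmd & ~PAGE_UFFD_WP;
    return pmd & ~PAGE_SWP_UFFD_WP;
}
```

In this model, a non-present entry that happens to have bit 10 set comes out of `fork_copy_old()` with that bit cleared even when the vma never used uffd-wp, while `fork_copy_fixed()` leaves its offset bits alone.
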
> But clearing the bit would be harmless, wouldn't it? Because it wouldn't
> be set anyway. Waste a day before remembering what I never forgot but
> somehow blanked out: the L1TF "feature" forced us to invert the offset
> bits in the pte encoding of a swap entry, so there really is a bit set
> there in the pmd entry, and clearing it has the effect of setting it in
> the corresponding swap entry, so incrementing the migration pfn by 2.
>
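
The effect of the inversion can be reproduced with a small user-space model. This is a sketch under stated assumptions: 5 type bits at the top, the offset stored inverted starting just above _PAGE_BIT_PROTNONE (so from bit 9), which is how the x86_64 layout is understood from the description above. The helper names are invented, and the type value and pfn used in the note below are arbitrary.

```c
#include <stdint.h>

/* Bit positions as quoted in the thread; layout constants are assumptions. */
#define PAGE_BIT_PROTNONE    8   /* _PAGE_BIT_GLOBAL */
#define PAGE_BIT_UFFD_WP     10  /* _PAGE_BIT_SOFTW2 */
#define SWP_TYPE_BITS        5
#define SWP_OFFSET_FIRST_BIT (PAGE_BIT_PROTNONE + 1)
#define SWP_OFFSET_SHIFT     (SWP_OFFSET_FIRST_BIT + SWP_TYPE_BITS)

/* Arch-level entry: type in the top 5 bits, inverted offset from bit 9 up. */
uint64_t swp_encode(uint64_t type, uint64_t offset)
{
    return ((~offset << SWP_OFFSET_SHIFT) >> SWP_TYPE_BITS) |
           (type << (64 - SWP_TYPE_BITS));
}

uint64_t swp_type(uint64_t entry)
{
    return entry >> (64 - SWP_TYPE_BITS);
}

/* Shift the type out at the top, invert, then shift the low bits away. */
uint64_t swp_offset(uint64_t entry)
{
    return (~entry << SWP_TYPE_BITS) >> SWP_OFFSET_SHIFT;
}

/* What clearing the uffd-wp bit does to the raw value: clear bit 10. */
uint64_t clear_uffd_wp_bit(uint64_t entry)
{
    return entry & ~(UINT64_C(1) << PAGE_BIT_UFFD_WP);
}
```

For a huge page head the pfn is 512-aligned, so offset bit 1 is 0 and is stored as a 1 in bit 10 of the entry. Encoding pfn 0x2468a00, clearing bit 10 and decoding again yields 0x2468a02 in this model: the pfn has moved by 2 and the type is unchanged.
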
> I cannot explain why Igor never saw this crash on 5.12.12: maybe
> something else in the environment changed around that time.  And it
> will take several days for it to be confirmed as the fix in practice.
>
> But I'm confident that 8f34f1eac382 will prove to be the fix, so Peter
> please prepare some backports of that for the various stable/longterm
> kernels that need it - I've not looked into whether it applies cleanly,
> or depends on other commits too.  You fixed several related but different
> things in that commit: but this one is the worst, because it can corrupt
> even those who are not using UFFD_WP at all.

Looks right to me, b569a1760782 ("userfaultfd: wp: drop _PAGE_UFFD_WP properly
when fork", 2020-04-07) seems to be the culprit.  I didn't notice the side
effect in the bug or in the fix, or it should have already landed in stable. I am
very sorry for such a basic bug that caused this fallout - I really can't
tell why I completely didn't look at is_swap_pte(), which is so obvious indeed.

I checked: 5.6.y doesn't have the offending commit, as it was not marked as
"fixes". It started to show up in 5.7.y~5.13.y. 5.14-rc1 has 8f34f1eac382, which
is the fix.  So I think we need the fix or an equivalent fix for 5.7.y~5.13.y.

5.12.y & 5.13.y can pick up the fix 8f34f1eac382 cleanly.  The older ones
(5.7.y~5.11.y) can't.  I plan to revert b569a1760782 there instead.

FTR, even though 8f34f1eac382 applies cleanly it does not compile.
The 1st patch of that series is also required (5fc7a5f6fd04) - it removes use of
*vma, which is later removed by the patch that fixes the actual problem.

>
> Many thanks for reporting and helping, Igor.
> Hugh
>
> p.s. Peter, unrelated to this particular bug, and should not divert from
> fixing it: but looking again at those swap encodings, and particularly
> the soft_dirty manipulations: they look very fragile. I think uffd_wp
> was wrong to follow that bad example, and your upcoming new encoding
> (that I have previously called elegant) takes it a worse step further.
>
> I think we should change to a rule where the architecture-independent
> swp_entry_t contains *all* the info, including bits for soft_dirty and
> uffd_wp, so that swap entry cases can move immediately to decoding from
> arch-dependent pte to arch-independent swp_entry_t, and do all the
> manipulations on that. But I don't have time to make that change, and
> probably neither do you, and making the change is liable to introduce
> errors itself. So, no immediate plans, but please keep in mind.

Curious: did we previously encounter a similar issue, where the soft dirty bit
was applied wrongly and so caused hard-to-debug issues?

If this is destined to be the best solution, I can work on both of them.  I am
just worried that it's too big a change, as you said, so we don't know what's the
most efficient considering the total time we'd use to develop, review and debug them.

The other alternative is that we fix bugs; I know that's a cheap thing to say,
however we still can't rule it out as an option yet.

We can definitely discuss this outside of this thread, and I'll prepare the backport
first.  In any case, this bug definitely raises an alert, and I'll keep
that in mind.

Please let me know if there's any comment on the backport plan above, or I'll
prepare the patches for all the branches before tomorrow.

Thanks,

--
Peter Xu



--
Igor Raits
Sr. SW Engineer
igor@gooddata.com
+420 775 117 817
Moravske namesti 1007/14
602 00 Brno-Veveri, Czech Republic