From mboxrd@z Thu Jan 1 00:00:00 1970
MIME-Version: 1.0
References: <4c9e24db-29d5-5bbb-17ae-8dc32ceb66ed@google.com>
From: Igor Raits
Date: Thu, 15 Jul 2021 19:47:06 +0200
Subject: Re: kernel BUG at include/linux/swapops.h:204!
To: Hugh Dickins
Cc: linux-mm@kvack.org, Andrew Morton, Hillf Danton
Content-Type: multipart/alternative; boundary="000000000000ea698a05c72d0f9e"

--000000000000ea698a05c72d0f9e
Content-Type: text/plain; charset="UTF-8"

Hi everyone again,

I've been trying to reproduce this issue but still can't find a consistent pattern.

However, it did happen once more, this time on 5.13.1:

[  222.068216] ------------[ cut here ]------------
[  222.072884] kernel BUG at include/linux/swapops.h:204!
[  222.078062] invalid opcode: 0000 [#1] SMP NOPTI
[  222.082618] CPU: 38 PID: 9828 Comm: rpc-worker Tainted: G            E     5.13.1-1.gdc.el8.x86_64 #1
[  222.091894] Hardware name: HPE ProLiant DL380 Gen10/ProLiant DL380 Gen10, BIOS U30 05/24/2021
[  222.100468] RIP: 0010:pmd_migration_entry_wait+0x132/0x140
[  222.105994] Code: 02 00 00 00 5b 4c 89 c7 5d e9 ca c5 f6 ff 48 81 e2 00 f0 ff ff 48 f7 d2 48 21 c2 89 d1 f7 c2 81 01 00 00 75 80 e9 44 ff ff ff <0f> 0b 48 8b 2d 65 48 30 01 e9 ef fe ff ff 0f 1f 44 00 00 41 55 48
[  222.124878] RSP: 0000:ffffbcfe9eb7bdd8 EFLAGS: 00010246
[  222.130134] RAX: 0057ffffc0000000 RBX: ffff9eec4b3cdbf8 RCX: ffffffffffffffff
[  222.137309] RDX: 0000000000000000 RSI: ffff9eec4b3cdbf8 RDI: ffffdf55c52cf368
[  222.144485] RBP: ffffdf55c52cf368 R08: ffffdf57428d8080 R09: 0000000000000000
[  222.151661] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000bf8
[  222.158837] R13: 0400000000000000 R14: 0400000000000080 R15: ffff9eec2825b1f8
[  222.166015] FS:  00007f6754aeb700(0000) GS:ffff9f49bfd00000(0000) knlGS:0000000000000000
[  222.174153] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  222.179932] CR2: 00007f676ffffd98 CR3: 000000012bf6a002 CR4: 00000000007726e0
[  222.187109] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  222.194283] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  222.201457] PKRU: 55555554
[  222.204178] Call Trace:
[  222.206638]  __handle_mm_fault+0x5ad/0x6e0
[  222.210760]  ? sysvec_call_function_single+0xb/0x90
[  222.215672]  handle_mm_fault+0xc5/0x290
[  222.219529]  do_user_addr_fault+0x1a9/0x660
[  222.223740]  ? sched_clock_cpu+0xc/0xa0
[  222.227602]  exc_page_fault+0x68/0x130
[  222.231373]  ? asm_exc_page_fault+0x8/0x30
[  222.235495]  asm_exc_page_fault+0x1e/0x30
[  222.239526] RIP: 0033:0x7f67baaed734
[  222.243120] Code: 89 08 48 8b 35 dd 3b 21 00 4c 8d 0d d6 3b 21 00 31 c0 4c 39 ce 74 73 0f 1f 80 00 00 00 00 48 8d 96 40 fd ff ff 49 39 d2 74 22 <48> 8b 96 d8 03 00 00 48 01 15 4e 7c 21 00 80 be 50 03 00 00 00 c7
[  222.262002] RSP: 002b:00007f6754aea298 EFLAGS: 00010287
[  222.267257] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
[  222.274432] RDX: 00007f676ffff700 RSI: 00007f676ffff9c0 RDI: 00007f676f7fec10
[  222.281609] RBP: 0000000000000001 R08: 00007f676f7fed10 R09: 00007f67bad012f0
[  222.288785] R10: 00007f6754aeb700 R11: 0000000000000202 R12: 0000000000000001
[  222.295961] R13: 0000000000000006 R14: 0000000000000e28 R15: 00007f674006e1f0
[  222.303137] Modules linked in: vhost_net(E) vhost(E) vhost_iotlb(E) tap(E) tun(E) tcp_diag(E) udp_diag(E) inet_diag(E) netconsole(E) nf_tables(E) vxlan(E) ip6_udp_tunnel(E) udp_tunnel(E) nfnetlink(E) binfmt_misc(E) iscsi_tcp(E) libiscsi_tcp(E) 8021q(E) garp(E) mrp(E) bonding(E) tls(E) vfat(E) fat(E) dm_service_time(E) dm_multipath(E) rpcrdma(E) sunrpc(E) rdma_ucm(E) ib_srpt(E) ib_isert(E) iscsi_target_mod(E) target_core_mod(E) ib_iser(E) rdma_cm(E) iw_cm(E) ib_cm(E) libiscsi(E) intel_rapl_msr(E) intel_rapl_common(E) scsi_transport_iscsi(E) isst_if_common(E) ipmi_ssif(E) nfit(E) libnvdimm(E) x86_pkg_temp_thermal(E) intel_powerclamp(E) coretemp(E) kvm_intel(E) kvm(E) irqbypass(E) crct10dif_pclmul(E) crc32_pclmul(E) ghash_clmulni_intel(E) rapl(E) qedr(E) mei_me(E) acpi_ipmi(E) ib_uverbs(E) intel_cstate(E) ipmi_si(E) ib_core(E) ipmi_devintf(E) dm_mod(E) ioatdma(E) ses(E) intel_uncore(E) pcspkr(E) enclosure(E) mei(E) hpwdt(E) hpilo(E) lpc_ich(E) intel_pch_thermal(E) dca(E) ipmi_msghandler(E)
[  222.303181]  acpi_power_meter(E) ext4(E) mbcache(E) jbd2(E) sd_mod(E) t10_pi(E) sg(E) qedf(E) qede(E) libfcoe(E) qed(E) libfc(E) smartpqi(E) scsi_transport_fc(E) tg3(E) scsi_transport_sas(E) crc8(E) wmi(E) nf_conntrack(E) libcrc32c(E) crc32c_intel(E) nf_defrag_ipv6(E) nf_defrag_ipv4(E) br_netfilter(E) bridge(E) stp(E) llc(E)
[  222.420050] ---[ end trace bcf7b6d1610cc21f ]---
[  222.572925] RIP: 0010:pmd_migration_entry_wait+0x132/0x140
[  222.578469] Code: 02 00 00 00 5b 4c 89 c7 5d e9 ca c5 f6 ff 48 81 e2 00 f0 ff ff 48 f7 d2 48 21 c2 89 d1 f7 c2 81 01 00 00 75 80 e9 44 ff ff ff <0f> 0b 48 8b 2d 65 48 30 01 e9 ef fe ff ff 0f 1f 44 00 00 41 55 48
[  222.597359] RSP: 0000:ffffbcfe9eb7bdd8 EFLAGS: 00010246
[  222.602620] RAX: 0057ffffc0000000 RBX: ffff9eec4b3cdbf8 RCX: ffffffffffffffff
[  222.609807] RDX: 0000000000000000 RSI: ffff9eec4b3cdbf8 RDI: ffffdf55c52cf368
[  222.616990] RBP: ffffdf55c52cf368 R08: ffffdf57428d8080 R09: 0000000000000000
[  222.624177] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000bf8
[  222.631361] R13: 0400000000000000 R14: 0400000000000080 R15: ffff9eec2825b1f8
[  222.638548] FS:  00007f6754aeb700(0000) GS:ffff9f49bfd00000(0000) knlGS:0000000000000000
[  222.646694] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  222.652481] CR2: 00007f676ffffd98 CR3: 000000012bf6a002 CR4: 00000000007726e0
[  222.659665] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  222.666850] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  222.674031] PKRU: 55555554
[  222.676758] Kernel panic - not syncing: Fatal exception
[  222.817538] Kernel Offset: 0x16000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[  222.965540] ---[ end Kernel panic - not syncing: Fatal exception ]---

On Sun, Jul 11, 2021 at 8:06 AM Igor Raits wrote:

> Hi Hugh,
>
> On Sun, Jul 11, 2021 at 6:17 AM Hugh Dickins wrote:
>
>> On Sat, 10 Jul 2021, Igor Raits wrote:
>>
>> > Hello,
>> >
>> > I've seen one weird bug on 5.12.14 that happened a couple of times when I
>> > started a bunch of VMs on a server.
>>
>> Would it be possible for you to try the same on a 5.12.13 kernel?
>> Perhaps by reverting the diff between 5.12.13 and 5.12.14 temporarily.
>> Enough to form an impression of whether the issue is new in 5.12.14.
>>
>
> We've been using 5.12.12 for quite some time (~ a month) and I never saw
> it there.
>
> But I have to admit that I don't really have a reproducer. For example, on
> the servers where it happened, I just rebooted them and the panic did not
> happen again (so I saw it only once, and only on 2 of the 32 servers that
> we have on 5.12.14).
>
>> I ask because 5.12.14 did include several fixes and cleanups from me
>> to page_vma_mapped_walk(), and that is involved in inserting and
>> removing pmd migration entries.  I am not aware of introducing any
>> bug there, but your report has got me worried.  If it's happening in
>> 5.12.14 but not in 5.12.13, then I must look again at my changes.
>>
>> I don't expect Hillf's patch to help at all: the pmd_lock()
>> is supposed to be taken by page_vma_mapped_walk(), before
>> set_pmd_migration_entry() and remove_migration_pmd() are called.
>>
>> Thanks,
>> Hugh
>>
>> >
>> > I've briefly googled this problem but could not find any relevant commit
>> > that would fix this issue.
>> >
>> > Do you have any hints on how to debug this further, or do you know of a
>> > fix by any chance?
>> >
>> > Thanks in advance. Stack trace follows:
>> >
>> > [  376.876610] ------------[ cut here ]------------
>> > [  376.881274] kernel BUG at include/linux/swapops.h:204!
>> > [  376.886455] invalid opcode: 0000 [#1] SMP NOPTI
>> > [  376.891014] CPU: 40 PID: 11775 Comm: rpc-worker Tainted: G            E     5.12.14-1.gdc.el8.x86_64 #1
>> > [  376.900464] Hardware name: HPE ProLiant DL380 Gen10/ProLiant DL380 Gen10, BIOS U30 05/24/2021
>> > [  376.909038] RIP: 0010:pmd_migration_entry_wait+0x132/0x140
>> > [  376.914562] Code: 02 00 00 00 5b 4c 89 c7 5d e9 8a e4 f6 ff 48 81 e2 00 f0 ff ff 48 f7 d2 48 21 c2 89 d1 f7 c2 81 01 00 00 75 80 e9 44 ff ff ff <0f> 0b 48 8b 2d 75 bd 30 01 e9 ef fe ff ff 0f 1f 44 00 00 41 55 48
>> > [  376.933443] RSP: 0000:ffffb65a5e1cfdc8 EFLAGS: 00010246
>> > [  376.938701] RAX: 0017ffffc0000000 RBX: ffff908b8ecabaf8 RCX: ffffffffffffffff
>> > [  376.945878] RDX: 0000000000000000 RSI: ffff908b8ecabaf8 RDI: fffff497473b2ae8
>> > [  376.953055] RBP: fffff497473b2ae8 R08: fffff49747fa8080 R09: 0000000000000000
>> > [  376.960230] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000af8
>> > [  376.967407] R13: 0400000000000000 R14: 0400000000000080 R15: ffff908bbef7b6a8
>> > [  376.974582] FS:  00007f5bb1f81700(0000) GS:ffff90e87fd80000(0000) knlGS:0000000000000000
>> > [  376.982718] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> > [  376.988497] CR2: 00007f5b2bfffd98 CR3: 00000001f793e006 CR4: 00000000007726e0
>> > [  376.995673] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>> > [  377.002849] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>> > [  377.010026] PKRU: 55555554
>> > [  377.012745] Call Trace:
>> > [  377.015207]  __handle_mm_fault+0x5ad/0x6e0
>> > [  377.019335]  handle_mm_fault+0xc5/0x290
>> > [  377.023194]  do_user_addr_fault+0x1cd/0x740
>> > [  377.027406]  exc_page_fault+0x54/0x110
>> > [  377.031182]  ? asm_exc_page_fault+0x8/0x30
>> > [  377.035307]  asm_exc_page_fault+0x1e/0x30
>> > [  377.039340] RIP: 0033:0x7f5bb91d6734
>> > [  377.042937] Code: 89 08 48 8b 35 dd 3b 21 00 4c 8d 0d d6 3b 21 00 31 c0 4c 39 ce 74 73 0f 1f 80 00 00 00 00 48 8d 96 40 fd ff ff 49 39 d2 74 22 <48> 8b 96 d8 03 00 00 48 01 15 4e 7c 21 00 80 be 50 03 00 00 00 c7
>> > [  377.061820] RSP: 002b:00007f5bb1f7ff58 EFLAGS: 00010206
>> > [  377.067076] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 00007f5ba0000020
>> > [  377.074255] RDX: 00007f5b2bfff700 RSI: 00007f5b2bfff9c0 RDI: 0000000000000001
>> > [  377.081429] RBP: 0000000000000001 R08: 0000000000000000 R09: 00007f5bb93ea2f0
>> > [  377.088606] R10: 00007f5bb1f81700 R11: 0000000000000202 R12: 0000000000000001
>> > [  377.095782] R13: 0000000000000006 R14: 0000000000000cb4 R15: 00007f5bb1f801f0
>> > [  377.102958] Modules linked in: ebt_arp(E) nft_meta_bridge(E) ip6_tables(E) xt_CT(E) nf_log_ipv4(E) nf_log_common(E) nft_limit(E) nft_counter(E) xt_LOG(E) xt_limit(E) xt_mac(E) xt_set(E) xt_multiport(E) xt_state(E) xt_conntrack(E) xt_comment(E) xt_physdev(E) nft_compat(E) ip_set_hash_net(E) ip_set(E) vhost_net(E) vhost(E) vhost_iotlb(E) tap(E) tun(E) tcp_diag(E) udp_diag(E) inet_diag(E) netconsole(E) nf_tables(E) vxlan(E) ip6_udp_tunnel(E) udp_tunnel(E) nfnetlink(E) binfmt_misc(E) iscsi_tcp(E) libiscsi_tcp(E) 8021q(E) garp(E) mrp(E) bonding(E) tls(E) vfat(E) fat(E) dm_service_time(E) dm_multipath(E) rpcrdma(E) sunrpc(E) rdma_ucm(E) ib_srpt(E) ib_isert(E) iscsi_target_mod(E) target_core_mod(E) ib_iser(E) rdma_cm(E) iw_cm(E) ib_cm(E) libiscsi(E) scsi_transport_iscsi(E) intel_rapl_msr(E) qedr(E) intel_rapl_common(E) ib_uverbs(E) isst_if_common(E) ib_core(E) nfit(E) libnvdimm(E) x86_pkg_temp_thermal(E) intel_powerclamp(E) coretemp(E) kvm_intel(E) kvm(E) irqbypass(E) crct10dif_pclmul(E)
>> > [  377.102999]  crc32_pclmul(E) ghash_clmulni_intel(E) rapl(E) intel_cstate(E) ipmi_ssif(E) acpi_ipmi(E) ipmi_si(E) mei_me(E) ioatdma(E) ipmi_devintf(E) dm_mod(E) ses(E) intel_uncore(E) pcspkr(E) qede(E) enclosure(E) tg3(E) mei(E) lpc_ich(E) hpilo(E) hpwdt(E) intel_pch_thermal(E) dca(E) ipmi_msghandler(E) acpi_power_meter(E) ext4(E) mbcache(E) jbd2(E) sd_mod(E) t10_pi(E) sg(E) qedf(E) qed(E) crc8(E) libfcoe(E) libfc(E) smartpqi(E) scsi_transport_fc(E) scsi_transport_sas(E) wmi(E) nf_conntrack(E) nf_defrag_ipv6(E) libcrc32c(E) crc32c_intel(E) nf_defrag_ipv4(E) br_netfilter(E) bridge(E) stp(E) llc(E)
>> > [  377.243468] ---[ end trace 04bce3bb051f7620 ]---
>> > [  377.385645] RIP: 0010:pmd_migration_entry_wait+0x132/0x140
>> > [  377.391194] Code: 02 00 00 00 5b 4c 89 c7 5d e9 8a e4 f6 ff 48 81 e2 00 f0 ff ff 48 f7 d2 48 21 c2 89 d1 f7 c2 81 01 00 00 75 80 e9 44 ff ff ff <0f> 0b 48 8b 2d 75 bd 30 01 e9 ef fe ff ff 0f 1f 44 00 00 41 55 48
>> > [  377.410091] RSP: 0000:ffffb65a5e1cfdc8 EFLAGS: 00010246
>> > [  377.415355] RAX: 0017ffffc0000000 RBX: ffff908b8ecabaf8 RCX: ffffffffffffffff
>> > [  377.422540] RDX: 0000000000000000 RSI: ffff908b8ecabaf8 RDI: fffff497473b2ae8
>> > [  377.429721] RBP: fffff497473b2ae8 R08: fffff49747fa8080 R09: 0000000000000000
>> > [  377.436902] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000af8
>> > [  377.444086] R13: 0400000000000000 R14: 0400000000000080 R15: ffff908bbef7b6a8
>> > [  377.451272] FS:  00007f5bb1f81700(0000) GS:ffff90e87fd80000(0000) knlGS:0000000000000000
>> > [  377.459415] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> > [  377.465196] CR2: 00007f5b2bfffd98 CR3: 00000001f793e006 CR4: 00000000007726e0
>> > [  377.472377] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>> > [  377.479556] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>> > [  377.486738] PKRU: 55555554
>> > [  377.489465] Kernel panic - not syncing: Fatal exception
>> > [  377.573911] Kernel Offset: 0xa000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
>> > [  377.716482] ---[ end Kernel panic - not syncing: Fatal exception ]---
>> >
>
> --
> Igor Raits
> Sr. SW Engineer
> igor@gooddata.com
> +420 775 117 817
> Moravske namesti 1007/14
> 602 00 Brno-Veveri, Czech Republic

--000000000000ea698a05c72d0f9e--
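Hugh's remark that pmd_lock() is supposed to be taken by page_vma_mapped_walk() before set_pmd_migration_entry() and remove_migration_pmd() are called describes a lock-ordering invariant: the migration entry is installed and removed only under the PMD lock, and the fault path that appears in every trace (pmd_migration_entry_wait()) inspects it under that same lock. The toy C model below is a minimal single-threaded sketch of that invariant, not kernel code. It assumes (the traces do not show the source line) that the BUG at include/linux/swapops.h:204 asserts that the page referenced by a migration entry is locked. The names page_model, pmd_slot, set_migration_entry_model(), remove_migration_entry_model() and fault_sees_consistent_entry() are hypothetical, and a pthread mutex stands in for the spinlock that pmd_lock() returns.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/*
 * Hypothetical toy model, not kernel code. Assumption: the BUG at
 * include/linux/swapops.h:204 asserts that a page referenced by a
 * migration entry is locked. A pthread mutex stands in for the PMD
 * spinlock returned by pmd_lock().
 */
struct page_model {
	bool locked;                /* stands in for PageLocked() */
};

enum slot_state { SLOT_PRESENT, SLOT_MIGRATION };

struct pmd_slot {
	pthread_mutex_t ptl;        /* stands in for the PMD page-table lock */
	enum slot_state state;      /* present mapping or migration entry */
	struct page_model *page;    /* page mapped by, or referenced from, the entry */
};

/* Migrator: swap the present mapping for a migration entry under the lock.
 * The caller must already hold the page lock for the whole migration. */
static void set_migration_entry_model(struct pmd_slot *pmd)
{
	pthread_mutex_lock(&pmd->ptl);
	assert(pmd->page->locked);
	pmd->state = SLOT_MIGRATION;
	pthread_mutex_unlock(&pmd->ptl);
}

/* Migrator: replace the migration entry with a mapping of the new page,
 * again under the lock, before the old page is unlocked. */
static void remove_migration_entry_model(struct pmd_slot *pmd,
					 struct page_model *newpage)
{
	pthread_mutex_lock(&pmd->ptl);
	pmd->state = SLOT_PRESENT;
	pmd->page = newpage;
	pthread_mutex_unlock(&pmd->ptl);
}

/* Fault path: inspect the entry under the same lock. Returns false where
 * an assertion like the one assumed above would fire. */
static bool fault_sees_consistent_entry(struct pmd_slot *pmd)
{
	bool ok = true;

	pthread_mutex_lock(&pmd->ptl);
	if (pmd->state == SLOT_MIGRATION)
		ok = pmd->page->locked;
	pthread_mutex_unlock(&pmd->ptl);
	return ok;
}
```

For example, with the page locked, set_migration_entry_model() followed by fault_sees_consistent_entry() returns true; clearing locked before remove_migration_entry_model() makes the same check return false. The model is deliberately single-threaded so that both orderings are deterministic; it illustrates the invariant Hugh describes and does not locate the bug behind these traces.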