From: Tao Liu <ltao@redhat.com>
To: Gal Pressman <gal.pressman@linux.dev>
Cc: mrgolin@amazon.com, sleybo@amazon.com, jgg@ziepe.ca, leon@kernel.org,
    kexec@lists.infradead.org, linux-kernel@vger.kernel.org,
    linux-rdma@vger.kernel.org
Subject: Re: Implementing .shutdown method for efa module
Date: Tue, 26 Mar 2024 09:38:33 +0800
Message-ID: <CAO7dBbXLU5teiYm8VvES7e7m7dUzJQYV9HHLOFKperjwq-NJeA@mail.gmail.com>
In-Reply-To: <5d81d6d0-5afc-4d0e-8d2b-445d48921511@linux.dev>

Hi Gal,

On Mon, Mar 25, 2024 at 4:06 PM Gal Pressman <gal.pressman@linux.dev> wrote:
>
> On 25/03/2024 4:10, Tao Liu wrote:
> > Hi,
> >
> > Recently I experienced a kernel panic which is related to efa module
> > when testing kexec -l && kexec -e to switch to a new kernel on AWS
> > i4g.16xlarge instance.
> >
> > Here is the dmesg log:
> >
> > [ 6.379918] systemd[1]: Mounting FUSE Control File System...
> > [ 6.381984] systemd[1]: Mounting Kernel Configuration File System...
> > [ 6.383918] systemd[1]: Starting Apply Kernel Variables...
> > [ 6.385430] systemd[1]: Started Journal Service.
> > [ 6.394221] ACPI: bus type drm_connector registered
> > [ 6.421408] systemd-journald[1263]: Received client request to
> > flush runtime journal.
> > [ 7.262543] efa 0000:00:1b.0: enabling device (0010 -> 0012)
> > [ 7.432420] efa 0000:00:1b.0: Setup irq:191 name:efa-mgmnt@pci:0000:00:1b.0
> > [ 7.435581] efa 0000:00:1b.0 efa_0: IB device registered
> > [ 7.885564] random: crng init done
> > [ 8.139857] XFS (nvme0n1p2): Mounting V5 Filesystem
> > d7003ecc-db6f-4bfb-bf92-60376b6a6563
> > [ 8.265233] XFS (nvme0n1p2): Ending clean mount
> > [ 10.555612] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
> >
> > Red Hat Enterprise Linux 9.4 Beta (Plow)
> > Kernel 5.14.0-425.el9.aarch64 on an aarch64
> >
> > ip-10-0-27-226 login: [ 29.940381] kexec_core: Starting new kernel
> > [ 30.079279] psci: CPU1 killed (polled 0 ms)
> > [ 30.119222] psci: CPU2 killed (polled 0 ms)
> > [ 30.199293] psci: CPU3 killed (polled 0 ms)
> > [ 30.309214] psci: CPU4 killed (polled 0 ms)
> > [ 30.379221] psci: CPU5 killed (polled 0 ms)
> > [ 30.419210] psci: CPU6 killed (polled 0 ms)
> > [ 30.489207] IRQ 191: no longer affine to CPU7
> > [ 30.489667] psci: CPU7 killed (polled 0 ms)
> > ..snip...
> > [ 33.849123] psci: CPU63 killed (polled 0 ms)
> > [ 33.849943] Bye!
> > [ 0.000000] Booting Linux on physical CPU 0x0000000000 [0x413fd0c1]
> > [ 0.000000] Linux version 5.14.0-417.el9.aarch64
> > (mockbuild@arm64-025.build.eng.bos.redhat.com) (gcc (GCC) 11.4.1
> > 20231218 (Red Hat 11.4.1-3), GNU ld version 2.35.2-42.el9) #1 SMP
> > PREEMPT_DYNAMIC Thu Feb 1 21:23:03 EST 2024
> > ...snip...
> > [ 1.012692] Freeing unused kernel memory: 6016K
> > [ 2.370947] Checked W+X mappings: passed, no W+X pages found
> > [ 2.370980] Run /init as init process
> > [ 2.370982] with arguments:
> > [ 2.370983] /init
> > [ 2.370984] with environment:
> > [ 2.370984] HOME=/
> > [ 2.370985] TERM=linux
> > [ 2.373257] Kernel panic - not syncing: Attempted to kill init!
> > exitcode=0x0000000b
> > [ 2.373259] CPU: 1 PID: 1 Comm: init Not tainted 5.14.0-417.el9.aarch64 #1
> > [ 2.382240] Hardware name: Amazon EC2 i4g.16xlarge/, BIOS 1.0 11/1/2018
> > [ 2.383814] Call trace:
> > [ 2.384410] dump_backtrace+0xa8/0x120
> > [ 2.385318] show_stack+0x1c/0x30
> > [ 2.386124] dump_stack_lvl+0x74/0x8c
> > [ 2.387011] dump_stack+0x14/0x24
> > [ 2.387810] panic+0x158/0x368
> > [ 2.388553] do_exit+0x3a8/0x3b0
> > [ 2.389333] do_group_exit+0x38/0xa4
> > [ 2.390195] get_signal+0x7a4/0x810
> > [ 2.391044] do_signal+0x1bc/0x260
> > [ 2.391870] do_notify_resume+0x108/0x210
> > [ 2.392839] el0_da+0x154/0x160
> > [ 2.393603] el0t_64_sync_handler+0xdc/0x150
> > [ 2.394628] el0t_64_sync+0x17c/0x180
> > [ 2.395513] SMP: stopping secondary CPUs
> > [ 2.396483] Kernel Offset: 0x586f04e00000 from 0xffff800008000000
> > [ 2.397934] PHYS_OFFSET: 0x40000000
> > [ 2.398774] CPU features: 0x0,00000101,70020143,10417a0b
> > [ 2.400042] Memory Limit: none
> > [ 2.400783] ---[ end Kernel panic - not syncing: Attempted to kill
> > init! exitcode=0x0000000b ]---
> >
> > In the dmesg log, I found "[ 30.489207] IRQ 191: no longer affine to
> > CPU7" is suspicious, which is related to efa module. After blacklist
> > efa module from automatic loading when bootup, the kernel panic issue
> > doesn't appear again.
> >
> > It looks to me it is due to the efa being not properly shutdown during
> > kexec, so the ongoing DMA/interrupts etc overwrite the memory range.
> >
> > Though the issue is reproduced on rhel's kernel, the upstream kernel
> > [1] doesn't have the .shutdown method implemented either. Since I'm
> > not very familiar with the efa driver, could you please implement the
> > .shutdown method in drivers/infiniband/hw/efa/efa_main.c? Thanks in
> > advance!
>
> Did you try to reproduce it on upstream kernel?
>

Thanks for your comments! No I haven't, I will give it a try.

> >
> > [1]: https://github.com/torvalds/linux/blob/master/drivers/infiniband/hw/efa/efa_main.c#L674
> >
> > Thanks,
> > Tao Liu
> >
>
> Try assigning efa_remove as the shutdown callback:
> .shutdown = efa_remove,
>
> Does it fix it?

Thanks, I will also try the code, and I will post the testing results.

Thanks,
Tao Liu
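For readers following the suggestion above, here is a minimal userspace sketch (not kernel code) of what wiring the existing remove handler into the shutdown slot would do. All mock_* names are hypothetical stand-ins invented for illustration. The only facts assumed from the kernel are that struct pci_driver has .remove and .shutdown callbacks with the same signature, and that the device shutdown path used by reboot and kexec calls .shutdown and skips drivers that leave it unset. Whether reusing efa_remove this way is safe or sufficient for efa is exactly what the thread asks to be tested.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct pci_dev: tracks only whether the
 * device could still raise interrupts or issue DMA. */
struct mock_pci_dev {
    bool irq_enabled;
    bool dma_active;
};

/* Hypothetical stand-in for struct pci_driver. In the kernel, .remove
 * and .shutdown both take a single struct pci_dev pointer. */
struct mock_pci_driver {
    const char *name;
    void (*remove)(struct mock_pci_dev *dev);
    void (*shutdown)(struct mock_pci_dev *dev);
};

/* Stand-in for efa_remove(): quiesce the device. The real function
 * does far more; this only models the observable effect. */
static void mock_efa_remove(struct mock_pci_dev *dev)
{
    assert(dev != NULL);
    dev->irq_enabled = false;
    dev->dma_active = false;
}

static struct mock_pci_driver mock_efa_driver = {
    .name     = "efa",
    .remove   = mock_efa_remove,
    .shutdown = mock_efa_remove, /* the one-line change suggested above */
};

/* Models the reboot/kexec shutdown path: it calls .shutdown, never
 * .remove, and does nothing for drivers that leave .shutdown NULL. */
static void mock_device_shutdown(struct mock_pci_driver *drv,
                                 struct mock_pci_dev *dev)
{
    if (drv->shutdown)
        drv->shutdown(dev);
}
```

In this model, a driver with .shutdown left NULL hands the device to the next kernel with interrupts and DMA still live, which matches the failure mode suspected in the report; with .shutdown pointing at the remove handler, mock_device_shutdown() leaves the device quiesced.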