From: Pavel Tatashin <firstname.lastname@example.org>
To: email@example.com, firstname.lastname@example.org,
    Linus Torvalds <email@example.com>,
    Thomas Gleixner <firstname.lastname@example.org>,
    Andrew Morton <email@example.com>,
    Sasha Levin <firstname.lastname@example.org>,
    LKML <email@example.com>,
    James Morse <firstname.lastname@example.org>,
    Will Deacon <email@example.com>,
    firstname.lastname@example.org,
    Michael Ellerman <email@example.com>,
    AKASHI Takahiro <firstname.lastname@example.org>,
    Dan Williams <email@example.com>,
    linux-mm <firstname.lastname@example.org>,
    Tyler Hicks <email@example.com>
Subject: improving crash dump discussion
Date: Wed, 10 Feb 2021 16:48:51 -0500
Message-ID: <CA+CK2bC7Qe-EDX8mZ_OvfH+9rfiYHCGK++znivu+SKvi8HGpkg@mail.gmail.com>

I would like to start a discussion about how we can improve the Linux crash dump facility, and use warm reboot / firmware assistance in order to collect crash dumps more reliably while using fewer memory resources and being more performant.

Currently, the main way to collect crash dumps on Linux is kdump. Kdump relies on kexec, which is mature and portable (it does not depend on firmware), but using kexec is not ideal. I will list some problems with kexec/kdump, and then discuss how some of them (hopefully most) can be addressed.

1. Expecting a crashing kernel to do the right thing: properly quiesce devices and CPUs, and prepare the machine for the new kernel.

The amount of code that is executed to perform a crash kexec reboot is not trivial. Unfortunately, since we are panicking, we have already lost control at some point, so the goal should be to reduce the amount of code executed by the panic handler in order to collect dumps reliably. There are some ways to improve the reliability of crash kexec reboot.
For example, passing the maxcpus=1 kernel parameter is now required on almost all platforms. Unfortunately, this forces the crash kernel to use only a single thread to save the core, so "makedumpfile --num-threads" is useless when run from the crash kernel.

2. Unlike booting from firmware, the PCI, CPUs, interrupt controllers, DMA mappings, and I/O devices are not reinitialized and might not be in a consistent state.

The reset_devices, irqpoll, and other kernel parameters are intended to mitigate these shortfalls by requiring drivers to do the resetting themselves. The kernel is also usually smart enough to ignore spurious interrupts, but this is fragile.

3. There is a blackout window during boot where collecting a crash dump is not possible.

With current kdump it is only possible to collect crashes that occur after the kernel's early boot has finished. During early boot we do a lot: determine the platform, initialize mm, initialize the clock and the scheduler, and start the other CPUs. Only after entering user mode are we able to kexec-load the crash kernel into memory, after which crashes can be collected.

4. Kdump is not compatible with hardware watchdog resets

When a hardware watchdog causes a reset, software is not involved, and therefore we lose the entire machine state.

5. Crash kernel requires memory reservation

The crash kernel can't use the memory that was used by the crashing kernel. Memory must therefore always be reserved for it; that memory is wasted during normal operation and only contains the image of the crash kernel.

6. Crash kernel requires special image and two reboots

A special crash image is usually required to reduce the number of loaded modules, and to reduce the system to the bare minimum so that it can be booted in the small reserved space. Also, after the crash kernel collects the core dump, we reboot back to the normal kernel, so two reboots are needed in order to recover after the crash.
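To make problems 3 and 5 concrete, here is a minimal Python sketch (not part of any existing tool) that reports whether a crash kernel is currently loaded and how much memory is reserved for it. It assumes the sysfs files /sys/kernel/kexec_crash_loaded and /sys/kernel/kexec_crash_size and the crashkernel= parameter in /proc/cmdline; the function names parse_cmdline and kdump_status are made up for illustration, and the command-line parsing is deliberately naive (no quoted values).

```python
from pathlib import Path


def parse_cmdline(cmdline):
    """Split a kernel command line into {name: value}; bare flags map to True.

    Naive on purpose: splits on whitespace and ignores quoting.
    """
    params = {}
    for token in cmdline.split():
        name, sep, value = token.partition("=")
        params[name] = value if sep else True
    return params


def _read(path):
    """Return the stripped file content, or None if it cannot be read."""
    try:
        return Path(path).read_text().strip()
    except OSError:
        return None


def kdump_status(sys_kernel="/sys/kernel", proc_cmdline="/proc/cmdline"):
    """Summarize the kdump state of the running kernel.

    The two path arguments exist only so the function can be pointed at a
    fake directory tree; they are not an interface of any real tool.
    """
    loaded = _read(Path(sys_kernel) / "kexec_crash_loaded")
    size = _read(Path(sys_kernel) / "kexec_crash_size")
    params = parse_cmdline(_read(proc_cmdline) or "")
    return {
        # "1" only once user space has kexec-loaded the crash kernel,
        # i.e. after the blackout window described in problem 3.
        "crash_kernel_loaded": loaded == "1",
        # Bytes set aside for the crash kernel (problem 5).
        "reserved_bytes": int(size) if size and size.isdigit() else 0,
        "crashkernel_param": params.get("crashkernel"),
    }


if __name__ == "__main__":
    print(kdump_status())
```

Until crash_kernel_loaded becomes true, a panic cannot be captured by kdump, even though reserved_bytes of memory are already unavailable to the running system.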
==========================================================================

On the other hand, powerpc can optionally use firmware assisted kdump (fadump). The benefits of fadump:

1. The reboot happens through firmware, and thus all devices are reset to their initial state.
2. Memory for the crash kernel does not need to be reserved if CMA is used and user pages do not need to be preserved (commonly there is no need to preserve user pages to debug kernel panics).
3. The fadump crash format is identical to kdump (ELF /proc/vmcore), therefore the tools are the same: crash(8), makedumpfile, and others can all be used.
4. There is no need for a special crash kernel image and no need to do a second reboot from the crash kernel.

The following services are expected from firmware in order for fadump to work:

1. Ability to do warm reboot

Preserve memory content across reboot. Firmware must not zero (initialize) memory content. From my experience, this is actually common nowadays: I see this happen on my AMD desktop with x570 chip + UEFI BIOS; we do this at Microsoft both on larger Xeon servers with UEFI firmware, and on small arm64 devices which use device trees instead of EFI for performance reasons, and also to preserve emulated pmem devices across reboot. We also did it at Oracle on SPARC sun4v machines, where the sun4v hypervisor would not reset memory content on every reboot for performance reasons.

2. Ability to register preserved memory region with firmware

The first kernel uses firmware to reserve a region of memory that must be preserved across reboot. Firmware and bootloader must not allocate from preserved regions.

3. Ability to copy boot memory source to destination.

On powerpc, boot must start from a lower address, similar to x86. Boot memory is the region of memory that the kernel can use to boot; the rest is added later, once the kernel decides to unreserve it, i.e. after vmcore is saved.
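The second service above (a preserved region that firmware and bootloader must not allocate from) can be illustrated with a small Python model. This is only a sketch of the bookkeeping, under the assumption that preserved regions are simple non-overlapping physical ranges; the names Region, PreservedMemory, register, may_allocate and allocate are hypothetical and do not correspond to any existing firmware interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    start: int  # first byte of the range (inclusive)
    size: int   # length in bytes

    @property
    def end(self):
        return self.start + self.size  # exclusive

    def overlaps(self, other):
        return self.start < other.end and other.start < self.end


class PreservedMemory:
    """Hypothetical firmware-side table of ranges that survive a warm reboot."""

    def __init__(self):
        self._regions = []

    def register(self, start, size):
        """Called on behalf of the first kernel to mark a range as preserved."""
        if size <= 0:
            raise ValueError("size must be positive")
        new = Region(start, size)
        if any(new.overlaps(r) for r in self._regions):
            raise ValueError("overlaps an already preserved region")
        self._regions.append(new)
        return new

    def may_allocate(self, start, size):
        """True if [start, start + size) touches no preserved region."""
        candidate = Region(start, size)
        return not any(candidate.overlaps(r) for r in self._regions)

    def allocate(self, size, lo, hi, align=0x1000):
        """First-fit allocation in [lo, hi) that skips preserved regions."""
        addr = -(-lo // align) * align  # round lo up to the alignment
        for r in sorted(self._regions, key=lambda r: r.start):
            if addr + size <= r.start:
                break  # fits entirely below this preserved region
            if addr < r.end:
                addr = -(-r.end // align) * align  # skip past it
        if addr + size > hi:
            raise MemoryError("no free range outside preserved regions")
        return addr
```

For example, after register(0x100000, 0x100000), a 2 MiB firmware allocation from the range [0, 0x400000) is placed at 0x200000, above the preserved range, instead of at address 0.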
The boot memory copy is not strictly necessary: on platforms where boot must start from a lower address the panicking kernel can do the copy itself, and on platforms where boot can be done from any address (e.g. ARM64, x64) the copy is not needed at all.

What it comes down to is that there is little that firmware needs to do in order to help Linux do a more reliable crash dump. It must provide a way for the kernel to reserve a region of memory from which firmware/bootloader won't allocate, and optionally, on platforms where the kernel must always boot from a predefined physical address, firmware should be able to copy boot memory content. The rest can be done by the kernel alone.

Support for hardware watchdog resets is a little more complicated, as it would require firmware to copy the CPUs' register contents to a predefined place, but it should also be achievable.

We could agree on an interface that the kernel would support for both EFI based firmware and device-tree based firmware. We could also add this support to open source firmware projects such as linuxboot, coreboot, and OVMF, and to boot loaders such as u-boot and grub.

Pasha