From: Catalin Marinas <catalin.marinas@arm.com>
To: Nicolas Saenz Julienne <nsaenzjulienne@suse.de>
Cc: James Morse <james.morse@arm.com>,
robh+dt@kernel.org, hch@lst.de, ardb@kernel.org,
linux-kernel@vger.kernel.org, devicetree@vger.kernel.org,
lorenzo.pieralisi@arm.com, will@kernel.org,
jeremy.linton@arm.com, iommu@lists.linux-foundation.org,
linux-rpi-kernel@lists.infradead.org, guohanjun@huawei.com,
robin.murphy@arm.com, linux-arm-kernel@lists.infradead.org,
Chen Zhou <chenzhou10@huawei.com>
Subject: Re: [PATCH v6 1/7] arm64: mm: Move reserve_crashkernel() into mem_init()
Date: Fri, 13 Nov 2020 11:29:02 +0000
Message-ID: <20201113112901.GA3212@gaia>
In-Reply-To: <b5336064145a30aadcfdb8920226a8c63f692695.camel@suse.de>
Hi Nicolas,
On Thu, Nov 12, 2020 at 04:56:38PM +0100, Nicolas Saenz Julienne wrote:
> On Tue, 2020-11-10 at 18:17 +0000, Catalin Marinas wrote:
> > On Fri, Nov 06, 2020 at 07:46:29PM +0100, Nicolas Saenz Julienne wrote:
> > > On Thu, 2020-11-05 at 16:11 +0000, James Morse wrote:
> > > > On 03/11/2020 17:31, Nicolas Saenz Julienne wrote:
> > > > > crashkernel might reserve memory located in ZONE_DMA. We plan to delay
> > > > > ZONE_DMA's initialization after unflattening the devicetree and ACPI's
> > > > > boot table initialization, so move it later in the boot process.
> > > > > Specifically into mem_init(), this is the last place crashkernel will be
> > > > > able to reserve the memory before the page allocator kicks in.
> > > > > > There isn't any apparent reason for doing this earlier.
> > > >
> > > > It's so that map_mem() can carve it out of the linear/direct map.
> > > > This is so that stray writes from a crashing kernel can't accidentally corrupt the kdump
> > > > kernel. We depend on this if we continue with kdump, but failed to offline all the other
> > > > CPUs.
> > >
> > > I presume here you refer to arch_kexec_protect_crashkres(), IIUC this will only
> > > happen further down the line, after having loaded the kdump kernel image. But
> > > it also depends on the mappings to be PAGE sized (flags == NO_BLOCK_MAPPINGS |
> > > NO_CONT_MAPPINGS).
> >
> > IIUC, arch_kexec_protect_crashkres() is only for the crashkernel image,
> > not the whole reserved memory that the crashkernel will use. For the
> > latter, we avoid the linear map by marking it as nomap in map_mem().
>
> I'm not sure we're on the same page here, so sorry if this was already implied.
>
> The crashkernel memory mapping is bypassed while preparing the linear mappings
> but it is then mapped right away, with page granularity and !MTE.
> See paging_init()->map_mem():
>
> 	/*
> 	 * Use page-level mappings here so that we can shrink the region
> 	 * in page granularity and put back unused memory to buddy system
> 	 * through /sys/kernel/kexec_crash_size interface.
> 	 */
> 	if (crashk_res.end) {
> 		__map_memblock(pgdp, crashk_res.start, crashk_res.end + 1,
> 			       PAGE_KERNEL,
> 			       NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS);
> 		memblock_clear_nomap(crashk_res.start,
> 				     resource_size(&crashk_res));
> 	}
>
> IIUC the inconvenience here is that we need special mapping options for
> crashkernel and updating those after having mapped that memory as regular
> memory isn't possible/easy to do.
You are right, it still gets mapped but with page granularity. However,
to James' point, we still need to know the crashkernel range in
map_mem() as arch_kexec_protect_crashkres() relies on having page rather
than block mappings.
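
To illustrate the granularity constraint, here is a user-space toy model,
not kernel code. All names and sizes are made up for the example (4KB pages,
2MB blocks), and it only captures the idea that a range's attributes can be
changed in place when the range lines up with the granule it was mapped
with; the real path through __change_memory_common() is stricter than this.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative sizes only: 4KB pages and 2MB blocks (4KB translation granule). */
#define TOY_PAGE_SIZE	(4096ULL)
#define TOY_BLOCK_SIZE	(2ULL * 1024 * 1024)

/*
 * Toy model (hypothetical helper): attributes of [start, start + size) can be
 * updated without splitting a larger mapping only if both the start and the
 * size are multiples of the granule the region was mapped with.
 */
static bool toy_can_change_attrs(uint64_t start, uint64_t size, uint64_t granule)
{
	/* The granule must be a non-zero power of two. */
	assert(granule != 0 && (granule & (granule - 1)) == 0);

	return size != 0 && (start % granule) == 0 && (size % granule) == 0;
}
```

With a 1MB region at 0x80100000, toy_can_change_attrs() succeeds for
TOY_PAGE_SIZE and fails for TOY_BLOCK_SIZE, which is the case we would hit if
the crashkernel range sat inside a block mapping.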
> > > > We also depend on this when skipping the checksum code in purgatory, which can be
> > > > exceedingly slow.
> > >
> > > This one I don't fully understand, so I'll lazily assume the prerequisite is
> > > the same WRT how memory is mapped. :)
> > >
> > > Ultimately there's also /sys/kernel/kexec_crash_size's handling. Same
> > > prerequisite.
> > >
> > > Keeping in mind acpi_table_upgrade() and unflatten_device_tree() depend on
> > > having the linear mappings available.
> >
> > So it looks like reserve_crashkernel() wants to reserve memory before
> > setting up the linear map with the information about the DMA zones in
> > place but that comes later when we can parse the firmware tables.
> >
> > I wonder, instead of not mapping the crashkernel reservation, can we not
> > do an arch_kexec_protect_crashkres() for the whole reservation after we
> > created the linear map?
>
> arch_kexec_protect_crashkres() depends on __change_memory_common() which
> ultimately depends on the memory to be mapped with PAGE_SIZE pages. As I
> comment above, the trick would work as long as there is a way to update the
> linear mappings with whatever crashkernel needs later in the boot process.
Breaking block mappings into pages is a lot more difficult later. OTOH,
the default these days is rodata_full==true, so I don't think we have
block mappings anyway. We could add NO_BLOCK_MAPPINGS if KEXEC_CORE is
enabled.
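
Roughly, the flag selection would then look like the sketch below. This is a
stand-alone user-space rendering, untested against the kernel: the TOY_*
names and values are placeholders for NO_BLOCK_MAPPINGS and
NO_CONT_MAPPINGS, and the three booleans stand in for rodata_full,
debug_pagealloc_enabled() and IS_ENABLED(CONFIG_KEXEC_CORE).

```c
#include <stdbool.h>

/* Placeholder flag values for the example; not taken from the kernel headers. */
#define TOY_NO_BLOCK_MAPPINGS	(1 << 0)
#define TOY_NO_CONT_MAPPINGS	(1 << 1)

/*
 * Hypothetical helper: force page-granular linear mappings whenever any
 * feature needs to change permissions on sub-ranges later on.
 */
static int toy_linear_map_flags(bool rodata_full, bool debug_pagealloc,
				bool kexec_core)
{
	int flags = 0;

	if (rodata_full || debug_pagealloc || kexec_core)
		flags = TOY_NO_BLOCK_MAPPINGS | TOY_NO_CONT_MAPPINGS;

	return flags;
}
```

The point is only that KEXEC_CORE becomes one more condition forcing page
mappings for the whole linear map, rather than a special case for the
crashkernel range.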
> > > Let me stress that knowing the DMA constraints in the system before reserving
> > > crashkernel's regions is necessary if we ever want it to work seamlessly on all
> > > platforms. Be it small stuff like the Raspberry Pi or huge servers with TB of
> > > memory.
> >
> > Indeed. So we have 3 options (so far):
> >
> > 1. Allow the crashkernel reservation to go into the linear map but set
> > it to invalid once allocated.
> >
> > 2. Parse the flattened DT (not sure what we do with ACPI) before
> > creating the linear map. We may have to rely on some SoC ID here
> > instead of actual DMA ranges.
> >
> > 3. Assume the smallest ZONE_DMA possible on arm64 (1GB) for crashkernel
> > reservations and not rely on arm64_dma_phys_limit in
> > reserve_crashkernel().
> >
> > I think (2) we tried hard to avoid. Option (3) brings us back to the
> > issues we had on large crashkernel reservations regressing on some
> > platforms (though it's been a while since, they mostly went quiet ;)).
> > However, with Chen's crashkernel patches we end up with two
> > reservations, one in the low DMA zone and one higher, potentially above
> > 4GB. Having a fixed 1GB limit wouldn't be any worse for crashkernel
> > reservations than what we have now.
> >
> > If (1) works, I'd go for it (James knows this part better than me),
> > otherwise we can go for (3).
>
> Overall, I'd prefer (1) as well, and I'd be happy to have a go at it. If not
> I'll append (3) in this series.
I think for (1) we could also remove the additional KEXEC_CORE checks,
something like below, untested:
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 3e5a6913acc8..27ab609c1c0c 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -477,7 +477,8 @@ static void __init map_mem(pgd_t *pgdp)
 	int flags = 0;
 	u64 i;
 
-	if (rodata_full || debug_pagealloc_enabled())
+	if (rodata_full || debug_pagealloc_enabled() ||
+	    IS_ENABLED(CONFIG_KEXEC_CORE))
 		flags = NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS;
 
 	/*
@@ -487,11 +488,6 @@ static void __init map_mem(pgd_t *pgdp)
 	 * the following for-loop
 	 */
 	memblock_mark_nomap(kernel_start, kernel_end - kernel_start);
-#ifdef CONFIG_KEXEC_CORE
-	if (crashk_res.end)
-		memblock_mark_nomap(crashk_res.start,
-				    resource_size(&crashk_res));
-#endif
 
 	/* map all the memory banks */
 	for_each_mem_range(i, &start, &end) {
@@ -518,21 +514,6 @@ static void __init map_mem(pgd_t *pgdp)
 	__map_memblock(pgdp, kernel_start, kernel_end,
 		       PAGE_KERNEL, NO_CONT_MAPPINGS);
 	memblock_clear_nomap(kernel_start, kernel_end - kernel_start);
-
-#ifdef CONFIG_KEXEC_CORE
-	/*
-	 * Use page-level mappings here so that we can shrink the region
-	 * in page granularity and put back unused memory to buddy system
-	 * through /sys/kernel/kexec_crash_size interface.
-	 */
-	if (crashk_res.end) {
-		__map_memblock(pgdp, crashk_res.start, crashk_res.end + 1,
-			       PAGE_KERNEL,
-			       NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS);
-		memblock_clear_nomap(crashk_res.start,
-				     resource_size(&crashk_res));
-	}
-#endif
 }
 
 void mark_rodata_ro(void)
--
Catalin