Linux-ARM-Kernel Archive on lore.kernel.org
 help / color / Atom feed
From: James Morse <james.morse@arm.com>
To: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: sashal@kernel.org, corbet@lwn.net, catalin.marinas@arm.com,
	linux-doc@vger.kernel.org, kexec@lists.infradead.org,
	linux-kernel@vger.kernel.org, jmorris@namei.org,
	ebiederm@xmission.com, will@kernel.org,
	linux-arm-kernel@lists.infradead.org
Subject: Re: [RFC v1 0/4] arm64: MMU enabled kexec kernel relocation
Date: Wed, 17 Jul 2019 18:51:41 +0100
Message-ID: <4c8a3a11-adc2-efa4-f765-6be338546ae4@arm.com> (raw)
In-Reply-To: <20190716165641.6990-1-pasha.tatashin@soleen.com>

Hi Pavel,

On 16/07/2019 17:56, Pavel Tatashin wrote:
> Added identity mapped page table, and keep MMU enabled while
> kernel is being relocated from sparse pages to the final
> destination during kexec.

The 'tl;dr' version of this: I strongly urge you to start with the hibernate code that
already covers all these known corner cases. x86 was not a good starting point.


After a quick skim:

This will map 'nomap' regions of memory with cacheable attributes. This is a non-starter.
These regions were described by firmware as having content that was/is written with
different attributes. The attributes must match whenever it is mapped, otherwise we have a
loss of coherency. Mapping this stuff as cacheable means the CPU can prefetch it into the
cache whenever it likes.
It may be important that we do not ever map some of these regions, even though its
described as memory. On AMD-Seattle the bottom page of memory is reserved by firmware for
its own use; it is made secure-only, and any access causes an
external-abort/machine-check. UEFI describes this as 'Reserved', and we preserve this in
the kernel as 'nomap'. The equivalent DT support uses memreserve, possibly with the
'nomap' attribute.

Mapping a 'new'/unknown region with cacheable attributes can never be safe, even if we
trusted kexec-tool to only write the kernel to memory. The host may be using a bigger page
size causing more memory to become cacheable than was intended.
Linux's EFI support rounds the UEFI memory map to the largest support page size, (and
winges about firmware bugs).
If we're allowing kexec to load images in a region not described as IORESOURCE_SYSTEM_RAM,
that is a bug we should fix.

The only way to do this properly is to copy the linear mapping. The arch code has lots of
complex code to generate it correctly at boot, we do not want to duplicate it.
(this is why hibernate copies the linear mapping)


These patches do not remove the running page tables from TTBR1. As you overwrite the live
page tables you will corrupt the state of the CPU. The page-table walker may access things
that aren't memory, cache memory that shouldn't be cached (see above), and allocate
conflicting entries in the TLB.

You cannot use the mm page table helpers to build an idmap on arm64. The mm page table
helpers have a compile-time VA_BITS, and we support systems where there is no memory below
1<<VA_BITS. (crazy huh!). Picking on AMD-Seattle again: if you boot a 4K 39bit VA kernel,
the idmap will have more page table levels than the page table helpers can build. This is
why there are special helpers to load the idmap, and twiddle TCR_EL1.T0SZ.
You already need to copy the linear-map, so using an idmap is extra work. You want to work
with linear-map addresses, you probably need to add the field to the appropriate structure.

The kexec relocation code still runs at EL2. You can't use a copy of the linear map here
as there is only one TTBR on v8.0, and you'd need to setup EL2 as its been torn back to
the hyp-stub. This is the reason hibernate posts EL2 in a holding pen while it rewrites
all of memory, then calls back to fixup EL2. Keeping the rewrite phase at EL1 means it
doesn't need independently tweaking/testing. You need to do something similar, either
calling EL2 to start the new image, or disabling the MMU at EL1 to start the new image there.

You will need to alter the relocation code to do nothing for kdump, as no relocation is
required and building page-tables is extra work where the kernel may croak, preventing us
from reaching kdump.

Finally, having this independent idmap machinery isn't desirable from a maintenance
perspective. Please start with the hibernate code that already solves a very similar
problem, as it already has most of these problems covered.


> This patch series works in terms, that I can kexec-reboot both in QEMU

I wouldn't expect Qemu's emulation of the MMU and caches to be performance accurate.


> and on a physical machine. However, I do not see performance improvement
> during relocation. The performance is just as slow as before with disabled
> caches.

> Am I missing something? Perhaps, there is some flag that I should also
> enable in page table? Please provide me with any suggestions.

Some information about the physical machine you tested this on would help.
I'm guessing its v8.0, and booted at EL2....


Thanks,

James

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

  parent reply index

Thread overview: 10+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-07-16 16:56 Pavel Tatashin
2019-07-16 16:56 ` [RFC v1 1/4] arm64, mm: identity mapped page table Pavel Tatashin
2019-07-16 16:56 ` [RFC v1 2/4] arm64, kexec: interface preparation for mmu enabled kexec Pavel Tatashin
2019-07-16 16:56 ` [RFC v1 3/4] arm64, kexec: add kexec's own identity page table Pavel Tatashin
2019-07-16 16:56 ` [RFC v1 4/4] arm64: Keep MMU on while kernel is being relocated Pavel Tatashin
2019-07-16 19:14 ` [RFC v1 0/4] arm64: MMU enabled kexec kernel relocation Bhupesh Sharma
2019-07-16 19:26   ` Pavel Tatashin
2019-07-17 17:51 ` James Morse [this message]
2019-07-17 19:13   ` Pavel Tatashin
2019-07-26 14:00     ` James Morse

Reply instructions:

You may reply publically to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=4c8a3a11-adc2-efa4-f765-6be338546ae4@arm.com \
    --to=james.morse@arm.com \
    --cc=catalin.marinas@arm.com \
    --cc=corbet@lwn.net \
    --cc=ebiederm@xmission.com \
    --cc=jmorris@namei.org \
    --cc=kexec@lists.infradead.org \
    --cc=linux-arm-kernel@lists.infradead.org \
    --cc=linux-doc@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=pasha.tatashin@soleen.com \
    --cc=sashal@kernel.org \
    --cc=will@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Linux-ARM-Kernel Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linux-arm-kernel/0 linux-arm-kernel/git/0.git
	git clone --mirror https://lore.kernel.org/linux-arm-kernel/1 linux-arm-kernel/git/1.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-arm-kernel linux-arm-kernel/ https://lore.kernel.org/linux-arm-kernel \
		linux-arm-kernel@lists.infradead.org
	public-inbox-index linux-arm-kernel

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.infradead.lists.linux-arm-kernel


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git