Re: [PATCH] arm64: kexec: add support for kexec with spin-table

From: Henry Willard <henry.willard@oracle.com>
To: Mark Rutland <mark.rutland@arm.com>
Cc: "catalin.marinas@arm.com" <catalin.marinas@arm.com>,
	"will@kernel.org" <will@kernel.org>,
	"tabba@google.com" <tabba@google.com>,
	"keescook@chromium.org" <keescook@chromium.org>,
	"ardb@kernel.org" <ardb@kernel.org>,
	"samitolvanen@google.com" <samitolvanen@google.com>,
	"joe@perches.com" <joe@perches.com>,
	"nixiaoming@huawei.com" <nixiaoming@huawei.com>,
	"linux-arm-kernel@lists.infradead.org" 
	<linux-arm-kernel@lists.infradead.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH] arm64: kexec: add support for kexec with spin-table
Date: Thu, 15 Jul 2021 00:08:27 +0000
Message-ID: <A5B4FB88-ECDA-43D6-9369-93F2096D7EAE@oracle.com>
In-Reply-To: <20210714184733.GB28555@C02TD0UTHF1T.local>

Hi, Mark,
Thanks for reviewing this. I am not in a position to go into too much detail about the particular device, but the u-boot we are using is the u-boot we have to use, at least for now. We would have preferred to have PSCI, but that option is not available. Modifying u-boot is not an option.

It is possible to do this without relying on the spin-table loop. I implemented such a version using the kexec control code page before I got my hands on the device that actually uses spin-table. That implementation needed changes in a lot of places, because the secondary CPUs had to leave the control code page before the boot CPU enters the new kernel. Reusing the spin-table loop simplified things quite a bit.

This has been useful to us, so we thought we would pass it along to see if it is useful to anyone else in the same situation.

> On Jul 14, 2021, at 11:47 AM, Mark Rutland <mark.rutland@arm.com> wrote:
> 
> Hi Henry,
> 
> On Wed, Jul 14, 2021 at 10:41:13AM -0700, Henry Willard wrote:
>> With one special exception kexec is not supported on systems
>> that use spin-table as the cpu enablement method instead of PSCI.
>> The spin-table implementation lacks cpu_die() and several other
>> methods needed by the hotplug framework used by kexec on Arm64.
>> 
>> Some embedded systems may not have a need for the Arm Trusted
>> Firmware, or they may lack it during early bring-up. Some of
>> these may have a more primitive version of u-boot that uses a
>> special device from which to load the kernel. Kexec can be
>> especially useful for testing new kernels in such an environment.
>> 
>> What is needed to support kexec is some place for cpu_die to park
>> the secondary CPUs outside the kernel while the primary copies
>> the new kernel into place and starts it. One possibility is to
>> use the control-code-page where arm64_relocate_new_kernel()
>> executes, but that requires a complicated and racy dance to get
>> the secondary CPUs from the control-code-page to the new
>> kernel after it has been copied.
>> 
>> The spin-table mechanism is set up before the Linux kernel
>> is entered with details provided in the device tree. The
>> "cpu-release-addr" DT property provides the address of a word the
>> secondary CPUs are polling. The boot CPU will store the real address
>> of secondary_holding_pen() at that address, and the secondary CPUs
>> will branch to that address. secondary_holding_pen() is another
>> loop where the secondary CPUs wait to be called up by the boot CPU.
>> 
>> This patch uses that mechanism to implement cpu_die(). In modern
>> versions of u-boot that implement spin-table, the address of the
>> loop in protected memory can be derived from the "cpu-release-addr"
>> value. The patch validates the existence of the loop before
>> proceeding. smp_spin_table_cpu_die() uses cpu_soft_restart() to
>> branch to the loop with the MMU and caching turned off where the
>> CPU waits until released by the new kernel. After that kexec
>> reboot proceeds normally.
> 
> This isn't true for all spin-table implementations; for example this is
> not safe with the boot-wrapper.
> 
> While I'm not necessarily opposed to providing a mechanism to return a
> CPU back to the spin-table, the presence of that mechanism needs to be
> explicitly defined in the device tree (e.g. with a "cpu-return-addr"
> property or similar), and we need to thoroughly document the contract
> (e.g. what state the CPU is in when it is returned). We've generally
> steered clear of this since it is much more complicated than it may
> initially seem, and there is immense scope for error.
> 
> If we do choose to extend spin-table in this way, we'll also need to
> enforce that each cpu has a unique cpu-release-address, or this is
> unsound to begin with (since e.g. the kernel can't return CPUs that it
> doesn't know are stuck in the holding pen). We will also need a
> mechanism to reliably identify when the CPU has been successfully
> returned.
> 
> I would very much like to avoid this if possible. U-Boot does have a
> PSCI implementation that some platforms use; is it not possible to use
> this?

Unfortunately, no. If we had that we would never have bothered with this.

> 
> If this is for early bringup, and you're using the first kernel as a
> bootloader, I'd suggest that you boot that with "nosmp", such that the
> first kernel doesn't touch the secondary CPUs at all.

The particular case that spawned this is past that. There are a number of reasons why we need to be able to kexec a new kernel. Being able to bypass the kernel installation process, which is a little more complicated than normal, to test new kernels is an added benefit.

> 
>> The special exception is the kdump capture kernel, which gets
>> started even if the secondaries can't be stopped.
>> 
>> Signed-off-by: Henry Willard <henry.willard@oracle.com>
>> ---
>> arch/arm64/kernel/smp_spin_table.c | 111 +++++++++++++++++++++++++++++++++++++
>> 1 file changed, 111 insertions(+)
>> 
>> diff --git a/arch/arm64/kernel/smp_spin_table.c b/arch/arm64/kernel/smp_spin_table.c
>> index 7e1624ecab3c..35c7fa764476 100644
>> --- a/arch/arm64/kernel/smp_spin_table.c
>> +++ b/arch/arm64/kernel/smp_spin_table.c
>> @@ -13,16 +13,27 @@
>> #include <linux/mm.h>
>> 
>> #include <asm/cacheflush.h>
>> +#include <asm/daifflags.h>
>> #include <asm/cpu_ops.h>
>> #include <asm/cputype.h>
>> #include <asm/io.h>
>> #include <asm/smp_plat.h>
>> +#include <asm/mmu_context.h>
>> +#include <asm/kexec.h>
>> +
>> +#include "cpu-reset.h"
>> 
>> extern void secondary_holding_pen(void);
>> volatile unsigned long __section(".mmuoff.data.read")
>> secondary_holding_pen_release = INVALID_HWID;
>> 
>> static phys_addr_t cpu_release_addr[NR_CPUS];
>> +static unsigned int spin_table_loop[4] = {
>> +	0xd503205f,        /* wfe */
>> +	0x58000060,        /* ldr  x0, spin_table_cpu_release_addr */
>> +	0xb4ffffc0,        /* cbz  x0, 0b */
>> +	0xd61f0000         /* br   x0 */
>> +};
>> 
>> /*
>>  * Write secondary_holding_pen_release in a way that is guaranteed to be
>> @@ -119,9 +130,109 @@ static int smp_spin_table_cpu_boot(unsigned int cpu)
>> 	return 0;
>> }
>> 
>> +
>> +/*
>> + * There is a four instruction loop set aside in protected
>> + * memory by u-boot where secondary CPUs wait for the kernel to
>> + * start.
>> + *
>> + * 0:       wfe
>> + *          ldr    x0, spin_table_cpu_release_addr
>> + *          cbz    x0, 0b
>> + *          br     x0
>> + * spin_table_cpu_release_addr:
>> + *          .quad  0
>> + *
>> + * The address of spin_table_cpu_release_addr is passed in the
>> + * "cpu-release-addr" property in the device tree.
>> + * smp_spin_table_cpu_prepare() stores the real address of
>> + * secondary_holding_pen() where the secondary CPUs loop
>> + * until they are released one at a time by smp_spin_table_cpu_boot().
>> + * We reuse the spin-table loop by clearing spin_table_cpu_release_addr,
>> + * and branching to the beginning of the loop via cpu_soft_restart(),
>> + * which turns off the MMU and caching.
>> + */
>> +static void smp_spin_table_cpu_die(unsigned int cpu)
>> +{
>> +	__le64 __iomem *release_addr;
>> +	unsigned int *spin_table_inst;
>> +	unsigned long spin_table_start;
>> +
>> +	if (!cpu_release_addr[cpu])
>> +		goto spin;
>> +
>> +	spin_table_start = (cpu_release_addr[cpu] - sizeof(spin_table_loop));
>> +
>> +	/*
>> +	 * The cpu-release-addr may or may not be inside the linear mapping.
>> +	 * As ioremap_cache will either give us a new mapping or reuse the
>> +	 * existing linear mapping, we can use it to cover both cases. In
>> +	 * either case the memory will be MT_NORMAL.
>> +	 */
>> +	release_addr = ioremap_cache(spin_table_start,
>> +				sizeof(*release_addr) +
>> +				sizeof(spin_table_loop));
>> +
>> +	if (!release_addr)
>> +		goto spin;
>> +
>> +	spin_table_inst = (unsigned int *)release_addr;
>> +	if (spin_table_inst[0] != spin_table_loop[0] ||
>> +		spin_table_inst[1] != spin_table_loop[1] ||
>> +		spin_table_inst[2] != spin_table_loop[2] ||
>> +		spin_table_inst[3] != spin_table_loop[3])
>> +		goto spin;
> 
> Please don't hard-code a specific sequence for this; if we *really* need
> this, we should be given a cpu-return-addr explicitly, and we should
> simply trust it.

That would require changes to u-boot. The purpose of the check is to detect a new version of u-boot that ships a different loop. That seems a remote possibility, since this particular loop has been unchanged for quite some time, and it works well.
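
For reference, here is a rough user-space sketch of what the four words encode and why an exact match implies the layout the patch relies on. It is only an illustration, not the kernel code: the helper names (imm19_bytes, loop_matches, and so on) are made up for this mail, and it assumes the standard A64 encoding, where LDR (literal) and CBZ both carry a signed 19-bit word offset in bits [23:5].

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* The four instruction words the patch expects to find (from the patch). */
static const uint32_t expected_loop[4] = {
	0xd503205f, /* wfe */
	0x58000060, /* ldr x0, <literal> */
	0xb4ffffc0, /* cbz x0, <back to wfe> */
	0xd61f0000, /* br  x0 */
};

/* Hypothetical helper: sign-extend imm19 (bits [23:5]) and scale to bytes. */
static int64_t imm19_bytes(uint32_t insn)
{
	int32_t imm = (int32_t)((insn >> 5) & 0x7ffff);

	if (imm & 0x40000)
		imm -= 0x80000;
	return (int64_t)imm * 4;
}

/* True when the top byte is 0xb4, i.e. a 64-bit CBZ (CBNZ would be 0xb5). */
static bool is_cbz_x(uint32_t insn)
{
	return (insn & 0xff000000u) == 0xb4000000u;
}

/* Exact comparison against the expected loop, as the patch does word by word. */
static bool loop_matches(const uint32_t *mem)
{
	return memcmp(mem, expected_loop, sizeof(expected_loop)) == 0;
}

/* Byte offset from the loop start that the ldr (at offset 4) reads from. */
static int64_t ldr_target_offset(void)
{
	return 4 + imm19_bytes(expected_loop[1]);
}

/* Byte offset from the loop start that the cbz (at offset 8) branches to. */
static int64_t cbz_target_offset(void)
{
	return 8 + imm19_bytes(expected_loop[2]);
}
```

With these words the literal load reads from offset 0x10 and the conditional branch goes back to offset 0, so a u-boot whose loop matches exactly also has the release word directly after the loop.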

> 
>> +
>> +	/*
>> +	 * Clear the release address, so that we can use it again
>> +	 */
>> +	writeq_relaxed(0, release_addr + 2);
>> +	dcache_clean_inval_poc((__force unsigned long)(release_addr + 2),
>> +			(__force unsigned long)(release_addr + 2) +
>> +				    sizeof(*release_addr));
> 
> What is the `+ 2` for?

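Yeah, I could have been clearer. release_addr is a pointer to 64-bit words, so + 2 advances 16 bytes. The spin_table_cpu_release_addr variable sits at offset 0x10 from the start of the spin-table loop, immediately after its four instructions.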
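
To make the arithmetic concrete, here is a small user-space sketch of the layout. The struct and function names are hypothetical and exist only for this illustration; the patch itself works on raw pointers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of the region u-boot sets aside. */
struct spin_table_region {
	uint32_t loop[4];       /* wfe; ldr x0, <literal>; cbz x0, 0b; br x0 */
	uint64_t release_addr;  /* spin_table_cpu_release_addr */
};

/* Bytes covered by "+ 2" when the base pointer points to 64-bit words. */
static size_t plus_two_offset_bytes(void)
{
	return 2 * sizeof(uint64_t);
}

/*
 * Start of the loop, derived from the release address as in the patch:
 * cpu_release_addr[cpu] - sizeof(spin_table_loop).
 */
static uint64_t loop_start_from_release(uint64_t cpu_release_addr)
{
	return cpu_release_addr - sizeof(((struct spin_table_region *)0)->loop);
}
```

So with a made-up release address of 0x80000010, the loop starts at 0x80000000, and adding 2 to a 64-bit pointer at the loop start lands back on the release word.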

> 
>> +
>> +	iounmap(release_addr);
>> +
>> +	local_daif_mask();
>> +	cpu_soft_restart(spin_table_start, 0, 0, 0);
>> +
>> +	BUG();  /* Should never get here */
>> +
>> +spin:
>> +	cpu_park_loop();
>> +
>> +}
>> +
>> +static int smp_spin_table_cpu_kill(unsigned int cpu)
>> +{
>> +	unsigned long start, end;
>> +
>> +	start = jiffies;
>> +	end = start + msecs_to_jiffies(100);
>> +
>> +	do {
>> +		if (!cpu_online(cpu)) {
>> +			pr_info("CPU%d killed\n", cpu);
>> +			return 0;
>> +		}
>> +	} while (time_before(jiffies, end));
>> +	pr_warn("CPU%d may not have shut down cleanly\n", cpu);
>> +	return -ETIMEDOUT;
>> +
>> +}
> 
> If we're going to extend this, we must add a mechanism to reliably
> identify when the CPU has been returned successfully. We can't rely on
> cpu_online(), because there's a window between the CPU marking itself as
> offline and actually exiting the kernel.
> 
>> +
>> +/* Nothing to do here */
>> +static int smp_spin_table_cpu_disable(unsigned int cpu)
>> +{
>> +	return 0;
>> +}
> 
> For implementations where we cannot return the CPU, cpu_disable() *must*
> fail.
> 
> Thanks,
> Mark.

Thanks for taking the time to review this.

Henry