From: "Christopher M. Riedl" <cmr@linux.ibm.com>
To: "Nicholas Piggin" <npiggin@gmail.com>, <linuxppc-dev@lists.ozlabs.org>
Cc: <tglx@linutronix.de>, <x86@kernel.org>, <keescook@chromium.org>,
	<linux-hardening@vger.kernel.org>
Subject: Re: [RESEND PATCH v4 08/11] powerpc: Initialize and use a temporary mm for patching
Date: Fri, 09 Jul 2021 00:03:17 -0500	[thread overview]
Message-ID: <CCOCHBK9FCOF.1KLFA4IL2AJNE@oc8246131445.ibm.com> (raw)
In-Reply-To: <1625125043.re5l9zg4kg.astroid@bobo.none>

On Thu Jul 1, 2021 at 2:51 AM CDT, Nicholas Piggin wrote:
> Excerpts from Christopher M. Riedl's message of July 1, 2021 5:02 pm:
> > On Thu Jul 1, 2021 at 1:12 AM CDT, Nicholas Piggin wrote:
> >> Excerpts from Christopher M. Riedl's message of May 6, 2021 2:34 pm:
> >> > When code patching a STRICT_KERNEL_RWX kernel the page containing the
> >> > address to be patched is temporarily mapped as writeable. Currently, a
> >> > per-cpu vmalloc patch area is used for this purpose. While the patch
> >> > area is per-cpu, the temporary page mapping is inserted into the kernel
> >> > page tables for the duration of patching. The mapping is exposed to CPUs
> >> > other than the patching CPU - this is undesirable from a hardening
> >> > perspective. Use a temporary mm instead which keeps the mapping local to
> >> > the CPU doing the patching.
> >> > 
> >> > Use the `poking_init` init hook to prepare a temporary mm and patching
> >> > address. Initialize the temporary mm by copying the init mm. Choose a
> >> > randomized patching address inside the temporary mm userspace address
> >> > space. The patching address is randomized between PAGE_SIZE and
> >> > DEFAULT_MAP_WINDOW-PAGE_SIZE. The upper limit is necessary due to how
> >> > the Book3s64 Hash MMU operates - by default the space above
> >> > DEFAULT_MAP_WINDOW is not available. For now, the patching address for
> >> > all platforms/MMUs is randomized inside this range.  The number of
> >> > possible random addresses is dependent on PAGE_SIZE and limited by
> >> > DEFAULT_MAP_WINDOW.
> >> > 
> >> > Bits of entropy with 64K page size on BOOK3S_64:
> >> > 
> >> >         bits of entropy = log2(DEFAULT_MAP_WINDOW_USER64 / PAGE_SIZE)
> >> > 
> >> >         PAGE_SIZE=64K, DEFAULT_MAP_WINDOW_USER64=128TB
> >> >         bits of entropy = log2(128TB / 64K)
> >> >         bits of entropy = 31
> >> > 
> >> > Randomization occurs only once during initialization at boot.
> >> > 
> >> > Introduce two new functions, map_patch() and unmap_patch(), to
> >> > respectively create and remove the temporary mapping with write
> >> > permissions at patching_addr. The Hash MMU on Book3s64 requires mapping
> >> > the page for patching with PAGE_SHARED since the kernel cannot access
> >> > userspace pages with the PAGE_PRIVILEGED (PAGE_KERNEL) bit set.
> >> > 
> >> > Also introduce hash_prefault_mapping() to preload the SLB entry and HPTE
> >> > for the patching_addr when using the Hash MMU on Book3s64 to avoid
> >> > taking an SLB and Hash fault during patching.
> >>
> >> What prevents the SLBE or HPTE from being removed before the last
> >> access?
> > 
> > This code runs with local IRQs disabled - we also don't access anything
> > else in userspace so I'm not sure what else could cause the entries to
> > be removed TBH.
> > 
> >>
> >>
> >> > +#ifdef CONFIG_PPC_BOOK3S_64
> >> > +
> >> > +static inline int hash_prefault_mapping(pgprot_t pgprot)
> >> >  {
> >> > -	struct vm_struct *area;
> >> > +	int err;
> >> >  
> >> > -	area = get_vm_area(PAGE_SIZE, VM_ALLOC);
> >> > -	if (!area) {
> >> > -		WARN_ONCE(1, "Failed to create text area for cpu %d\n",
> >> > -			cpu);
> >> > -		return -1;
> >> > -	}
> >> > -	this_cpu_write(text_poke_area, area);
> >> > +	if (radix_enabled())
> >> > +		return 0;
> >> >  
> >> > -	return 0;
> >> > -}
> >> > +	err = slb_allocate_user(patching_mm, patching_addr);
> >> > +	if (err)
> >> > +		pr_warn("map patch: failed to allocate slb entry\n");
> >> >  
> >> > -static int text_area_cpu_down(unsigned int cpu)
> >> > -{
> >> > -	free_vm_area(this_cpu_read(text_poke_area));
> >> > -	return 0;
> >> > +	err = hash_page_mm(patching_mm, patching_addr, pgprot_val(pgprot), 0,
> >> > +			   HPTE_USE_KERNEL_KEY);
> >> > +	if (err)
> >> > +		pr_warn("map patch: failed to insert hashed page\n");
> >> > +
> >> > +	/* See comment in switch_slb() in mm/book3s64/slb.c */
> >> > +	isync();
> >>
> >> I'm not sure if this is enough. Could we context switch here? You've
> >> got the PTL, so no with a normal kernel, but maybe yes with an RT kernel.
> >> How about taking a machine check that clears the SLB? Could the HPTE
> >> get removed by something else here?
> > 
> > All of this happens after a local_irq_save() which should at least
> > prevent context switches IIUC.
>
> Ah yeah I didn't look that far back. A machine check can take out SLB
> entries.
>
> > I am not sure what else could cause the
> > HPTE to get removed here.
>
> Other CPUs?
>

Right, because the HPTEs are "global": the hash page table is shared by all
CPUs, so another CPU could evict the entry.

> >> You want to prevent faults because you might be patching a fault
> >> handler?
> > 
> > In a more general sense: I don't think we want to take page faults every
> > time we patch an instruction with a STRICT_RWX kernel. The Hash MMU page
> > fault handler codepath also checks `current->mm` in some places which
> > won't match the temporary mm. Also `current->mm` can be NULL which
> > caused problems in my earlier revisions of this series.
>
> Hmm, that's a bit of a hack then. Maybe doing an actual mm switch and
> setting current->mm properly would explode too much. Maybe that's
> okayish.
> But I can't see how the HPT code is up to the job of this in general
> (even if that current->mm issue was fixed).
>
> To do it without holes you would either have to get the SLB MCE handler
> to restore that particular SLB if it flushed it, or restart the patch
> code from a fixup location if it took an MCE after installing the SLB.
> And bolt a hash table entry.

We discussed this a bit off list and decided that it's not worth the
trouble of implementing per-cpu temporary mm support for Hash at this time.
Instead, I will post a new version of this series where we drop into
realmode to patch under the Hash MMU. This avoids the W+X mapping
altogether and so doesn't expose anything to other CPUs during patching.
We will keep the per-cpu temporary mm for Radix since 1) it doesn't
require the hacks that Hash does and 2) it's overall preferable to dropping
into realmode.
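
To make the Radix direction concrete, here is a very rough sketch of the
kind of flow I mean. This is not the actual patch: use_temporary_mm() and
unuse_temporary_mm() are placeholder names, the instruction is a plain u32
instead of the series' instruction type, and exec_addr is assumed to be in
the linear map (module/vmalloc text would need vmalloc_to_page()):

    #include <linux/mm.h>
    #include <linux/uaccess.h>
    #include <asm/tlbflush.h>

    /* Rough sketch only; helper names and pte plumbing are illustrative. */
    static int patch_via_temp_mm(u32 *exec_addr, u32 insn)
    {
            unsigned long flags;
            spinlock_t *ptl;
            pte_t *ptep;
            u32 __user *patch_addr;
            int err;

            /* No IRQs or preemption while the writable alias exists */
            local_irq_save(flags);

            ptep = get_locked_pte(patching_mm, patching_addr, &ptl);
            if (!ptep) {
                    local_irq_restore(flags);
                    return -ENOMEM;
            }

            /* Map the page containing exec_addr at the randomized patching_addr */
            set_pte_at(patching_mm, patching_addr, ptep,
                       pfn_pte(page_to_pfn(virt_to_page(exec_addr)), PAGE_KERNEL));

            /* Placeholder: switch this CPU onto the temporary mm */
            use_temporary_mm(patching_mm);

            patch_addr = (u32 __user *)(patching_addr | offset_in_page(exec_addr));
            err = __put_user(insn, patch_addr);

            /* Placeholder: switch back to the previous mm */
            unuse_temporary_mm();

            pte_clear(patching_mm, patching_addr, ptep);
            /* The alias never left this CPU, so a local flush is enough on Radix */
            local_flush_tlb_mm(patching_mm);
            pte_unmap_unlock(ptep, ptl);

            local_irq_restore(flags);

            /* Real code would also flush the icache for exec_addr here */
            return err;
    }

The point is that the writable alias only ever exists in patching_mm and is
torn down with a CPU-local flush before IRQs are re-enabled, so nothing is
exposed to other CPUs.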

>
> Thanks,
> Nick


Thread overview: 45+ messages

2021-05-06  4:34 [RESEND PATCH v4 00/11] Use per-CPU temporary mappings for patching Christopher M. Riedl
2021-05-06  4:34 ` [RESEND PATCH v4 01/11] powerpc: Add LKDTM accessor for patching addr Christopher M. Riedl
2021-05-06  4:34 ` [RESEND PATCH v4 02/11] lkdtm/powerpc: Add test to hijack a patch mapping Christopher M. Riedl
2021-05-06  4:34 ` [RESEND PATCH v4 03/11] x86_64: Add LKDTM accessor for patching addr Christopher M. Riedl
2021-05-06  4:34 ` [RESEND PATCH v4 04/11] lkdtm/x86_64: Add test to hijack a patch mapping Christopher M. Riedl
2021-05-06  4:34 ` [RESEND PATCH v4 05/11] powerpc/64s: Add ability to skip SLB preload Christopher M. Riedl
2021-06-21  3:13   ` Daniel Axtens
2021-07-01  3:48     ` Christopher M. Riedl
2021-07-01  4:15       ` Nicholas Piggin
2021-07-01  5:28         ` Christopher M. Riedl
2021-07-01  6:04           ` Nicholas Piggin
2021-07-01  6:53             ` Christopher M. Riedl
2021-07-01  7:37               ` Nicholas Piggin
2021-07-01 11:30                 ` Nicholas Piggin
2021-07-09  4:55                 ` Christopher M. Riedl
2021-05-06  4:34 ` [RESEND PATCH v4 06/11] powerpc: Introduce temporary mm Christopher M. Riedl
2021-05-06  4:34 ` [RESEND PATCH v4 07/11] powerpc/64s: Make slb_allocate_user() non-static Christopher M. Riedl
2021-05-06  4:34 ` [RESEND PATCH v4 08/11] powerpc: Initialize and use a temporary mm for patching Christopher M. Riedl
2021-06-21  3:19   ` Daniel Axtens
2021-07-01  5:11     ` Christopher M. Riedl
2021-07-01  6:12   ` Nicholas Piggin
2021-07-01  7:02     ` Christopher M. Riedl
2021-07-01  7:51       ` Nicholas Piggin
2021-07-09  5:03         ` Christopher M. Riedl [this message]
2021-05-06  4:34 ` [RESEND PATCH v4 09/11] lkdtm/powerpc: Fix code patching hijack test Christopher M. Riedl
2021-05-06  4:34 ` [RESEND PATCH v4 10/11] powerpc: Protect patching_mm with a lock Christopher M. Riedl
2021-05-06 10:51   ` Peter Zijlstra
2021-05-07 20:03     ` Christopher M. Riedl
2021-05-07 22:26       ` Peter Zijlstra
2021-05-06  4:34 ` [RESEND PATCH v4 11/11] powerpc: Use patch_instruction_unlocked() in loops Christopher M. Riedl
