[v3,6/7] x86/alternatives: use temporary mm for text poking

Message ID 20181102232946.98461-7-namit@vmware.com
State New
Series
  • x86/alternatives: text_poke() fixes

Commit Message

Nadav Amit Nov. 2, 2018, 11:29 p.m. UTC
text_poke() can potentially compromise security, as it sets temporary PTEs
in the fixmap. These PTEs might be used by other cores to rewrite kernel
code, accidentally or maliciously, if an attacker gains the ability to write
into kernel memory.

Moreover, since remote TLBs are not flushed after the temporary PTEs are
removed, the time window in which the code is writable is not limited if the
fixmap PTEs - maliciously or accidentally - are cached in the TLB.
To address these potential security hazards, we use a temporary mm for
patching the code.

More adventurous developers can try to reorder the init sequence or use
text_poke_early() instead of text_poke() to remove the use of fixmap for
patching completely.

Finally, text_poke() is also not conservative enough when mapping pages: it
always maps two pages, even when a single one is sufficient. So be more
conservative, and do not map more than needed.

Cc: Andy Lutomirski <luto@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Nadav Amit <namit@vmware.com>
---
 arch/x86/include/asm/fixmap.h |   2 -
 arch/x86/kernel/alternative.c | 112 +++++++++++++++++++++++++++-------
 2 files changed, 91 insertions(+), 23 deletions(-)
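
For readers jumping into the series at this patch: temporary_mm_state_t and
the use_temporary_mm()/unuse_temporary_mm() helpers used below are introduced
by an earlier patch in this series. A minimal sketch of what they look like,
assuming the usual switch_mm_irqs_off()-based implementation (details may
differ from the actual series patch):

/*
 * Sketch only -- the real helpers are added to
 * arch/x86/include/asm/mmu_context.h earlier in this series.
 */
typedef struct {
	struct mm_struct *prev;
} temporary_mm_state_t;

static inline temporary_mm_state_t use_temporary_mm(struct mm_struct *mm)
{
	temporary_mm_state_t state;

	/* IRQs must stay off so the temporary mm cannot leak via preemption. */
	lockdep_assert_irqs_disabled();
	state.prev = this_cpu_read(cpu_tlbstate.loaded_mm);
	switch_mm_irqs_off(NULL, mm, current);

	return state;
}

static inline void unuse_temporary_mm(temporary_mm_state_t prev)
{
	/* Switch back to whatever mm was loaded before the poke. */
	lockdep_assert_irqs_disabled();
	switch_mm_irqs_off(NULL, prev.prev, current);
}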

Comments

Peter Zijlstra Nov. 5, 2018, 1:19 p.m. UTC | #1
On Fri, Nov 02, 2018 at 04:29:45PM -0700, Nadav Amit wrote:
> diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
> index 9ceae28db1af..1a40df4db450 100644
> --- a/arch/x86/kernel/alternative.c
> +++ b/arch/x86/kernel/alternative.c

> @@ -699,41 +700,110 @@ __ro_after_init unsigned long poking_addr;
>   */
>  void *text_poke(void *addr, const void *opcode, size_t len)
>  {
> +	bool cross_page_boundary = offset_in_page(addr) + len > PAGE_SIZE;
> +	temporary_mm_state_t prev;
>  	struct page *pages[2];
> +	unsigned long flags;
> +	pte_t pte, *ptep;
> +	spinlock_t *ptl;
>  
>  	/*
> +	 * While the boot memory allocator is running we cannot use struct pages as
> +	 * they are not yet initialized.
>  	 */
>  	BUG_ON(!after_bootmem);
>  
>  	if (!core_kernel_text((unsigned long)addr)) {
>  		pages[0] = vmalloc_to_page(addr);
> +		if (cross_page_boundary)
> +			pages[1] = vmalloc_to_page(addr + PAGE_SIZE);
>  	} else {
>  		pages[0] = virt_to_page(addr);
>  		WARN_ON(!PageReserved(pages[0]));
> +		if (cross_page_boundary)
> +			pages[1] = virt_to_page(addr + PAGE_SIZE);
>  	}
> +
> +	/* TODO: let the caller deal with a failure and fail gracefully. */
>  	BUG_ON(!pages[0]);
> +	BUG_ON(cross_page_boundary && !pages[1]);
>  	local_irq_save(flags);
> +
> +	/*
> +	 * The lock is not really needed, but it allows us to avoid open-coding.
> +	 */
> +	ptep = get_locked_pte(poking_mm, poking_addr, &ptl);
> +
> +	/*
> +	 * If we failed to allocate a PTE, fail silently. The caller (text_poke)

we _are_ text_poke()..

> +	 * will detect that the write failed when it compares the memory with
> +	 * the new opcode.
> +	 */
> +	if (unlikely(!ptep))
> +		goto out;

This is the one site I'm a little uncomfortable with; OTOH it really
never should happen, since we explicitly instantiate these page-tables
earlier.

Can't we simply assume ptep will not be zero here? Like with so many
boot time memory allocations, we mostly assume they'll work.
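
A minimal sketch of that alternative, purely for illustration (the comment
wording below is not from the patch):

	ptep = get_locked_pte(poking_mm, poking_addr, &ptl);
	/*
	 * poking_mm's page tables are pre-allocated during init, so a
	 * missing PTE here would be a kernel bug; treat it as fatal
	 * rather than silently skipping the write.
	 */
	BUG_ON(!ptep);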

> +	pte = mk_pte(pages[0], PAGE_KERNEL);
> +	set_pte_at(poking_mm, poking_addr, ptep, pte);
> +
> +	if (cross_page_boundary) {
> +		pte = mk_pte(pages[1], PAGE_KERNEL);
> +		set_pte_at(poking_mm, poking_addr + PAGE_SIZE, ptep + 1, pte);
> +	}
> +
> +	/*
> +	 * Loading the temporary mm behaves as a compiler barrier, which
> +	 * guarantees that the PTE will be set at the time memcpy() is done.
> +	 */
> +	prev = use_temporary_mm(poking_mm);
> +
> +	kasan_disable_current();
> +	memcpy((u8 *)poking_addr + offset_in_page(addr), opcode, len);
> +	kasan_enable_current();
> +
> +	/*
> +	 * Ensure that the PTE is only cleared after the instructions of memcpy
> +	 * were issued by using a compiler barrier.
> +	 */
> +	barrier();
> +
> +	pte_clear(poking_mm, poking_addr, ptep);
> +
> +	/*
> +	 * __flush_tlb_one_user() performs a redundant TLB flush when PTI is on,
> +	 * as it also flushes the corresponding "user" address space, which
> +	 * does not exist.
> +	 *
> +	 * Poking, however, is already very inefficient since it does not try to
> +	 * batch updates, so we ignore this problem for the time being.
> +	 *
> +	 * Since the PTEs do not exist in other kernel address-spaces, we do
> +	 * not use __flush_tlb_one_kernel(), which when PTI is on would cause
> +	 * more unwarranted TLB flushes.
> +	 *
> +	 * There is a slight anomaly here: the PTE is supervisor-only and
> +	 * (potentially) global and we use __flush_tlb_one_user() but this
> +	 * should be fine.
> +	 */
> +	__flush_tlb_one_user(poking_addr);
> +	if (cross_page_boundary) {
> +		pte_clear(poking_mm, poking_addr + PAGE_SIZE, ptep + 1);
> +		__flush_tlb_one_user(poking_addr + PAGE_SIZE);
> +	}
> +
> +	/*
> +	 * Loading the previous page-table hierarchy requires a serializing
> +	 * instruction that already allows the core to see the updated version.
> +	 * Xen-PV is assumed to serialize execution in a similar manner.
> +	 */
> +	unuse_temporary_mm(prev);
> +
> +	pte_unmap_unlock(ptep, ptl);
> +out:
> +	/*
> +	 * TODO: allow the callers to deal with potential failures and do not
> +	 * panic so easily.
> +	 */
> +	BUG_ON(memcmp(addr, opcode, len));
>  	local_irq_restore(flags);
>  	return addr;
>  }
Peter Zijlstra Nov. 5, 2018, 1:30 p.m. UTC | #2
On Fri, Nov 02, 2018 at 04:29:45PM -0700, Nadav Amit wrote:
> +	unuse_temporary_mm(prev);
> +
> +	pte_unmap_unlock(ptep, ptl);

That; that does kunmap_atomic() on 32bit.

I've been thinking that the whole kmap_atomic thing on x86_32 is
terminally broken, and with that most of x86_32 is.

kmap_atomic does the per-cpu fixmap pte fun-and-games we're here saying
is broken. Yes, only the one CPU will (explicitly) use those fixmap PTEs
and thus the local invalidate _should_ work. However nothing prohibits
speculation on another CPU from using our fixmap addresses. Which can
lead to the remote CPU populating its TLBs for our fixmap entry.

And, as we've found, there are AMD parts that #MC when there are
mis-matched TLB entries.

So what do we do? mark x86_32 SMP broken?
Nadav Amit Nov. 5, 2018, 6:04 p.m. UTC | #3
From: Peter Zijlstra
Sent: November 5, 2018 at 1:30:41 PM GMT
> To: Nadav Amit <namit@vmware.com>
> Subject: Re: [PATCH v3 6/7] x86/alternatives: use temporary mm for text poking
> 
> 
> On Fri, Nov 02, 2018 at 04:29:45PM -0700, Nadav Amit wrote:
>> +	unuse_temporary_mm(prev);
>> +
>> +	pte_unmap_unlock(ptep, ptl);
> 
> That; that does kunmap_atomic() on 32bit.
> 
> I've been thinking that the whole kmap_atomic thing on x86_32 is
> terminally broken, and with that most of x86_32 is.
> 
> kmap_atomic does the per-cpu fixmap pte fun-and-games we're here saying
> is broken. Yes, only the one CPU will (explicitly) use those fixmap PTEs
> and thus the local invalidate _should_ work. However nothing prohibits
> speculation on another CPU from using our fixmap addresses. Which can
> lead to the remote CPU populating its TLBs for our fixmap entry.
> 
> And, as we've found, there are AMD parts that #MC when there are
> mis-matched TLB entries.
> 
> So what do we do? mark x86_32 SMP broken?

pte_unmap() seems to only use kunmap_atomic() when CONFIG_HIGHPTE is set, no?

Do most distributions run with CONFIG_HIGHPTE?
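
For reference, the x86_32 definitions in question look roughly like this
(paraphrased from arch/x86/include/asm/pgtable_32.h of that era; a sketch,
not the exact source):

#ifdef CONFIG_HIGHPTE
#define pte_offset_map(dir, address)				\
	((pte_t *)kmap_atomic(pmd_page(*(dir))) + pte_index((address)))
#define pte_unmap(pte) kunmap_atomic((pte))
#else
#define pte_offset_map(dir, address)				\
	((pte_t *)page_address(pmd_page(*(dir))) + pte_index((address)))
#define pte_unmap(pte) do { } while (0)
#endif

So without CONFIG_HIGHPTE, the pte_unmap() inside pte_unmap_unlock() is a
no-op and no kmap_atomic() slot is involved.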
Peter Zijlstra Nov. 6, 2018, 8:20 a.m. UTC | #4
On Mon, Nov 05, 2018 at 06:04:42PM +0000, Nadav Amit wrote:
> From: Peter Zijlstra
> Sent: November 5, 2018 at 1:30:41 PM GMT
> > To: Nadav Amit <namit@vmware.com>
> > Subject: Re: [PATCH v3 6/7] x86/alternatives: use temporary mm for text poking
> > 
> > 
> > On Fri, Nov 02, 2018 at 04:29:45PM -0700, Nadav Amit wrote:
> >> +	unuse_temporary_mm(prev);
> >> +
> >> +	pte_unmap_unlock(ptep, ptl);
> > 
> > That; that does kunmap_atomic() on 32bit.
> > 
> > I've been thinking that the whole kmap_atomic thing on x86_32 is
> > terminally broken, and with that most of x86_32 is.
> > 
> > kmap_atomic does the per-cpu fixmap pte fun-and-games we're here saying
> > is broken. Yes, only the one CPU will (explicitly) use those fixmap PTEs
> > and thus the local invalidate _should_ work. However nothing prohibits
> > speculation on another CPU from using our fixmap addresses. Which can
> > lead to the remote CPU populating its TLBs for our fixmap entry.
> > 
> > And, as we've found, there are AMD parts that #MC when there are
> > mis-matched TLB entries.
> > 
> > So what do we do? mark x86_32 SMP broken?
> 
> pte_unmap() seems to only use kunmap_atomic() when CONFIG_HIGHPTE is set, no?
> 
> Do most distributions run with CONFIG_HIGHPTE?

Sure; but all of x86_32 relies on kmap_atomic. This was just the one
way I ran into it again.

By our current way of thinking, kmap_atomic simply is not correct.
Peter Zijlstra Nov. 6, 2018, 1:11 p.m. UTC | #5
On Tue, Nov 06, 2018 at 09:20:19AM +0100, Peter Zijlstra wrote:

> By our current way of thinking, kmap_atomic simply is not correct.

Something like the below; which weirdly builds an x86_32 kernel.
Although I imagine a very sad one.

---

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index ba7e3464ee92..e273f3879d04 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1449,6 +1449,16 @@ config PAGE_OFFSET
 config HIGHMEM
 	def_bool y
 	depends on X86_32 && (HIGHMEM64G || HIGHMEM4G)
+	depends on !SMP || BROKEN
+	help
+	  By current thinking kmap_atomic() is broken, since it relies on per
+	  CPU PTEs in the global (kernel) address space and relies on CPU local
+	  TLB invalidates to completely invalidate these PTEs. However there is
+	  nothing that guarantees other CPUs will not speculatively touch upon
+	  'our' fixmap PTEs and load them into their TLBs, after which our
+	  local TLB invalidate will not invalidate them.
+
+	  There are AMD chips that will #MC on inconsistent TLB states.
 
 config X86_PAE
 	bool "PAE (Physical Address Extension) Support"
Nadav Amit Nov. 6, 2018, 6:11 p.m. UTC | #6
From: Peter Zijlstra
Sent: November 6, 2018 at 1:11:19 PM GMT
> To: Nadav Amit <namit@vmware.com>
> Subject: Re: [PATCH v3 6/7] x86/alternatives: use temporary mm for text poking
> 
> 
> On Tue, Nov 06, 2018 at 09:20:19AM +0100, Peter Zijlstra wrote:
> 
>> By our current way of thinking, kmap_atomic simply is not correct.
> 
> Something like the below; which weirdly builds an x86_32 kernel.
> Although I imagine a very sad one.
> 
> ---
> 
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index ba7e3464ee92..e273f3879d04 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -1449,6 +1449,16 @@ config PAGE_OFFSET
> config HIGHMEM
> 	def_bool y
> 	depends on X86_32 && (HIGHMEM64G || HIGHMEM4G)
> +	depends on !SMP || BROKEN
> +	help
> +	  By current thinking kmap_atomic() is broken, since it relies on per
> +	  CPU PTEs in the global (kernel) address space and relies on CPU local
> +	  TLB invalidates to completely invalidate these PTEs. However there is
> +	  nothing that guarantees other CPUs will not speculatively touch upon
> +	  'our' fixmap PTEs and load them into their TLBs, after which our
> +	  local TLB invalidate will not invalidate them.
> +
> +	  There are AMD chips that will #MC on inconsistent TLB states.
> 
> config X86_PAE
> 	bool "PAE (Physical Address Extension) Support"

Please help me understand the scenario you are worried about. I see several
(potentially) concerning situations due to long lived mappings:

1. Inconsistent cachability in the PAT (between two different mappings of
the same physical memory), causing memory ordering issues.

2. Inconsistent access-control (between two different mappings of the same
physical memory), allowing to circumvent security hardening mechanisms.

3. Invalid cachability in the PAT for MMIO, causing #MC

4. Faulty memory being mapped, causing #MC

5. Some potential data leakage due to long lived mappings

The #MC you mention, I think, regards something that resembles (3) -
speculative page-walks using cachable memory caused #MC when this memory was
set on an MMIO region. This memory, IIUC, was mistakenly presumed to be used by
page-tables, so I don’t see how it is relevant for kmap_atomic().

As for the other situations, excluding (2), which this series is intended to
deal with, I don’t see a huge problem which cannot be resolved by different
means.
Peter Zijlstra Nov. 6, 2018, 7:08 p.m. UTC | #7
On Tue, Nov 06, 2018 at 06:11:18PM +0000, Nadav Amit wrote:
> From: Peter Zijlstra
> > On Tue, Nov 06, 2018 at 09:20:19AM +0100, Peter Zijlstra wrote:
> > 
> >> By our current way of thinking, kmap_atomic simply is not correct.
> > 
> > Something like the below; which weirdly builds an x86_32 kernel.
> > Although I imagine a very sad one.
> > 
> > ---
> > 
> > diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> > index ba7e3464ee92..e273f3879d04 100644
> > --- a/arch/x86/Kconfig
> > +++ b/arch/x86/Kconfig
> > @@ -1449,6 +1449,16 @@ config PAGE_OFFSET
> > config HIGHMEM
> > 	def_bool y
> > 	depends on X86_32 && (HIGHMEM64G || HIGHMEM4G)
> > +	depends on !SMP || BROKEN
> > +	help
> > +	  By current thinking kmap_atomic() is broken, since it relies on per
> > +	  CPU PTEs in the global (kernel) address space and relies on CPU local
> > +	  TLB invalidates to completely invalidate these PTEs. However there is
> > +	  nothing that guarantees other CPUs will not speculatively touch upon
> > +	  'our' fixmap PTEs and load them into their TLBs, after which our
> > +	  local TLB invalidate will not invalidate them.
> > +
> > +	  There are AMD chips that will #MC on inconsistent TLB states.
> > 
> > config X86_PAE
> > 	bool "PAE (Physical Address Extension) Support”
> 
> Please help me understand the scenario you are worried about. I see several
> (potentially) concerning situations due to long lived mappings:
> 
> 1. Inconsistent cachability in the PAT (between two different mappings of
> the same physical memory), causing memory ordering issues.
> 
> 2. Inconsistent access-control (between two different mappings of the same
> physical memory), allowing to circumvent security hardening mechanisms.
> 
> 3. Invalid cachability in the PAT for MMIO, causing #MC
> 
> 4. Faulty memory being mapped, causing #MC
> 
> 5. Some potential data leakage due to long lived mappings
> 
> The #MC you mention, I think, regards something that resembles (3) -
> speculative page-walks using cachable memory caused #MC when this memory was
> set on an MMIO region. This memory, IIUC, was mistakenly presumed to be used by
> page-tables, so I don’t see how it is relevant for kmap_atomic().
> 
> As for the other situations, excluding (2), which this series is intended to
> deal with, I don’t see a huge problem which cannot be resolved by different
> means.

mostly #3 and related I think; kmap_atomic is a stack and any entry can
be used for whatever is needed. When the remote CPU does a speculative
hit on our fixmap entry, that translation will get populated.

When we then unmap and flush (locally) and re-establish that mapping for
something else, the CPU might #MC because the translations are
incompatible.

Imagine one being some MMIO mapping for i915 and another being a regular
user address with incompatible cachability or something.

Now the remote CPU will never actually use those translations except for
speculation. But I'm terribly uncomfortable with this.

It might all just work; but not doing global flushes for global mapping
changes makes me itch.

Patch

diff --git a/arch/x86/include/asm/fixmap.h b/arch/x86/include/asm/fixmap.h
index 50ba74a34a37..9da8cccdf3fb 100644
--- a/arch/x86/include/asm/fixmap.h
+++ b/arch/x86/include/asm/fixmap.h
@@ -103,8 +103,6 @@  enum fixed_addresses {
 #ifdef CONFIG_PARAVIRT
 	FIX_PARAVIRT_BOOTMAP,
 #endif
-	FIX_TEXT_POKE1,	/* reserve 2 pages for text_poke() */
-	FIX_TEXT_POKE0, /* first page is last, because allocation is backward */
 #ifdef	CONFIG_X86_INTEL_MID
 	FIX_LNW_VRTC,
 #endif
diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index 9ceae28db1af..1a40df4db450 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -11,6 +11,7 @@ 
 #include <linux/stop_machine.h>
 #include <linux/slab.h>
 #include <linux/kdebug.h>
+#include <linux/mmu_context.h>
 #include <asm/text-patching.h>
 #include <asm/alternative.h>
 #include <asm/sections.h>
@@ -699,41 +700,110 @@  __ro_after_init unsigned long poking_addr;
  */
 void *text_poke(void *addr, const void *opcode, size_t len)
 {
-	unsigned long flags;
-	char *vaddr;
+	bool cross_page_boundary = offset_in_page(addr) + len > PAGE_SIZE;
+	temporary_mm_state_t prev;
 	struct page *pages[2];
-	int i;
+	unsigned long flags;
+	pte_t pte, *ptep;
+	spinlock_t *ptl;
 
 	/*
-	 * While boot memory allocator is runnig we cannot use struct
-	 * pages as they are not yet initialized.
+	 * While the boot memory allocator is running we cannot use struct pages as
+	 * they are not yet initialized.
 	 */
 	BUG_ON(!after_bootmem);
 
 	if (!core_kernel_text((unsigned long)addr)) {
 		pages[0] = vmalloc_to_page(addr);
-		pages[1] = vmalloc_to_page(addr + PAGE_SIZE);
+		if (cross_page_boundary)
+			pages[1] = vmalloc_to_page(addr + PAGE_SIZE);
 	} else {
 		pages[0] = virt_to_page(addr);
 		WARN_ON(!PageReserved(pages[0]));
-		pages[1] = virt_to_page(addr + PAGE_SIZE);
+		if (cross_page_boundary)
+			pages[1] = virt_to_page(addr + PAGE_SIZE);
 	}
+
+	/* TODO: let the caller deal with a failure and fail gracefully. */
 	BUG_ON(!pages[0]);
+	BUG_ON(cross_page_boundary && !pages[1]);
 	local_irq_save(flags);
-	set_fixmap(FIX_TEXT_POKE0, page_to_phys(pages[0]));
-	if (pages[1])
-		set_fixmap(FIX_TEXT_POKE1, page_to_phys(pages[1]));
-	vaddr = (char *)fix_to_virt(FIX_TEXT_POKE0);
-	memcpy(&vaddr[(unsigned long)addr & ~PAGE_MASK], opcode, len);
-	clear_fixmap(FIX_TEXT_POKE0);
-	if (pages[1])
-		clear_fixmap(FIX_TEXT_POKE1);
-	local_flush_tlb();
-	sync_core();
-	/* Could also do a CLFLUSH here to speed up CPU recovery; but
-	   that causes hangs on some VIA CPUs. */
-	for (i = 0; i < len; i++)
-		BUG_ON(((char *)addr)[i] != ((char *)opcode)[i]);
+
+	/*
+	 * The lock is not really needed, but it allows us to avoid open-coding.
+	 */
+	ptep = get_locked_pte(poking_mm, poking_addr, &ptl);
+
+	/*
+	 * If we failed to allocate a PTE, fail silently. The caller (text_poke)
+	 * will detect that the write failed when it compares the memory with
+	 * the new opcode.
+	 */
+	if (unlikely(!ptep))
+		goto out;
+
+	pte = mk_pte(pages[0], PAGE_KERNEL);
+	set_pte_at(poking_mm, poking_addr, ptep, pte);
+
+	if (cross_page_boundary) {
+		pte = mk_pte(pages[1], PAGE_KERNEL);
+		set_pte_at(poking_mm, poking_addr + PAGE_SIZE, ptep + 1, pte);
+	}
+
+	/*
+	 * Loading the temporary mm behaves as a compiler barrier, which
+	 * guarantees that the PTE will be set at the time memcpy() is done.
+	 */
+	prev = use_temporary_mm(poking_mm);
+
+	kasan_disable_current();
+	memcpy((u8 *)poking_addr + offset_in_page(addr), opcode, len);
+	kasan_enable_current();
+
+	/*
+	 * Ensure that the PTE is only cleared after the instructions of memcpy
+	 * were issued by using a compiler barrier.
+	 */
+	barrier();
+
+	pte_clear(poking_mm, poking_addr, ptep);
+
+	/*
+	 * __flush_tlb_one_user() performs a redundant TLB flush when PTI is on,
+	 * as it also flushes the corresponding "user" address space, which
+	 * does not exist.
+	 *
+	 * Poking, however, is already very inefficient since it does not try to
+	 * batch updates, so we ignore this problem for the time being.
+	 *
+	 * Since the PTEs do not exist in other kernel address-spaces, we do
+	 * not use __flush_tlb_one_kernel(), which when PTI is on would cause
+	 * more unwarranted TLB flushes.
+	 *
+	 * There is a slight anomaly here: the PTE is supervisor-only and
+	 * (potentially) global and we use __flush_tlb_one_user() but this
+	 * should be fine.
+	 */
+	__flush_tlb_one_user(poking_addr);
+	if (cross_page_boundary) {
+		pte_clear(poking_mm, poking_addr + PAGE_SIZE, ptep + 1);
+		__flush_tlb_one_user(poking_addr + PAGE_SIZE);
+	}
+
+	/*
+	 * Loading the previous page-table hierarchy requires a serializing
+	 * instruction that already allows the core to see the updated version.
+	 * Xen-PV is assumed to serialize execution in a similar manner.
+	 */
+	unuse_temporary_mm(prev);
+
+	pte_unmap_unlock(ptep, ptl);
+out:
+	/*
+	 * TODO: allow the callers to deal with potential failures and do not
+	 * panic so easily.
+	 */
+	BUG_ON(memcmp(addr, opcode, len));
 	local_irq_restore(flags);
 	return addr;
 }