All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/1] x86: restore the write back cache of reserved RAM in iounmap()
@ 2020-11-19 17:59 Andrea Arcangeli
  2020-11-19 17:59 ` [PATCH 1/1] " Andrea Arcangeli
  2020-11-19 18:02 ` [PATCH 0/1] " Christoph Hellwig
  0 siblings, 2 replies; 5+ messages in thread
From: Andrea Arcangeli @ 2020-11-19 17:59 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar
  Cc: Andi Kleen, Rafael Aquini, Waiman Long, linux-kernel

Hello everyone,

We identified some PCD set on the direct mapping causing hardly
reproducible performance issues and this patch fixes the ultimate root
cause.

The caller for now has been tweaked to avoid triggering this case (now
that we know about it) however if the observations on the proposed fix
aren't correct, it'd be great if we could still do some other change
to ioremap/iounmap and perhaps the other memtype APIs, to be sure
those PCD/PWT leftovers won't happen again a few years from now in
another place.

For example one more complex alternative would be to use the
page_mapcount of reserved pages (currently unused) to do proper
refcounting on the overlapping ioremap so you can resync the kernel
direct mapping to write back only at the very last iounmap.

Or a much simpler alternative that would remain fully neutral for
overlapping ioremaps, would be to overwrite all page_count of reserved
RAM in a way that if they're ever freed later it'll trigger a crash
during __free_pages.

==
// SPDX-License-Identifier: GPL-2.0-only
/*
 *  ioremap.c
 *
 *  Copyright (C) 2020  Red Hat, Inc.
 *
 *  Reproducer for bug io iounmap.
 */

#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/mm.h>
#include <asm/io.h>

#define NR_PAGES 512
#define REPRODUCE

static struct page *pages[NR_PAGES];
static void __iomem *map[NR_PAGES];

int init_module(void)
{
	int i;
	for (i = 0; i < NR_PAGES; i++) {
		pages[i] = alloc_pages(GFP_KERNEL|__GFP_NOWARN, MAX_ORDER-1);
		if (!pages[i])
			break;
		__SetPageReserved(pages[i]);
#ifdef REPRODUCE
		map[i] = ioremap(page_to_phys(pages[i]), 1L<<(MAX_ORDER-1));
		WARN_ON_ONCE(!map[i]);
#endif
	}

	return 0;
}

void cleanup_module(void)
{
	int i;
	for (i = 0; i < NR_PAGES; i++) {
		if (map[i])
			iounmap(map[i]);
		if (!pages[i])
			break;
		__ClearPageReserved(pages[i]);
		__free_pages(pages[i], MAX_ORDER-1);
	}
}

MODULE_LICENSE("GPL");
==

Andrea Arcangeli (1):
  x86: restore the write back cache of reserved RAM in iounmap()

 arch/x86/mm/ioremap.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)


^ permalink raw reply	[flat|nested] 5+ messages in thread

* [PATCH 1/1] x86: restore the write back cache of reserved RAM in iounmap()
  2020-11-19 17:59 [PATCH 0/1] x86: restore the write back cache of reserved RAM in iounmap() Andrea Arcangeli
@ 2020-11-19 17:59 ` Andrea Arcangeli
  2020-11-19 21:01   ` Thomas Gleixner
  2020-11-19 18:02 ` [PATCH 0/1] " Christoph Hellwig
  1 sibling, 1 reply; 5+ messages in thread
From: Andrea Arcangeli @ 2020-11-19 17:59 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar
  Cc: Andi Kleen, Rafael Aquini, Waiman Long, linux-kernel

If reserved memory is mapped with ioremap_noncache() or ioremap_wc(),
the kernel correctly splits the direct mapping and marks the PAGE_SIZE
granular region uncached, so both the virtual direct mapping and the
second virtual mapping in vmap space will be both marked uncached or
write through (i.e.  _PAGE_PCD/PWT set on the pagetable).

However when iounmap is called later, nothing restores the direct
mapping write back memtype.

So if kernel executes this sequence:

   SetPageReserved
   ioremap_nocache
   iounmap
   ClearPageReserved

if the page is ever freed later it remains "uncached" indefinitely.

Those uncached regions can be tiny compared to the total size of the
RAM, so it may take a long time until a performance critical piece of
kernel memory gets allocated in a page that is uncached in the direct
mapping, long after the iounmap that left it uncached.

However when it eventually happens, it generates unexpected severe and
non reproducible kernel slowdowns.

The fix consist in restoring the original write back cache on reserved
RAM at iounmap() time. This is preferable than supporting multiple
overlapping ioremap on the same physical range because:

- the malfunction will happen when an ioremap is still outstanding,
  and ideally it'll happen synchronously at iounmap time, so it should
  be easier to track down than by scanning all kernel pagetables
  searching for any PCD/PWT leftover bits

- two ioremap at the same time of non write back memtype are already
  forbidden by the bugcheck in reserve_ram_page_type that verifies the
  current page type is still _PAGE_CACHE_MODE_WB before
  proceeding. And all ioremap are of write back memtype the patch will
  not make a difference

- even if two ioremap at the same on RAM would be allowed, the caller
  would need to still enforce they all have the same memtype, so it is
  more likely able to enforce that it doesn't do overlapping ioremaps
  at once than to be able to undo the changes to the direct mapping
  pagetables

Fixes: f56d005d3034 ("x86: no CPA on iounmap")
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 arch/x86/mm/ioremap.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/arch/x86/mm/ioremap.c b/arch/x86/mm/ioremap.c
index 9e5ccc56f8e0..65dbc88edf43 100644
--- a/arch/x86/mm/ioremap.c
+++ b/arch/x86/mm/ioremap.c
@@ -437,6 +437,7 @@ EXPORT_SYMBOL(ioremap_prot);
 void iounmap(volatile void __iomem *addr)
 {
 	struct vm_struct *p, *o;
+	u64 p_start, p_end;
 
 	if ((void __force *)addr <= high_memory)
 		return;
@@ -472,12 +473,17 @@ void iounmap(volatile void __iomem *addr)
 		return;
 	}
 
-	memtype_free(p->phys_addr, p->phys_addr + get_vm_area_size(p));
+	p_start = p->phys_addr;
+	p_end = p_start + get_vm_area_size(p);
+	memtype_free(p_start, p_end);
 
 	/* Finally remove it */
 	o = remove_vm_area((void __force *)addr);
 	BUG_ON(p != o || o == NULL);
 	kfree(p);
+	if (o)
+		memtype_kernel_map_sync(p_start, p_end,
+					_PAGE_CACHE_MODE_WB);
 }
 EXPORT_SYMBOL(iounmap);
 


^ permalink raw reply related	[flat|nested] 5+ messages in thread

* Re: [PATCH 0/1] x86: restore the write back cache of reserved RAM in iounmap()
  2020-11-19 17:59 [PATCH 0/1] x86: restore the write back cache of reserved RAM in iounmap() Andrea Arcangeli
  2020-11-19 17:59 ` [PATCH 1/1] " Andrea Arcangeli
@ 2020-11-19 18:02 ` Christoph Hellwig
  2020-11-19 19:03   ` Andrea Arcangeli
  1 sibling, 1 reply; 5+ messages in thread
From: Christoph Hellwig @ 2020-11-19 18:02 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Thomas Gleixner, Ingo Molnar, Andi Kleen, Rafael Aquini,
	Waiman Long, linux-kernel

On Thu, Nov 19, 2020 at 12:59:01PM -0500, Andrea Arcangeli wrote:
> Hello everyone,
> 
> We identified some PCD set on the direct mapping causing hardly
> reproducible performance issues and this patch fixes the ultimate root
> cause.
> 
> The caller for now has been tweaked to avoid triggering this case (now
> that we know about it)

What is the callers?  The whole SetPageReservered + ioremap* thing
you mention in the actual patch is completely bogus.  I think we'll
need to reject that as well and fix the caller.

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH 0/1] x86: restore the write back cache of reserved RAM in iounmap()
  2020-11-19 18:02 ` [PATCH 0/1] " Christoph Hellwig
@ 2020-11-19 19:03   ` Andrea Arcangeli
  0 siblings, 0 replies; 5+ messages in thread
From: Andrea Arcangeli @ 2020-11-19 19:03 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Thomas Gleixner, Ingo Molnar, Andi Kleen, Rafael Aquini,
	Waiman Long, linux-kernel

Hello Christoph,

On Thu, Nov 19, 2020 at 06:02:06PM +0000, Christoph Hellwig wrote:
> What is the callers?  The whole SetPageReservered + ioremap* thing
> you mention in the actual patch is completely bogus.  I think we'll
> need to reject that as well and fix the caller.

The actual caller is not so much the focus here: the point here is to
be able to either handle the caller gracefully or to get a synchronous
kernel crash in __free_pages.

Otherwise the problem induced by such a caller (no matter if right or
wrong) becomes hardly debuggable.

The caller in question was the EFI_BOOT_SERVICE_DATA that is aliased
on non RAM but then freed later by swapping RAM under it.

Of course the caller has already been changed to stick to write back
and that specific caller is not a concern anymore. My concern is if we
leave the callee (iounmap) as it is, what does guarantee us that we
won't hit again in production a few years down the road?

When I first read the caller it felt nothing should have gone wrong,
it looked ok even the version that would leave PCD leftovers bits in
the direct map. So I didn't get why switching to write back would
prevent the PCD leftovers until I looked at the callee (iounmap).

Thanks,
Andrea


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH 1/1] x86: restore the write back cache of reserved RAM in iounmap()
  2020-11-19 17:59 ` [PATCH 1/1] " Andrea Arcangeli
@ 2020-11-19 21:01   ` Thomas Gleixner
  0 siblings, 0 replies; 5+ messages in thread
From: Thomas Gleixner @ 2020-11-19 21:01 UTC (permalink / raw)
  To: Andrea Arcangeli, Ingo Molnar
  Cc: Andi Kleen, Rafael Aquini, Waiman Long, linux-kernel

On Thu, Nov 19 2020 at 12:59, Andrea Arcangeli wrote:
> If reserved memory is mapped with ioremap_noncache() or ioremap_wc(),
> the kernel correctly splits the direct mapping and marks the PAGE_SIZE
> granular region uncached, so both the virtual direct mapping and the
> second virtual mapping in vmap space will be both marked uncached or
> write through (i.e.  _PAGE_PCD/PWT set on the pagetable).
>
> However when iounmap is called later, nothing restores the direct
> mapping write back memtype.

Darn. This was discussed in 2008 already and survived that long?

  https://lore.kernel.org/lkml/20080205011357.GA14712@linux-os.sc.intel.com/

No idea how that happened to slip through...

Thanks,

        tglx


^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2020-11-19 21:01 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-11-19 17:59 [PATCH 0/1] x86: restore the write back cache of reserved RAM in iounmap() Andrea Arcangeli
2020-11-19 17:59 ` [PATCH 1/1] " Andrea Arcangeli
2020-11-19 21:01   ` Thomas Gleixner
2020-11-19 18:02 ` [PATCH 0/1] " Christoph Hellwig
2020-11-19 19:03   ` Andrea Arcangeli

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.