linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Baoquan He <bhe@redhat.com>
To: Mike Rapoport <rppt@linux.ibm.com>
Cc: Steve Wahl <steve.wahl@hpe.com>,
	David Hildenbrand <david@redhat.com>,
	mgorman@suse.de, linux-kernel@vger.kernel.org,
	linux-mm@kvack.org, akpm@linux-foundation.org, cai@lca.pw,
	mhocko@kernel.org
Subject: Re: [PATCH] mm/compaction: Fix the incorrect hole in fast_isolate_freepages()
Date: Mon, 1 Jun 2020 21:31:59 +0800	[thread overview]
Message-ID: <20200601133159.GM26955@MiWiFi-R3L-srv> (raw)
In-Reply-To: <20200601114246.GC8502@linux.ibm.com>

On 06/01/20 at 02:42pm, Mike Rapoport wrote:
> On Thu, May 28, 2020 at 10:15:10AM -0500, Steve Wahl wrote:
> > On Thu, May 28, 2020 at 05:07:31PM +0800, Baoquan He wrote:
> > > On 05/26/20 at 01:49pm, David Hildenbrand wrote:
> > > > On 26.05.20 13:32, Mike Rapoport wrote:
> > > > > Hello Baoquan,
> > > > > 
> > > > > I do not object to your original fix with careful check for pfn validity.
> > > > > 
> > > > > But I still think that the memory reserved by the firmware is still
> > > > > memory and it should be added to memblock.memory. This way the memory
> > > > 
> > > > If it's really memory that could be read/written, I think I agree.
> > > 
> > > I would say some of them may not be allowed to be read/written, if I
> > > understand it correctly. I roughly went through the x86 init code, there
> > > are some places where mem region is marked as E820_TYPE_RESERVED so that
> > > they are not touched after initialization. E.g:
> > > 
> > > 1) pfn 0
> > > In trim_bios_range(), we set the pfn 0 as E820_TYPE_RESERVED. You can
> > > see the code comment, this is a BIOS owned area, but not kernel RAM.
> > > 
> > > 2)GART reserved region
> > > In early_gart_iommu_check(), GART IOMMU firmware will reserve a region
> > > in an area, firmware designer won't map system RAM into that area.
> > > 
> > > And also intel_graphics_stolen(), arch_rmrr_sanity_check(), these
> > > regions are not system RAM backed area, reading from or writting into
> > > these area may cause error.
> > > 
> > > Futhermore, there's a KASLR bug found by HPE, its triggering and root
> > > cause are written into below commit log. You can see that accessing to
> > > firmware reserved region caused BIOS to halt system when cpu doing
> > > speculative.
> > > 
> > > commit 2aa85f246c181b1fa89f27e8e20c5636426be624
> > > Author: Steve Wahl <steve.wahl@hpe.com>
> > > Date:   Tue Sep 24 16:03:55 2019 -0500
> > > 
> > >     x86/boot/64: Make level2_kernel_pgt pages invalid outside kernel area
> > > 
> > >     Our hardware (UV aka Superdome Flex) has address ranges marked
> > >     reserved by the BIOS. Access to these ranges is caught as an error,
> > >     causing the BIOS to halt the system.
> > 
> > Thank you for CC'ing me.  Coming into the middle of the conversation,
> > I am trying to understand the implications of what's being discussed
> > here, but not being very successful at it.
> > 
> > For the HPE UV hardware, the addresses listed in the reserved range
> > must never be accessed, or (as we discovered) even be reachable by an
> > active page table entry.  Any active page table entry covering the
> > region allows the processor to issue speculative accesses to the
> > region, resulting in the BIOS halt mentioned.
> > 
> > I'm not sure if the discussion above about adding the region to
> > memblock.memory would result in an active mapping of the region or
> > not.  If it did, we'd have problems.
> 
> The discussion is whether regions marked as E820_RESERVED should be
> considered as RAM or not.

IMHO, the discussion is about whether firmware reserved regions like
E820_TYPE_RESERVED should be added into memory allocators.

> 
> For the hardware that cannot tolerate acceses to these areas like HPE
> UV, it should not :)
> 
> I still think that keeping them only in memblock.reserved creates more
> problems than it solves, but simply adding E820_RESERVED areas to
> memblock.memory just won't work.

Currently, we hardly add E820_RESERVED into memblock, including
memblock.memmory and memblock.reserved. But, in several places, people
add them to memblock.reserved when they have marked them as
E820_TYPE_RESERVED.

I agree with David that these pages aren't available for any later
memory allocator. "the node/zone is just completely irrelevant for
these pages". Keeping them out of the memory management system is safer
so that nobody won't touch them. I tried to mark them out with a new
type PageUnavail as David suggested, am still struggling to think of
where I should detect and exclude them.

Below is a draft patch about my attempt on marking out unavailable pages.
Any comment or suggestions are welcomed. Of course for any better idea
with code change, I would like to test and review. Hope we can finally fix
this issue after its exposure. 

From 25ca041258a40bb315f94a23f0465b510b1614a0 Mon Sep 17 00:00:00 2001
From: Baoquan He <bhe@redhat.com>
Date: Fri, 29 May 2020 20:23:13 +0800
Subject: [RFC PATCH] mm: Add a new page type to mark unavailable pages

Firstly, the last user of early_pfn_valid() is memmap_init_zone(), but that code
has been optimized away. Let's remove it.

Secondly, add a new page type to mark unavailale pages.

Thirdly, add a new early_pfn_valid() so that init_unavailable_range()
can initialize those pages in memblock.reserved.

Signed-off-by: Baoquan He <bhe@redhat.com>
---
 arch/x86/include/asm/mmzone_32.h |  2 --
 include/linux/mmzone.h           | 18 ++++++++++--------
 include/linux/page-flags.h       |  8 ++++++++
 mm/page_alloc.c                  |  3 ++-
 4 files changed, 20 insertions(+), 11 deletions(-)

diff --git a/arch/x86/include/asm/mmzone_32.h b/arch/x86/include/asm/mmzone_32.h
index 73d8dd14dda2..4da45eff2942 100644
--- a/arch/x86/include/asm/mmzone_32.h
+++ b/arch/x86/include/asm/mmzone_32.h
@@ -49,8 +49,6 @@ static inline int pfn_valid(int pfn)
 	return 0;
 }
 
-#define early_pfn_valid(pfn)	pfn_valid((pfn))
-
 #endif /* CONFIG_DISCONTIGMEM */
 
 #endif /* _ASM_X86_MMZONE_32_H */
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index e8d7d0b0acf4..47d09be73dca 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -1315,20 +1315,23 @@ static inline int pfn_section_valid(struct mem_section *ms, unsigned long pfn)
 #endif
 
 #ifndef CONFIG_HAVE_ARCH_PFN_VALID
-static inline int pfn_valid(unsigned long pfn)
+static inline int early_pfn_valid(unsigned long pfn)
 {
-	struct mem_section *ms;
-
 	if (pfn_to_section_nr(pfn) >= NR_MEM_SECTIONS)
 		return 0;
-	ms = __nr_to_section(pfn_to_section_nr(pfn));
-	if (!valid_section(ms))
+	return valid_section(__pfn_to_section(pfn))
+}
+
+static inline int pfn_valid(unsigned long pfn)
+{
+	if (!early_pfn_valid(pfn))
 		return 0;
 	/*
 	 * Traditionally early sections always returned pfn_valid() for
 	 * the entire section-sized span.
 	 */
-	return early_section(ms) || pfn_section_valid(ms, pfn);
+	return (early_section(ms) && PageUnavail(pfn_to_page(pfn))) ||
+	       pfn_section_valid(ms, pfn);
 }
 #endif
 
@@ -1364,7 +1367,6 @@ static inline unsigned long next_present_section_nr(unsigned long section_nr)
 #define pfn_to_nid(pfn)		(0)
 #endif
 
-#define early_pfn_valid(pfn)	pfn_valid(pfn)
 void sparse_init(void);
 #else
 #define sparse_init()	do {} while (0)
@@ -1385,7 +1387,7 @@ struct mminit_pfnnid_cache {
 };
 
 #ifndef early_pfn_valid
-#define early_pfn_valid(pfn)	(1)
+#define early_pfn_valid(pfn)	pfn_valid(pfn)
 #endif
 
 void memory_present(int nid, unsigned long start, unsigned long end);
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 222f6f7b2bb3..1c011008100c 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -740,6 +740,7 @@ PAGEFLAG_FALSE(DoubleMap)
 #define PG_kmemcg	0x00000200
 #define PG_table	0x00000400
 #define PG_guard	0x00000800
+#define PG_unavail	0x00001000
 
 #define PageType(page, flag)						\
 	((page->page_type & (PAGE_TYPE_BASE | flag)) == PAGE_TYPE_BASE)
@@ -796,6 +797,13 @@ PAGE_TYPE_OPS(Table, table)
  */
 PAGE_TYPE_OPS(Guard, guard)
 
+/*
+ *  Mark pages which are unavailable to memory allocators after boot.
+ *  They includes pages reserved by firmware pre-boot or during boot,
+ *  holes which share the same setcion with usable RAM.
+ */
+PAGE_TYPE_OPS(Unavail, unavail)
+
 extern bool is_free_buddy_page(struct page *page);
 
 __PAGEFLAG(Isolated, isolated, PF_ANY);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 940cdce96864..e9ebef6ffbec 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6972,7 +6972,8 @@ static u64 __init init_unavailable_range(unsigned long spfn, unsigned long epfn)
 		 * (in memblock.reserved but not in memblock.memory) will
 		 * get re-initialized via reserve_bootmem_region() later.
 		 */
-		__init_single_page(pfn_to_page(pfn), pfn, 0, 0);
+		mm_zero_struct_page(pfn_to_page(pfn));
+		__SetPageUnavail(pfn_to_page(pfn));
 		__SetPageReserved(pfn_to_page(pfn));
 		pgcnt++;
 	}
-- 
2.17.2



  reply	other threads:[~2020-06-01 13:32 UTC|newest]

Thread overview: 21+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-05-21  1:44 Baoquan He
2020-05-21  9:26 ` Mike Rapoport
2020-05-21 15:52   ` Baoquan He
2020-05-21 17:18     ` Mike Rapoport
2020-05-22  7:01       ` Baoquan He
2020-05-22  7:25         ` Baoquan He
2020-05-22 14:20           ` Mike Rapoport
2020-05-26  8:45             ` Baoquan He
2020-05-26  8:55               ` David Hildenbrand
2020-05-26 11:32               ` Mike Rapoport
2020-05-26 11:49                 ` David Hildenbrand
2020-05-28  9:07                   ` Baoquan He
2020-05-28  9:08                     ` David Hildenbrand
2020-05-28 15:15                     ` Steve Wahl
2020-06-01 11:42                       ` Mike Rapoport
2020-06-01 13:31                         ` Baoquan He [this message]
2020-05-21  9:36 ` Mel Gorman
2020-05-21 13:38   ` Mike Rapoport
2020-05-21 15:41   ` Baoquan He
2020-05-28  8:59 [PATCH] mm/compaction: Fix the incorrect hole in fast_isolate_freepages() ^[ Baoquan He
2020-05-28  9:08 ` Baoquan He

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20200601133159.GM26955@MiWiFi-R3L-srv \
    --to=bhe@redhat.com \
    --cc=akpm@linux-foundation.org \
    --cc=cai@lca.pw \
    --cc=david@redhat.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mgorman@suse.de \
    --cc=mhocko@kernel.org \
    --cc=rppt@linux.ibm.com \
    --cc=steve.wahl@hpe.com \
    --subject='Re: [PATCH] mm/compaction: Fix the incorrect hole in fast_isolate_freepages()' \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).