* [PATCH -v4 00/13] x86, mm: init_memory_mapping cleanup
@ 2012-09-30  7:57 Yinghai Lu
  2012-09-30  7:57 ` [PATCH 01/13] x86, mm: Add global page_size_mask and probe one time only Yinghai Lu
                   ` (12 more replies)
  0 siblings, 13 replies; 57+ messages in thread
From: Yinghai Lu @ 2012-09-30  7:57 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Jacob Shin, Tejun Heo
  Cc: linux-kernel, Yinghai Lu

Currently the kernel initializes the memory mapping for [0, TOML) and [4G, TOMH).
Some AMD systems have a memory hole between 4G and TOMH, around 1T.

According to HPA, we should only map RAM ranges.

This patchset:
1. Separates calculate_table_space_size and find_early_table_space out of
   init_memory_mapping.

2. Allocates the page table buffer one time for all ranges.

3. Initializes the mapping for each RAM range one by one.
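
For reference, a rough sketch of the resulting init_mem_mapping() flow once the
whole series is applied (simplified; the good_end selection and the pgt_buf
reservation are omitted, see patch 13 for the real code):

	void __init init_mem_mapping(void)
	{
		unsigned long tables = 0;

		probe_page_size_mask();		/* probe PSE/GB-page support once */

		/* pass 1: how much page-table space do the E820_RAM ranges need? */
		walk_ram_ranges(size_work_fn, &tables);
		find_early_table_space(0, good_end, tables);

		/* pass 2: map only E820_RAM ranges, one by one */
		walk_ram_ranges(mapping_work_fn, NULL);
	}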

The series can be found at:
git://git.kernel.org/pub/scm/linux/kernel/git/yinghai/linux-yinghai.git for-x86-mm

Jacob Shin (4):
  x86: if kernel .text .data .bss are not marked as E820_RAM, complain and fix
  x86: Fixup code testing if a pfn is direct mapped
  x86: Only direct map addresses that are marked as E820_RAM
  x86/mm: calculate_table_space_size based on memory ranges that are being mapped

Yinghai Lu (9):
  x86, mm: Add global page_size_mask and probe one time only
  x86, mm: Split out split_mem_range from init_memory_mapping
  x86, mm: Move init_memory_mapping calling out of setup.c
  x86, mm: Revert back good_end setting for 64bit
  x86, mm: Find early page table buffer altogether
  x86, mm: Separate out calculate_table_space_size()
  x86, mm: Move down two calculate_table_space_size down.
  x86, mm: Set memblock initial limit to 1M
  x86, mm: Use func pointer to table size calculation and mapping

 arch/x86/include/asm/init.h       |    1 -
 arch/x86/include/asm/page_types.h |    2 +
 arch/x86/include/asm/pgtable.h    |    1 +
 arch/x86/kernel/cpu/amd.c         |    8 +-
 arch/x86/kernel/setup.c           |   36 +++--
 arch/x86/mm/init.c                |  350 +++++++++++++++++++++++++------------
 arch/x86/mm/init_64.c             |    6 +-
 arch/x86/platform/efi/efi.c       |    8 +-
 8 files changed, 273 insertions(+), 139 deletions(-)

-- 
1.7.7



* [PATCH 01/13] x86, mm: Add global page_size_mask and probe one time only
  2012-09-30  7:57 [PATCH -v4 00/13] x86, mm: init_memory_mapping cleanup Yinghai Lu
@ 2012-09-30  7:57 ` Yinghai Lu
  2012-09-30  7:57 ` [PATCH 02/13] x86, mm: Split out split_mem_range from init_memory_mapping Yinghai Lu
                   ` (11 subsequent siblings)
  12 siblings, 0 replies; 57+ messages in thread
From: Yinghai Lu @ 2012-09-30  7:57 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Jacob Shin, Tejun Heo
  Cc: linux-kernel, Yinghai Lu

Currently we pass around use_gbpages and use_pse to calculate the page table
size. Later we will need to calculate the page table size for every RAM range,
which means those calculations would be done several times.

That information is the same for all RAM ranges, so it can be stored in
page_size_mask and probed only one time.

Move the probing code out of init_memory_mapping into a separate function,
probe_page_size_mask(), and call it once before all init_memory_mapping calls.
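
As a rough illustration (not part of the patch text), the call order in
setup_arch() becomes:

	init_gbpages();
	probe_page_size_mask();		/* probe PSE/PGE/gbpages support once */

	/* all later init_memory_mapping() calls just consume page_size_mask */
	max_low_pfn_mapped = init_memory_mapping(0, max_low_pfn << PAGE_SHIFT);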

Suggested-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Reviewed-by: Pekka Enberg <penberg@kernel.org>
---
 arch/x86/include/asm/pgtable.h |    1 +
 arch/x86/kernel/setup.c        |    1 +
 arch/x86/mm/init.c             |   66 +++++++++++++++++++---------------------
 3 files changed, 33 insertions(+), 35 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 402704f..c6f5779 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -598,6 +598,7 @@ static inline int pgd_none(pgd_t pgd)
 #ifndef __ASSEMBLY__
 
 extern int direct_gbpages;
+void probe_page_size_mask(void);
 
 /* local pte updates need not use xchg for locking */
 static inline pte_t native_local_ptep_get_and_clear(pte_t *ptep)
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 4f16547..20581d7 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -912,6 +912,7 @@ void __init setup_arch(char **cmdline_p)
 	setup_real_mode();
 
 	init_gbpages();
+	probe_page_size_mask();
 
 	/* max_pfn_mapped is updated here */
 	max_low_pfn_mapped = init_memory_mapping(0, max_low_pfn<<PAGE_SHIFT);
diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index ab1f6a9..7903d54 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -35,8 +35,10 @@ struct map_range {
 	unsigned page_size_mask;
 };
 
-static void __init find_early_table_space(struct map_range *mr, unsigned long end,
-					  int use_pse, int use_gbpages)
+static int page_size_mask;
+
+static void __init find_early_table_space(struct map_range *mr,
+					  unsigned long end)
 {
 	unsigned long puds, pmds, ptes, tables, start = 0, good_end = end;
 	phys_addr_t base;
@@ -44,7 +46,7 @@ static void __init find_early_table_space(struct map_range *mr, unsigned long en
 	puds = (end + PUD_SIZE - 1) >> PUD_SHIFT;
 	tables = roundup(puds * sizeof(pud_t), PAGE_SIZE);
 
-	if (use_gbpages) {
+	if (page_size_mask & (1 << PG_LEVEL_1G)) {
 		unsigned long extra;
 
 		extra = end - ((end>>PUD_SHIFT) << PUD_SHIFT);
@@ -54,7 +56,7 @@ static void __init find_early_table_space(struct map_range *mr, unsigned long en
 
 	tables += roundup(pmds * sizeof(pmd_t), PAGE_SIZE);
 
-	if (use_pse) {
+	if (page_size_mask & (1 << PG_LEVEL_2M)) {
 		unsigned long extra;
 
 		extra = end - ((end>>PMD_SHIFT) << PMD_SHIFT);
@@ -90,6 +92,30 @@ static void __init find_early_table_space(struct map_range *mr, unsigned long en
 		(pgt_buf_top << PAGE_SHIFT) - 1);
 }
 
+void probe_page_size_mask(void)
+{
+#if !defined(CONFIG_DEBUG_PAGEALLOC) && !defined(CONFIG_KMEMCHECK)
+	/*
+	 * For CONFIG_DEBUG_PAGEALLOC, identity mapping will use small pages.
+	 * This will simplify cpa(), which otherwise needs to support splitting
+	 * large pages into small in interrupt context, etc.
+	 */
+	if (direct_gbpages)
+		page_size_mask |= 1 << PG_LEVEL_1G;
+	if (cpu_has_pse)
+		page_size_mask |= 1 << PG_LEVEL_2M;
+#endif
+
+	/* Enable PSE if available */
+	if (cpu_has_pse)
+		set_in_cr4(X86_CR4_PSE);
+
+	/* Enable PGE if available */
+	if (cpu_has_pge) {
+		set_in_cr4(X86_CR4_PGE);
+		__supported_pte_mask |= _PAGE_GLOBAL;
+	}
+}
 void __init native_pagetable_reserve(u64 start, u64 end)
 {
 	memblock_reserve(start, end - start);
@@ -125,45 +151,15 @@ static int __meminit save_mr(struct map_range *mr, int nr_range,
 unsigned long __init_refok init_memory_mapping(unsigned long start,
 					       unsigned long end)
 {
-	unsigned long page_size_mask = 0;
 	unsigned long start_pfn, end_pfn;
 	unsigned long ret = 0;
 	unsigned long pos;
-
 	struct map_range mr[NR_RANGE_MR];
 	int nr_range, i;
-	int use_pse, use_gbpages;
 
 	printk(KERN_INFO "init_memory_mapping: [mem %#010lx-%#010lx]\n",
 	       start, end - 1);
 
-#if defined(CONFIG_DEBUG_PAGEALLOC) || defined(CONFIG_KMEMCHECK)
-	/*
-	 * For CONFIG_DEBUG_PAGEALLOC, identity mapping will use small pages.
-	 * This will simplify cpa(), which otherwise needs to support splitting
-	 * large pages into small in interrupt context, etc.
-	 */
-	use_pse = use_gbpages = 0;
-#else
-	use_pse = cpu_has_pse;
-	use_gbpages = direct_gbpages;
-#endif
-
-	/* Enable PSE if available */
-	if (cpu_has_pse)
-		set_in_cr4(X86_CR4_PSE);
-
-	/* Enable PGE if available */
-	if (cpu_has_pge) {
-		set_in_cr4(X86_CR4_PGE);
-		__supported_pte_mask |= _PAGE_GLOBAL;
-	}
-
-	if (use_gbpages)
-		page_size_mask |= 1 << PG_LEVEL_1G;
-	if (use_pse)
-		page_size_mask |= 1 << PG_LEVEL_2M;
-
 	memset(mr, 0, sizeof(mr));
 	nr_range = 0;
 
@@ -267,7 +263,7 @@ unsigned long __init_refok init_memory_mapping(unsigned long start,
 	 * nodes are discovered.
 	 */
 	if (!after_bootmem)
-		find_early_table_space(&mr[0], end, use_pse, use_gbpages);
+		find_early_table_space(&mr[0], end);
 
 	for (i = 0; i < nr_range; i++)
 		ret = kernel_physical_mapping_init(mr[i].start, mr[i].end,
-- 
1.7.7



* [PATCH 02/13] x86, mm: Split out split_mem_range from init_memory_mapping
  2012-09-30  7:57 [PATCH -v4 00/13] x86, mm: init_memory_mapping cleanup Yinghai Lu
  2012-09-30  7:57 ` [PATCH 01/13] x86, mm: Add global page_size_mask and probe one time only Yinghai Lu
@ 2012-09-30  7:57 ` Yinghai Lu
  2012-09-30  7:57 ` [PATCH 03/13] x86, mm: Move init_memory_mapping calling out of setup.c Yinghai Lu
                   ` (10 subsequent siblings)
  12 siblings, 0 replies; 57+ messages in thread
From: Yinghai Lu @ 2012-09-30  7:57 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Jacob Shin, Tejun Heo
  Cc: linux-kernel, Yinghai Lu

Split the range-splitting logic out into split_mem_range(), making
init_memory_mapping() smaller and more readable.

Suggested-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Reviewed-by: Pekka Enberg <penberg@kernel.org>
---
 arch/x86/mm/init.c |   42 ++++++++++++++++++++++++++----------------
 1 files changed, 26 insertions(+), 16 deletions(-)

diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index 7903d54..818b881 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -143,25 +143,13 @@ static int __meminit save_mr(struct map_range *mr, int nr_range,
 	return nr_range;
 }
 
-/*
- * Setup the direct mapping of the physical memory at PAGE_OFFSET.
- * This runs before bootmem is initialized and gets pages directly from
- * the physical memory. To access them they are temporarily mapped.
- */
-unsigned long __init_refok init_memory_mapping(unsigned long start,
-					       unsigned long end)
+static int __meminit split_mem_range(struct map_range *mr, int nr_range,
+				     unsigned long start,
+				     unsigned long end)
 {
 	unsigned long start_pfn, end_pfn;
-	unsigned long ret = 0;
 	unsigned long pos;
-	struct map_range mr[NR_RANGE_MR];
-	int nr_range, i;
-
-	printk(KERN_INFO "init_memory_mapping: [mem %#010lx-%#010lx]\n",
-	       start, end - 1);
-
-	memset(mr, 0, sizeof(mr));
-	nr_range = 0;
+	int i;
 
 	/* head if not big page alignment ? */
 	start_pfn = start >> PAGE_SHIFT;
@@ -255,6 +243,28 @@ unsigned long __init_refok init_memory_mapping(unsigned long start,
 			(mr[i].page_size_mask & (1<<PG_LEVEL_1G))?"1G":(
 			 (mr[i].page_size_mask & (1<<PG_LEVEL_2M))?"2M":"4k"));
 
+	return nr_range;
+}
+
+/*
+ * Setup the direct mapping of the physical memory at PAGE_OFFSET.
+ * This runs before bootmem is initialized and gets pages directly from
+ * the physical memory. To access them they are temporarily mapped.
+ */
+unsigned long __init_refok init_memory_mapping(unsigned long start,
+					       unsigned long end)
+{
+	struct map_range mr[NR_RANGE_MR];
+	unsigned long ret = 0;
+	int nr_range, i;
+
+	pr_info("init_memory_mapping: [mem %#010lx-%#010lx]\n",
+	       start, end - 1);
+
+	memset(mr, 0, sizeof(mr));
+	nr_range = 0;
+	nr_range = split_mem_range(mr, nr_range, start, end);
+
 	/*
 	 * Find space for the kernel direct mapping tables.
 	 *
-- 
1.7.7



* [PATCH 03/13] x86, mm: Move init_memory_mapping calling out of setup.c
  2012-09-30  7:57 [PATCH -v4 00/13] x86, mm: init_memory_mapping cleanup Yinghai Lu
  2012-09-30  7:57 ` [PATCH 01/13] x86, mm: Add global page_size_mask and probe one time only Yinghai Lu
  2012-09-30  7:57 ` [PATCH 02/13] x86, mm: Split out split_mem_range from init_memory_mapping Yinghai Lu
@ 2012-09-30  7:57 ` Yinghai Lu
  2012-09-30  7:57 ` [PATCH 04/13] x86, mm: Revert back good_end setting for 64bit Yinghai Lu
                   ` (9 subsequent siblings)
  12 siblings, 0 replies; 57+ messages in thread
From: Yinghai Lu @ 2012-09-30  7:57 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Jacob Shin, Tejun Heo
  Cc: linux-kernel, Yinghai Lu

init_memory_mapping is currently called two times, and later it will be called
more times for additional RAM ranges.

This lets us put all the related init_memory_mapping calls together in a new
init_mem_mapping() helper.

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Reviewed-by: Pekka Enberg <penberg@kernel.org>
---
 arch/x86/include/asm/init.h    |    1 -
 arch/x86/include/asm/pgtable.h |    2 +-
 arch/x86/kernel/setup.c        |   13 +------------
 arch/x86/mm/init.c             |   19 ++++++++++++++++++-
 4 files changed, 20 insertions(+), 15 deletions(-)

diff --git a/arch/x86/include/asm/init.h b/arch/x86/include/asm/init.h
index adcc0ae..4f13998 100644
--- a/arch/x86/include/asm/init.h
+++ b/arch/x86/include/asm/init.h
@@ -12,7 +12,6 @@ kernel_physical_mapping_init(unsigned long start,
 			     unsigned long end,
 			     unsigned long page_size_mask);
 
-
 extern unsigned long __initdata pgt_buf_start;
 extern unsigned long __meminitdata pgt_buf_end;
 extern unsigned long __meminitdata pgt_buf_top;
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index c6f5779..52d40a1 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -598,7 +598,7 @@ static inline int pgd_none(pgd_t pgd)
 #ifndef __ASSEMBLY__
 
 extern int direct_gbpages;
-void probe_page_size_mask(void);
+void init_mem_mapping(void);
 
 /* local pte updates need not use xchg for locking */
 static inline pte_t native_local_ptep_get_and_clear(pte_t *ptep)
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 20581d7..249384a 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -912,20 +912,9 @@ void __init setup_arch(char **cmdline_p)
 	setup_real_mode();
 
 	init_gbpages();
-	probe_page_size_mask();
 
-	/* max_pfn_mapped is updated here */
-	max_low_pfn_mapped = init_memory_mapping(0, max_low_pfn<<PAGE_SHIFT);
-	max_pfn_mapped = max_low_pfn_mapped;
+	init_mem_mapping();
 
-#ifdef CONFIG_X86_64
-	if (max_pfn > max_low_pfn) {
-		max_pfn_mapped = init_memory_mapping(1UL<<32,
-						     max_pfn<<PAGE_SHIFT);
-		/* can we preseve max_low_pfn ?*/
-		max_low_pfn = max_pfn;
-	}
-#endif
 	memblock.current_limit = get_max_mapped();
 	dma_contiguous_reserve(0);
 
diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index 818b881..9f69180 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -92,7 +92,7 @@ static void __init find_early_table_space(struct map_range *mr,
 		(pgt_buf_top << PAGE_SHIFT) - 1);
 }
 
-void probe_page_size_mask(void)
+static void __init probe_page_size_mask(void)
 {
 #if !defined(CONFIG_DEBUG_PAGEALLOC) && !defined(CONFIG_KMEMCHECK)
 	/*
@@ -312,6 +312,23 @@ unsigned long __init_refok init_memory_mapping(unsigned long start,
 	return ret >> PAGE_SHIFT;
 }
 
+void __init init_mem_mapping(void)
+{
+	probe_page_size_mask();
+
+	/* max_pfn_mapped is updated here */
+	max_low_pfn_mapped = init_memory_mapping(0, max_low_pfn<<PAGE_SHIFT);
+	max_pfn_mapped = max_low_pfn_mapped;
+
+#ifdef CONFIG_X86_64
+	if (max_pfn > max_low_pfn) {
+		max_pfn_mapped = init_memory_mapping(1UL<<32,
+						     max_pfn<<PAGE_SHIFT);
+		/* can we preseve max_low_pfn ?*/
+		max_low_pfn = max_pfn;
+	}
+#endif
+}
 
 /*
  * devmem_is_allowed() checks to see if /dev/mem access to a certain address
-- 
1.7.7



* [PATCH 04/13] x86, mm: Revert back good_end setting for 64bit
  2012-09-30  7:57 [PATCH -v4 00/13] x86, mm: init_memory_mapping cleanup Yinghai Lu
                   ` (2 preceding siblings ...)
  2012-09-30  7:57 ` [PATCH 03/13] x86, mm: Move init_memory_mapping calling out of setup.c Yinghai Lu
@ 2012-09-30  7:57 ` Yinghai Lu
  2012-10-01 11:00   ` Stefano Stabellini
  2012-09-30  7:57 ` [PATCH 05/13] x86, mm: Find early page table buffer altogether Yinghai Lu
                   ` (8 subsequent siblings)
  12 siblings, 1 reply; 57+ messages in thread
From: Yinghai Lu @ 2012-09-30  7:57 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Jacob Shin, Tejun Heo
  Cc: linux-kernel, Yinghai Lu

After

| commit 8548c84da2f47e71bbbe300f55edb768492575f7
| Author: Takashi Iwai <tiwai@suse.de>
| Date:   Sun Oct 23 23:19:12 2011 +0200
|
|    x86: Fix S4 regression
|
|    Commit 4b239f458 ("x86-64, mm: Put early page table high") causes a S4
|    regression since 2.6.39, namely the machine reboots occasionally at S4
|    resume.  It doesn't happen always, overall rate is about 1/20.  But,
|    like other bugs, once when this happens, it continues to happen.
|
|    This patch fixes the problem by essentially reverting the memory
|    assignment in the older way.

we have page tables sitting around 512M again, which prevents kdump from
finding a 512M region under 768M.

We need to revert that revert, so we can put the page tables high again for
64bit.

Takashi agreed that the S4 regression could be caused by something else:

	https://lkml.org/lkml/2012/6/15/182
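
For illustration, the buffer placement in find_early_table_space() effectively
becomes (sketch of the code after this patch; good_end is initialized to end):

#ifdef CONFIG_X86_32
	/* 32-bit keeps the buffer below the already-mapped area */
	good_end = max_pfn_mapped << PAGE_SHIFT;
#endif
	/* on 64-bit, good_end stays equal to end, so the buffer can go high */
	base = memblock_find_in_range(start, good_end, tables, PAGE_SIZE);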

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
---
 arch/x86/mm/init.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index 9f69180..aadb154 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -76,8 +76,8 @@ static void __init find_early_table_space(struct map_range *mr,
 #ifdef CONFIG_X86_32
 	/* for fixmap */
 	tables += roundup(__end_of_fixed_addresses * sizeof(pte_t), PAGE_SIZE);
-#endif
 	good_end = max_pfn_mapped << PAGE_SHIFT;
+#endif
 
 	base = memblock_find_in_range(start, good_end, tables, PAGE_SIZE);
 	if (!base)
-- 
1.7.7



* [PATCH 05/13] x86, mm: Find early page table buffer altogether
  2012-09-30  7:57 [PATCH -v4 00/13] x86, mm: init_memory_mapping cleanup Yinghai Lu
                   ` (3 preceding siblings ...)
  2012-09-30  7:57 ` [PATCH 04/13] x86, mm: Revert back good_end setting for 64bit Yinghai Lu
@ 2012-09-30  7:57 ` Yinghai Lu
  2012-09-30  7:57 ` [PATCH 06/13] x86, mm: Separate out calculate_table_space_size() Yinghai Lu
                   ` (7 subsequent siblings)
  12 siblings, 0 replies; 57+ messages in thread
From: Yinghai Lu @ 2012-09-30  7:57 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Jacob Shin, Tejun Heo
  Cc: linux-kernel, Yinghai Lu

We should not find the early page table buffer in every call of
init_memory_mapping; do it once from init_mem_mapping() instead.

At the same time, early_memtest needs to be moved down, and the after_bootmem
checks can be removed.

-v2: fix early_memtest on 32bit by passing max_pfn_mapped instead.
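
Roughly, the buffer is now found once from init_mem_mapping() (sketch,
condensed from the diff below):

	probe_page_size_mask();

#ifdef CONFIG_X86_64
	find_early_table_space(0, max_pfn << PAGE_SHIFT);
#else
	find_early_table_space(0, max_low_pfn << PAGE_SHIFT);
#endif
	/* ... init_memory_mapping() for lowmem and, on 64-bit, highmem ... */
	early_memtest(0, max_pfn_mapped << PAGE_SHIFT);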

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
---
 arch/x86/mm/init.c |   72 ++++++++++++++++++++++++++-------------------------
 1 files changed, 37 insertions(+), 35 deletions(-)

diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index aadb154..d364f6a 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -37,7 +37,7 @@ struct map_range {
 
 static int page_size_mask;
 
-static void __init find_early_table_space(struct map_range *mr,
+static void __init find_early_table_space(unsigned long begin,
 					  unsigned long end)
 {
 	unsigned long puds, pmds, ptes, tables, start = 0, good_end = end;
@@ -64,8 +64,8 @@ static void __init find_early_table_space(struct map_range *mr,
 		extra += PMD_SIZE;
 #endif
 		/* The first 2/4M doesn't use large pages. */
-		if (mr->start < PMD_SIZE)
-			extra += mr->end - mr->start;
+		if (begin < PMD_SIZE)
+			extra += (PMD_SIZE - begin) >> PAGE_SHIFT;
 
 		ptes = (extra + PAGE_SIZE - 1) >> PAGE_SHIFT;
 	} else
@@ -265,16 +265,6 @@ unsigned long __init_refok init_memory_mapping(unsigned long start,
 	nr_range = 0;
 	nr_range = split_mem_range(mr, nr_range, start, end);
 
-	/*
-	 * Find space for the kernel direct mapping tables.
-	 *
-	 * Later we should allocate these tables in the local node of the
-	 * memory mapped. Unfortunately this is done currently before the
-	 * nodes are discovered.
-	 */
-	if (!after_bootmem)
-		find_early_table_space(&mr[0], end);
-
 	for (i = 0; i < nr_range; i++)
 		ret = kernel_physical_mapping_init(mr[i].start, mr[i].end,
 						   mr[i].page_size_mask);
@@ -287,6 +277,36 @@ unsigned long __init_refok init_memory_mapping(unsigned long start,
 
 	__flush_tlb_all();
 
+	return ret >> PAGE_SHIFT;
+}
+
+void __init init_mem_mapping(void)
+{
+	probe_page_size_mask();
+
+	/*
+	 * Find space for the kernel direct mapping tables.
+	 *
+	 * Later we should allocate these tables in the local node of the
+	 * memory mapped. Unfortunately this is done currently before the
+	 * nodes are discovered.
+	 */
+#ifdef CONFIG_X86_64
+	find_early_table_space(0, max_pfn<<PAGE_SHIFT);
+#else
+	find_early_table_space(0, max_low_pfn<<PAGE_SHIFT);
+#endif
+	max_low_pfn_mapped = init_memory_mapping(0, max_low_pfn<<PAGE_SHIFT);
+	max_pfn_mapped = max_low_pfn_mapped;
+
+#ifdef CONFIG_X86_64
+	if (max_pfn > max_low_pfn) {
+		max_pfn_mapped = init_memory_mapping(1UL<<32,
+						     max_pfn<<PAGE_SHIFT);
+		/* can we preseve max_low_pfn ?*/
+		max_low_pfn = max_pfn;
+	}
+#endif
 	/*
 	 * Reserve the kernel pagetable pages we used (pgt_buf_start -
 	 * pgt_buf_end) and free the other ones (pgt_buf_end - pgt_buf_top)
@@ -302,32 +322,14 @@ unsigned long __init_refok init_memory_mapping(unsigned long start,
 	 * RO all the pagetable pages, including the ones that are beyond
 	 * pgt_buf_end at that time.
 	 */
-	if (!after_bootmem && pgt_buf_end > pgt_buf_start)
+	if (pgt_buf_end > pgt_buf_start)
 		x86_init.mapping.pagetable_reserve(PFN_PHYS(pgt_buf_start),
 				PFN_PHYS(pgt_buf_end));
 
-	if (!after_bootmem)
-		early_memtest(start, end);
+	/* stop the wrong using */
+	pgt_buf_top = 0;
 
-	return ret >> PAGE_SHIFT;
-}
-
-void __init init_mem_mapping(void)
-{
-	probe_page_size_mask();
-
-	/* max_pfn_mapped is updated here */
-	max_low_pfn_mapped = init_memory_mapping(0, max_low_pfn<<PAGE_SHIFT);
-	max_pfn_mapped = max_low_pfn_mapped;
-
-#ifdef CONFIG_X86_64
-	if (max_pfn > max_low_pfn) {
-		max_pfn_mapped = init_memory_mapping(1UL<<32,
-						     max_pfn<<PAGE_SHIFT);
-		/* can we preseve max_low_pfn ?*/
-		max_low_pfn = max_pfn;
-	}
-#endif
+	early_memtest(0, max_pfn_mapped << PAGE_SHIFT);
 }
 
 /*
-- 
1.7.7



* [PATCH 06/13] x86, mm: Separate out calculate_table_space_size()
  2012-09-30  7:57 [PATCH -v4 00/13] x86, mm: init_memory_mapping cleanup Yinghai Lu
                   ` (4 preceding siblings ...)
  2012-09-30  7:57 ` [PATCH 05/13] x86, mm: Find early page table buffer altogether Yinghai Lu
@ 2012-09-30  7:57 ` Yinghai Lu
  2012-09-30  7:57 ` [PATCH 07/13] x86, mm: Move down two calculate_table_space_size down Yinghai Lu
                   ` (6 subsequent siblings)
  12 siblings, 0 replies; 57+ messages in thread
From: Yinghai Lu @ 2012-09-30  7:57 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Jacob Shin, Tejun Heo
  Cc: linux-kernel, Yinghai Lu

calculate_table_space_size() should take the physical address range that needs
to be mapped, while find_early_table_space() should take the range that the
page table buffer should be placed in.

Separating the page table size calculation from finding the early page table
buffer reduces confusion.
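
After the split, the caller does the two steps explicitly (sketch, taken from
the init_mem_mapping() hunk below):

	tables = calculate_table_space_size(0, end);	/* how much space is needed   */
	find_early_table_space(0, good_end, tables);	/* where the pgt buffer lives */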

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Reviewed-by: Pekka Enberg <penberg@kernel.org>
---
 arch/x86/mm/init.c |   39 ++++++++++++++++++++++++++++-----------
 1 files changed, 28 insertions(+), 11 deletions(-)

diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index d364f6a..dc05416 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -37,11 +37,10 @@ struct map_range {
 
 static int page_size_mask;
 
-static void __init find_early_table_space(unsigned long begin,
+static unsigned long __init calculate_table_space_size(unsigned long begin,
 					  unsigned long end)
 {
-	unsigned long puds, pmds, ptes, tables, start = 0, good_end = end;
-	phys_addr_t base;
+	unsigned long puds, pmds, ptes, tables;
 
 	puds = (end + PUD_SIZE - 1) >> PUD_SHIFT;
 	tables = roundup(puds * sizeof(pud_t), PAGE_SIZE);
@@ -76,9 +75,17 @@ static void __init find_early_table_space(unsigned long begin,
 #ifdef CONFIG_X86_32
 	/* for fixmap */
 	tables += roundup(__end_of_fixed_addresses * sizeof(pte_t), PAGE_SIZE);
-	good_end = max_pfn_mapped << PAGE_SHIFT;
 #endif
 
+	return tables;
+}
+
+static void __init find_early_table_space(unsigned long start,
+					  unsigned long good_end,
+					  unsigned long tables)
+{
+	phys_addr_t base;
+
 	base = memblock_find_in_range(start, good_end, tables, PAGE_SIZE);
 	if (!base)
 		panic("Cannot find space for the kernel page tables");
@@ -86,10 +93,6 @@ static void __init find_early_table_space(unsigned long begin,
 	pgt_buf_start = base >> PAGE_SHIFT;
 	pgt_buf_end = pgt_buf_start;
 	pgt_buf_top = pgt_buf_start + (tables >> PAGE_SHIFT);
-
-	printk(KERN_DEBUG "kernel direct mapping tables up to %#lx @ [mem %#010lx-%#010lx]\n",
-		end - 1, pgt_buf_start << PAGE_SHIFT,
-		(pgt_buf_top << PAGE_SHIFT) - 1);
 }
 
 static void __init probe_page_size_mask(void)
@@ -282,6 +285,8 @@ unsigned long __init_refok init_memory_mapping(unsigned long start,
 
 void __init init_mem_mapping(void)
 {
+	unsigned long tables, good_end, end;
+
 	probe_page_size_mask();
 
 	/*
@@ -292,10 +297,18 @@ void __init init_mem_mapping(void)
 	 * nodes are discovered.
 	 */
 #ifdef CONFIG_X86_64
-	find_early_table_space(0, max_pfn<<PAGE_SHIFT);
+	end = max_pfn << PAGE_SHIFT;
+	good_end = end;
 #else
-	find_early_table_space(0, max_low_pfn<<PAGE_SHIFT);
+	end = max_low_pfn << PAGE_SHIFT;
+	good_end = max_pfn_mapped << PAGE_SHIFT;
 #endif
+	tables = calculate_table_space_size(0, end);
+	find_early_table_space(0, good_end, tables);
+	printk(KERN_DEBUG "kernel direct mapping tables up to %#lx @ [mem %#010lx-%#010lx] prealloc\n",
+		end - 1, pgt_buf_start << PAGE_SHIFT,
+		(pgt_buf_top << PAGE_SHIFT) - 1);
+
 	max_low_pfn_mapped = init_memory_mapping(0, max_low_pfn<<PAGE_SHIFT);
 	max_pfn_mapped = max_low_pfn_mapped;
 
@@ -322,9 +335,13 @@ void __init init_mem_mapping(void)
 	 * RO all the pagetable pages, including the ones that are beyond
 	 * pgt_buf_end at that time.
 	 */
-	if (pgt_buf_end > pgt_buf_start)
+	if (pgt_buf_end > pgt_buf_start) {
+		printk(KERN_DEBUG "kernel direct mapping tables up to %#lx @ [mem %#010lx-%#010lx] final\n",
+			end - 1, pgt_buf_start << PAGE_SHIFT,
+			(pgt_buf_end << PAGE_SHIFT) - 1);
 		x86_init.mapping.pagetable_reserve(PFN_PHYS(pgt_buf_start),
 				PFN_PHYS(pgt_buf_end));
+	}
 
 	/* stop the wrong using */
 	pgt_buf_top = 0;
-- 
1.7.7



* [PATCH 07/13] x86, mm: Move down two calculate_table_space_size down.
  2012-09-30  7:57 [PATCH -v4 00/13] x86, mm: init_memory_mapping cleanup Yinghai Lu
                   ` (5 preceding siblings ...)
  2012-09-30  7:57 ` [PATCH 06/13] x86, mm: Separate out calculate_table_space_size() Yinghai Lu
@ 2012-09-30  7:57 ` Yinghai Lu
  2012-09-30  7:57 ` [PATCH 08/13] x86, mm: Set memblock initial limit to 1M Yinghai Lu
                   ` (5 subsequent siblings)
  12 siblings, 0 replies; 57+ messages in thread
From: Yinghai Lu @ 2012-09-30  7:57 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Jacob Shin, Tejun Heo
  Cc: linux-kernel, Yinghai Lu

Move them down so that calculate_table_space_size() can later call
split_mem_range().

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Reviewed-by: Pekka Enberg <penberg@kernel.org>
---
 arch/x86/mm/init.c |  116 ++++++++++++++++++++++++++--------------------------
 1 files changed, 58 insertions(+), 58 deletions(-)

diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index dc05416..d4b40d4 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -37,64 +37,6 @@ struct map_range {
 
 static int page_size_mask;
 
-static unsigned long __init calculate_table_space_size(unsigned long begin,
-					  unsigned long end)
-{
-	unsigned long puds, pmds, ptes, tables;
-
-	puds = (end + PUD_SIZE - 1) >> PUD_SHIFT;
-	tables = roundup(puds * sizeof(pud_t), PAGE_SIZE);
-
-	if (page_size_mask & (1 << PG_LEVEL_1G)) {
-		unsigned long extra;
-
-		extra = end - ((end>>PUD_SHIFT) << PUD_SHIFT);
-		pmds = (extra + PMD_SIZE - 1) >> PMD_SHIFT;
-	} else
-		pmds = (end + PMD_SIZE - 1) >> PMD_SHIFT;
-
-	tables += roundup(pmds * sizeof(pmd_t), PAGE_SIZE);
-
-	if (page_size_mask & (1 << PG_LEVEL_2M)) {
-		unsigned long extra;
-
-		extra = end - ((end>>PMD_SHIFT) << PMD_SHIFT);
-#ifdef CONFIG_X86_32
-		extra += PMD_SIZE;
-#endif
-		/* The first 2/4M doesn't use large pages. */
-		if (begin < PMD_SIZE)
-			extra += (PMD_SIZE - begin) >> PAGE_SHIFT;
-
-		ptes = (extra + PAGE_SIZE - 1) >> PAGE_SHIFT;
-	} else
-		ptes = (end + PAGE_SIZE - 1) >> PAGE_SHIFT;
-
-	tables += roundup(ptes * sizeof(pte_t), PAGE_SIZE);
-
-#ifdef CONFIG_X86_32
-	/* for fixmap */
-	tables += roundup(__end_of_fixed_addresses * sizeof(pte_t), PAGE_SIZE);
-#endif
-
-	return tables;
-}
-
-static void __init find_early_table_space(unsigned long start,
-					  unsigned long good_end,
-					  unsigned long tables)
-{
-	phys_addr_t base;
-
-	base = memblock_find_in_range(start, good_end, tables, PAGE_SIZE);
-	if (!base)
-		panic("Cannot find space for the kernel page tables");
-
-	pgt_buf_start = base >> PAGE_SHIFT;
-	pgt_buf_end = pgt_buf_start;
-	pgt_buf_top = pgt_buf_start + (tables >> PAGE_SHIFT);
-}
-
 static void __init probe_page_size_mask(void)
 {
 #if !defined(CONFIG_DEBUG_PAGEALLOC) && !defined(CONFIG_KMEMCHECK)
@@ -249,6 +191,64 @@ static int __meminit split_mem_range(struct map_range *mr, int nr_range,
 	return nr_range;
 }
 
+static unsigned long __init calculate_table_space_size(unsigned long begin,
+					  unsigned long end)
+{
+	unsigned long puds, pmds, ptes, tables;
+
+	puds = (end + PUD_SIZE - 1) >> PUD_SHIFT;
+	tables = roundup(puds * sizeof(pud_t), PAGE_SIZE);
+
+	if (page_size_mask & (1 << PG_LEVEL_1G)) {
+		unsigned long extra;
+
+		extra = end - ((end>>PUD_SHIFT) << PUD_SHIFT);
+		pmds = (extra + PMD_SIZE - 1) >> PMD_SHIFT;
+	} else
+		pmds = (end + PMD_SIZE - 1) >> PMD_SHIFT;
+
+	tables += roundup(pmds * sizeof(pmd_t), PAGE_SIZE);
+
+	if (page_size_mask & (1 << PG_LEVEL_2M)) {
+		unsigned long extra;
+
+		extra = end - ((end>>PMD_SHIFT) << PMD_SHIFT);
+#ifdef CONFIG_X86_32
+		extra += PMD_SIZE;
+#endif
+		/* The first 2/4M doesn't use large pages. */
+		if (begin < PMD_SIZE)
+			extra += (PMD_SIZE - begin) >> PAGE_SHIFT;
+
+		ptes = (extra + PAGE_SIZE - 1) >> PAGE_SHIFT;
+	} else
+		ptes = (end + PAGE_SIZE - 1) >> PAGE_SHIFT;
+
+	tables += roundup(ptes * sizeof(pte_t), PAGE_SIZE);
+
+#ifdef CONFIG_X86_32
+	/* for fixmap */
+	tables += roundup(__end_of_fixed_addresses * sizeof(pte_t), PAGE_SIZE);
+#endif
+
+	return tables;
+}
+
+static void __init find_early_table_space(unsigned long start,
+					  unsigned long good_end,
+					  unsigned long tables)
+{
+	phys_addr_t base;
+
+	base = memblock_find_in_range(start, good_end, tables, PAGE_SIZE);
+	if (!base)
+		panic("Cannot find space for the kernel page tables");
+
+	pgt_buf_start = base >> PAGE_SHIFT;
+	pgt_buf_end = pgt_buf_start;
+	pgt_buf_top = pgt_buf_start + (tables >> PAGE_SHIFT);
+}
+
 /*
  * Setup the direct mapping of the physical memory at PAGE_OFFSET.
  * This runs before bootmem is initialized and gets pages directly from
-- 
1.7.7



* [PATCH 08/13] x86, mm: Set memblock initial limit to 1M
  2012-09-30  7:57 [PATCH -v4 00/13] x86, mm: init_memory_mapping cleanup Yinghai Lu
                   ` (6 preceding siblings ...)
  2012-09-30  7:57 ` [PATCH 07/13] x86, mm: Move down two calculate_table_space_size down Yinghai Lu
@ 2012-09-30  7:57 ` Yinghai Lu
  2012-09-30  7:57 ` [PATCH 09/13] x86: if kernel .text .data .bss are not marked as E820_RAM, complain and fix Yinghai Lu
                   ` (4 subsequent siblings)
  12 siblings, 0 replies; 57+ messages in thread
From: Yinghai Lu @ 2012-09-30  7:57 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Jacob Shin, Tejun Heo
  Cc: linux-kernel, Yinghai Lu

memblock_x86_fill() may need to double the memblock memory array.
With the limit at get_max_mapped() (max_pfn_mapped, around 512M at this point),
the doubled array could end up around 512M, so kdump will not be able to get a
big range (like 512M) under 1024M.

Put the limit down under 1M instead; the array only needs about 4k or so.

We also need this when we later map only the RAM ranges: if the early doubling
places the array near 512M while the first several ranges mapped by
init_mem_mapping() are below 512M, then after the early mapping is reset we
would lose access to the memblock memory array -- yet we are still using that
array for iteration.
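
For reference (not part of the patch), ISA_END_ADDRESS is 1M (0x100000), so
early memblock allocations stay below 1M until the direct mapping is set up,
and the limit is raised again right after init_mem_mapping():

	memblock.current_limit = ISA_END_ADDRESS;  /* keep early allocations under 1M   */
	memblock_x86_fill();                       /* may double the memory array (~4k) */
	/* ... */
	init_mem_mapping();
	memblock.current_limit = get_max_mapped(); /* allocations may go high again     */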

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
---
 arch/x86/kernel/setup.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 249384a..9db2922 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -889,7 +889,7 @@ void __init setup_arch(char **cmdline_p)
 
 	cleanup_highmap();
 
-	memblock.current_limit = get_max_mapped();
+	memblock.current_limit = ISA_END_ADDRESS;
 	memblock_x86_fill();
 
 	/*
-- 
1.7.7



* [PATCH 09/13] x86: if kernel .text .data .bss are not marked as E820_RAM, complain and fix
  2012-09-30  7:57 [PATCH -v4 00/13] x86, mm: init_memory_mapping cleanup Yinghai Lu
                   ` (7 preceding siblings ...)
  2012-09-30  7:57 ` [PATCH 08/13] x86, mm: Set memblock initial limit to 1M Yinghai Lu
@ 2012-09-30  7:57 ` Yinghai Lu
  2012-09-30  7:57 ` [PATCH 10/13] x86: Fixup code testing if a pfn is direct mapped Yinghai Lu
                   ` (3 subsequent siblings)
  12 siblings, 0 replies; 57+ messages in thread
From: Yinghai Lu @ 2012-09-30  7:57 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Jacob Shin, Tejun Heo
  Cc: linux-kernel, Yinghai Lu

From: Jacob Shin <jacob.shin@amd.com>

There could be cases where user supplied memmap=exactmap memory
mappings do not mark the region where the kernel .text .data and
.bss reside as E820_RAM, as reported here:

https://lkml.org/lkml/2012/8/14/86

Handle it by complaining, and adding the range back into the e820.

Signed-off-by: Jacob Shin <jacob.shin@amd.com>
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Reviewed-by: Pekka Enberg <penberg@kernel.org>
---
 arch/x86/kernel/setup.c |   14 ++++++++++++++
 1 files changed, 14 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 9db2922..d3da223 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -831,6 +831,20 @@ void __init setup_arch(char **cmdline_p)
 	insert_resource(&iomem_resource, &data_resource);
 	insert_resource(&iomem_resource, &bss_resource);
 
+	/*
+	 * Complain if .text .data and .bss are not marked as E820_RAM and
+	 * attempt to fix it by adding the range. We may have a confused BIOS,
+	 * or the user may have incorrectly supplied it via memmap=exactmap. If
+	 * we really are running on top non-RAM, we will crash later anyways.
+	 */
+	if (!e820_all_mapped(code_resource.start, __pa(__brk_limit), E820_RAM)) {
+		pr_warn(".text .data .bss are not marked as E820_RAM!\n");
+
+		e820_add_region(code_resource.start,
+				__pa(__brk_limit) - code_resource.start + 1,
+				E820_RAM);
+	}
+
 	trim_bios_range();
 #ifdef CONFIG_X86_32
 	if (ppro_with_ram_bug()) {
-- 
1.7.7



* [PATCH 10/13] x86: Fixup code testing if a pfn is direct mapped
  2012-09-30  7:57 [PATCH -v4 00/13] x86, mm: init_memory_mapping cleanup Yinghai Lu
                   ` (8 preceding siblings ...)
  2012-09-30  7:57 ` [PATCH 09/13] x86: if kernel .text .data .bss are not marked as E820_RAM, complain and fix Yinghai Lu
@ 2012-09-30  7:57 ` Yinghai Lu
  2012-09-30  7:57 ` [PATCH 11/13] x86: Only direct map addresses that are marked as E820_RAM Yinghai Lu
                   ` (2 subsequent siblings)
  12 siblings, 0 replies; 57+ messages in thread
From: Yinghai Lu @ 2012-09-30  7:57 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Jacob Shin, Tejun Heo
  Cc: linux-kernel, Yinghai Lu

From: Jacob Shin <jacob.shin@amd.com>

Update code that previously assumed pfns [ 0 - max_low_pfn_mapped ) and
[ 4GB - max_pfn_mapped ) were always direct mapped, to now look up the
pfn_mapped ranges instead.

-v2: change the applying sequence to keep git bisect working, so add a
     dummy pfn_range_is_mapped() for now. - Yinghai Lu
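
For illustration, callers now use the helper instead of open-coding the pfn
range checks (sketch based on the efi.c hunk below):

	if (pfn_range_is_mapped(start_pfn, end_pfn))
		va = __va(md->phys_addr);	/* already covered by the direct mapping */
	else
		va = efi_ioremap(md->phys_addr, size, md->type);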

Signed-off-by: Jacob Shin <jacob.shin@amd.com>
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
---
 arch/x86/include/asm/page_types.h |    8 ++++++++
 arch/x86/kernel/cpu/amd.c         |    8 +++-----
 arch/x86/platform/efi/efi.c       |    8 ++++----
 3 files changed, 15 insertions(+), 9 deletions(-)

diff --git a/arch/x86/include/asm/page_types.h b/arch/x86/include/asm/page_types.h
index e21fdd1..45aae6e 100644
--- a/arch/x86/include/asm/page_types.h
+++ b/arch/x86/include/asm/page_types.h
@@ -51,6 +51,14 @@ static inline phys_addr_t get_max_mapped(void)
 	return (phys_addr_t)max_pfn_mapped << PAGE_SHIFT;
 }
 
+static inline bool pfn_range_is_mapped(unsigned long start_pfn,
+					unsigned long end_pfn)
+{
+	return end_pfn <= max_low_pfn_mapped ||
+	       (end_pfn > (1UL << (32 - PAGE_SHIFT)) &&
+		end_pfn <= max_pfn_mapped);
+}
+
 extern unsigned long init_memory_mapping(unsigned long start,
 					 unsigned long end);
 
diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
index f7e98a2..9619ba6 100644
--- a/arch/x86/kernel/cpu/amd.c
+++ b/arch/x86/kernel/cpu/amd.c
@@ -676,12 +676,10 @@ static void __cpuinit init_amd(struct cpuinfo_x86 *c)
 		 * benefit in doing so.
 		 */
 		if (!rdmsrl_safe(MSR_K8_TSEG_ADDR, &tseg)) {
+			unsigned long pfn = tseg >> PAGE_SHIFT;
+
 			printk(KERN_DEBUG "tseg: %010llx\n", tseg);
-			if ((tseg>>PMD_SHIFT) <
-				(max_low_pfn_mapped>>(PMD_SHIFT-PAGE_SHIFT)) ||
-				((tseg>>PMD_SHIFT) <
-				(max_pfn_mapped>>(PMD_SHIFT-PAGE_SHIFT)) &&
-				(tseg>>PMD_SHIFT) >= (1ULL<<(32 - PMD_SHIFT))))
+			if (pfn_range_is_mapped(pfn, pfn + 1))
 				set_memory_4k((unsigned long)__va(tseg), 1);
 		}
 	}
diff --git a/arch/x86/platform/efi/efi.c b/arch/x86/platform/efi/efi.c
index f8a30da..4e5320c 100644
--- a/arch/x86/platform/efi/efi.c
+++ b/arch/x86/platform/efi/efi.c
@@ -823,7 +823,7 @@ void __init efi_enter_virtual_mode(void)
 	efi_memory_desc_t *md, *prev_md = NULL;
 	efi_status_t status;
 	unsigned long size;
-	u64 end, systab, addr, npages, end_pfn;
+	u64 end, systab, addr, npages, start_pfn, end_pfn;
 	void *p, *va, *new_memmap = NULL;
 	int count = 0;
 
@@ -876,10 +876,10 @@ void __init efi_enter_virtual_mode(void)
 		size = md->num_pages << EFI_PAGE_SHIFT;
 		end = md->phys_addr + size;
 
+		start_pfn = PFN_DOWN(md->phys_addr);
 		end_pfn = PFN_UP(end);
-		if (end_pfn <= max_low_pfn_mapped
-		    || (end_pfn > (1UL << (32 - PAGE_SHIFT))
-			&& end_pfn <= max_pfn_mapped))
+
+		if (pfn_range_is_mapped(start_pfn, end_pfn))
 			va = __va(md->phys_addr);
 		else
 			va = efi_ioremap(md->phys_addr, size, md->type);
-- 
1.7.7



* [PATCH 11/13] x86: Only direct map addresses that are marked as E820_RAM
  2012-09-30  7:57 [PATCH -v4 00/13] x86, mm: init_memory_mapping cleanup Yinghai Lu
                   ` (9 preceding siblings ...)
  2012-09-30  7:57 ` [PATCH 10/13] x86: Fixup code testing if a pfn is direct mapped Yinghai Lu
@ 2012-09-30  7:57 ` Yinghai Lu
  2012-09-30  7:57 ` [PATCH 12/13] x86/mm: calculate_table_space_size based on memory ranges that are being mapped Yinghai Lu
  2012-09-30  7:57 ` [PATCH 13/13] x86, mm: Use func pointer to table size calculation and mapping Yinghai Lu
  12 siblings, 0 replies; 57+ messages in thread
From: Yinghai Lu @ 2012-09-30  7:57 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Jacob Shin, Tejun Heo
  Cc: linux-kernel, Yinghai Lu

From: Jacob Shin <jacob.shin@amd.com>

Currently direct mappings are created for [ 0 to max_low_pfn<<PAGE_SHIFT )
and [ 4GB to max_pfn<<PAGE_SHIFT ), which may include regions that are not
backed by actual DRAM. This is fine for holes under 4GB, which are covered
by fixed and variable range MTRRs and marked UC. However, we run into trouble
on higher memory addresses which cannot be covered by MTRRs.

Our system with 1TB of RAM has an e820 that looks like this:

 BIOS-e820: [mem 0x0000000000000000-0x00000000000983ff] usable
 BIOS-e820: [mem 0x0000000000098400-0x000000000009ffff] reserved
 BIOS-e820: [mem 0x00000000000d0000-0x00000000000fffff] reserved
 BIOS-e820: [mem 0x0000000000100000-0x00000000c7ebffff] usable
 BIOS-e820: [mem 0x00000000c7ec0000-0x00000000c7ed7fff] ACPI data
 BIOS-e820: [mem 0x00000000c7ed8000-0x00000000c7ed9fff] ACPI NVS
 BIOS-e820: [mem 0x00000000c7eda000-0x00000000c7ffffff] reserved
 BIOS-e820: [mem 0x00000000fec00000-0x00000000fec0ffff] reserved
 BIOS-e820: [mem 0x00000000fee00000-0x00000000fee00fff] reserved
 BIOS-e820: [mem 0x00000000fff00000-0x00000000ffffffff] reserved
 BIOS-e820: [mem 0x0000000100000000-0x000000e037ffffff] usable
 BIOS-e820: [mem 0x000000e038000000-0x000000fcffffffff] reserved
 BIOS-e820: [mem 0x0000010000000000-0x0000011ffeffffff] usable

and so direct mappings are created for the huge memory hole between
0x000000e038000000 and 0x0000010000000000. Even though the kernel never
generates memory accesses in that region, since the page tables incorrectly
mark it as WB, our (AMD) processor ends up causing an MCE while doing some
memory bookkeeping/optimizations around that area.

This patch iterates through e820 and only direct maps ranges that are
marked as E820_RAM, and keeps track of those pfn ranges. Depending on
the alignment of E820 ranges, this may possibly result in using smaller
size (i.e. 4K instead of 2M or 1G) page tables.

-v2: move changes from setup.c to mm/init.c, also use for_each_mem_pfn_range
	instead.  - Yinghai Lu
-v3: add calculate_all_table_space_size() to get correct needed page table
	size. - Yinghai Lu
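
As an illustration (the pfn values are made up, not from the patch): with the
e820 map above, only E820_RAM ranges end up in pfn_mapped, so after
init_mem_mapping():

	/* pfns 0x100-0x200 (1M-2M) lie in the usable 0x100000-0xc7ebffff range */
	pfn_range_is_mapped(0x100, 0x200);		/* -> true */

	/* pfns around 0xf000000 (~960G) fall in the non-RAM gap above 0xe038000000 */
	pfn_range_is_mapped(0xf000000, 0xf000010);	/* -> false: not direct mapped anymore */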

Signed-off-by: Jacob Shin <jacob.shin@amd.com>
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Reviewed-by: Pekka Enberg <penberg@kernel.org>
---
 arch/x86/include/asm/page_types.h |    8 +--
 arch/x86/kernel/setup.c           |    8 ++-
 arch/x86/mm/init.c                |  119 +++++++++++++++++++++++++++++++++----
 arch/x86/mm/init_64.c             |    6 +-
 4 files changed, 116 insertions(+), 25 deletions(-)

diff --git a/arch/x86/include/asm/page_types.h b/arch/x86/include/asm/page_types.h
index 45aae6e..54c9787 100644
--- a/arch/x86/include/asm/page_types.h
+++ b/arch/x86/include/asm/page_types.h
@@ -51,13 +51,7 @@ static inline phys_addr_t get_max_mapped(void)
 	return (phys_addr_t)max_pfn_mapped << PAGE_SHIFT;
 }
 
-static inline bool pfn_range_is_mapped(unsigned long start_pfn,
-					unsigned long end_pfn)
-{
-	return end_pfn <= max_low_pfn_mapped ||
-	       (end_pfn > (1UL << (32 - PAGE_SHIFT)) &&
-		end_pfn <= max_pfn_mapped);
-}
+bool pfn_range_is_mapped(unsigned long start_pfn, unsigned long end_pfn);
 
 extern unsigned long init_memory_mapping(unsigned long start,
 					 unsigned long end);
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index d3da223..4989f80 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -115,9 +115,11 @@
 #include <asm/prom.h>
 
 /*
- * end_pfn only includes RAM, while max_pfn_mapped includes all e820 entries.
- * The direct mapping extends to max_pfn_mapped, so that we can directly access
- * apertures, ACPI and other tables without having to play with fixmaps.
+ * max_low_pfn_mapped: highest direct mapped pfn under 4GB
+ * max_pfn_mapped:     highest direct mapped pfn over 4GB
+ *
+ * The direct mapping only covers E820_RAM regions, so the ranges and gaps are
+ * represented by pfn_mapped
  */
 unsigned long max_low_pfn_mapped;
 unsigned long max_pfn_mapped;
diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index d4b40d4..3237c9b 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -234,6 +234,38 @@ static unsigned long __init calculate_table_space_size(unsigned long begin,
 	return tables;
 }
 
+static unsigned long __init calculate_all_table_space_size(void)
+{
+	unsigned long start_pfn, end_pfn;
+	unsigned long tables;
+	int i;
+
+	/* the ISA range is always mapped regardless of memory holes */
+	tables = calculate_table_space_size(0, ISA_END_ADDRESS);
+
+	for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, NULL) {
+		u64 start = start_pfn << PAGE_SHIFT;
+		u64 end = end_pfn << PAGE_SHIFT;
+
+		if (end <= ISA_END_ADDRESS)
+			continue;
+
+		if (start < ISA_END_ADDRESS)
+			start = ISA_END_ADDRESS;
+#ifdef CONFIG_X86_32
+		/* on 32 bit, we only map up to max_low_pfn */
+		if ((start >> PAGE_SHIFT) >= max_low_pfn)
+			continue;
+
+		if ((end >> PAGE_SHIFT) > max_low_pfn)
+			end = max_low_pfn << PAGE_SHIFT;
+#endif
+		tables += calculate_table_space_size(start, end);
+	}
+
+	return tables;
+}
+
 static void __init find_early_table_space(unsigned long start,
 					  unsigned long good_end,
 					  unsigned long tables)
@@ -249,6 +281,33 @@ static void __init find_early_table_space(unsigned long start,
 	pgt_buf_top = pgt_buf_start + (tables >> PAGE_SHIFT);
 }
 
+static struct range pfn_mapped[E820_X_MAX];
+static int nr_pfn_mapped;
+
+static void add_pfn_range_mapped(unsigned long start_pfn, unsigned long end_pfn)
+{
+	nr_pfn_mapped = add_range_with_merge(pfn_mapped, E820_X_MAX,
+					     nr_pfn_mapped, start_pfn, end_pfn);
+	nr_pfn_mapped = clean_sort_range(pfn_mapped, E820_X_MAX);
+
+	max_pfn_mapped = max(max_pfn_mapped, end_pfn);
+
+	if (end_pfn <= (1UL << (32 - PAGE_SHIFT)))
+		max_low_pfn_mapped = max(max_low_pfn_mapped, end_pfn);
+}
+
+bool pfn_range_is_mapped(unsigned long start_pfn, unsigned long end_pfn)
+{
+	int i;
+
+	for (i = 0; i < nr_pfn_mapped; i++)
+		if ((start_pfn >= pfn_mapped[i].start) &&
+		    (end_pfn <= pfn_mapped[i].end))
+			return true;
+
+	return false;
+}
+
 /*
  * Setup the direct mapping of the physical memory at PAGE_OFFSET.
  * This runs before bootmem is initialized and gets pages directly from
@@ -280,9 +339,55 @@ unsigned long __init_refok init_memory_mapping(unsigned long start,
 
 	__flush_tlb_all();
 
+	add_pfn_range_mapped(start >> PAGE_SHIFT, ret >> PAGE_SHIFT);
+
 	return ret >> PAGE_SHIFT;
 }
 
+/*
+ * Iterate through E820 memory map and create direct mappings for only E820_RAM
+ * regions. We cannot simply create direct mappings for all pfns from
+ * [0 to max_low_pfn) and [4GB to max_pfn) because of possible memory holes in
+ * high addresses that cannot be marked as UC by fixed/variable range MTRRs.
+ * Depending on the alignment of E820 ranges, this may possibly result in using
+ * smaller size (i.e. 4K instead of 2M or 1G) page tables.
+ */
+static void __init init_all_memory_mapping(void)
+{
+	unsigned long start_pfn, end_pfn;
+	int i;
+
+	/* the ISA range is always mapped regardless of memory holes */
+	init_memory_mapping(0, ISA_END_ADDRESS);
+
+	for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, NULL) {
+		u64 start = start_pfn << PAGE_SHIFT;
+		u64 end = end_pfn << PAGE_SHIFT;
+
+		if (end <= ISA_END_ADDRESS)
+			continue;
+
+		if (start < ISA_END_ADDRESS)
+			start = ISA_END_ADDRESS;
+#ifdef CONFIG_X86_32
+		/* on 32 bit, we only map up to max_low_pfn */
+		if ((start >> PAGE_SHIFT) >= max_low_pfn)
+			continue;
+
+		if ((end >> PAGE_SHIFT) > max_low_pfn)
+			end = max_low_pfn << PAGE_SHIFT;
+#endif
+		init_memory_mapping(start, end);
+	}
+
+#ifdef CONFIG_X86_64
+	if (max_pfn > max_low_pfn) {
+		/* can we preseve max_low_pfn ?*/
+		max_low_pfn = max_pfn;
+	}
+#endif
+}
+
 void __init init_mem_mapping(void)
 {
 	unsigned long tables, good_end, end;
@@ -303,23 +408,15 @@ void __init init_mem_mapping(void)
 	end = max_low_pfn << PAGE_SHIFT;
 	good_end = max_pfn_mapped << PAGE_SHIFT;
 #endif
-	tables = calculate_table_space_size(0, end);
+	tables = calculate_all_table_space_size();
 	find_early_table_space(0, good_end, tables);
 	printk(KERN_DEBUG "kernel direct mapping tables up to %#lx @ [mem %#010lx-%#010lx] prealloc\n",
 		end - 1, pgt_buf_start << PAGE_SHIFT,
 		(pgt_buf_top << PAGE_SHIFT) - 1);
 
-	max_low_pfn_mapped = init_memory_mapping(0, max_low_pfn<<PAGE_SHIFT);
-	max_pfn_mapped = max_low_pfn_mapped;
+	max_pfn_mapped = 0; /* will get exact value next */
+	init_all_memory_mapping();
 
-#ifdef CONFIG_X86_64
-	if (max_pfn > max_low_pfn) {
-		max_pfn_mapped = init_memory_mapping(1UL<<32,
-						     max_pfn<<PAGE_SHIFT);
-		/* can we preseve max_low_pfn ?*/
-		max_low_pfn = max_pfn;
-	}
-#endif
 	/*
 	 * Reserve the kernel pagetable pages we used (pgt_buf_start -
 	 * pgt_buf_end) and free the other ones (pgt_buf_end - pgt_buf_top)
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 2b6b4a3..ab558eb 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -657,13 +657,11 @@ int arch_add_memory(int nid, u64 start, u64 size)
 {
 	struct pglist_data *pgdat = NODE_DATA(nid);
 	struct zone *zone = pgdat->node_zones + ZONE_NORMAL;
-	unsigned long last_mapped_pfn, start_pfn = start >> PAGE_SHIFT;
+	unsigned long start_pfn = start >> PAGE_SHIFT;
 	unsigned long nr_pages = size >> PAGE_SHIFT;
 	int ret;
 
-	last_mapped_pfn = init_memory_mapping(start, start + size);
-	if (last_mapped_pfn > max_pfn_mapped)
-		max_pfn_mapped = last_mapped_pfn;
+	init_memory_mapping(start, start + size);
 
 	ret = __add_pages(nid, zone, start_pfn, nr_pages);
 	WARN_ON_ONCE(ret);
-- 
1.7.7



* [PATCH 12/13] x86/mm: calculate_table_space_size based on memory ranges that are being mapped
  2012-09-30  7:57 [PATCH -v4 00/13] x86, mm: init_memory_mapping cleanup Yinghai Lu
                   ` (10 preceding siblings ...)
  2012-09-30  7:57 ` [PATCH 11/13] x86: Only direct map addresses that are marked as E820_RAM Yinghai Lu
@ 2012-09-30  7:57 ` Yinghai Lu
  2012-09-30  7:57 ` [PATCH 13/13] x86, mm: Use func pointer to table size calculation and mapping Yinghai Lu
  12 siblings, 0 replies; 57+ messages in thread
From: Yinghai Lu @ 2012-09-30  7:57 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Jacob Shin, Tejun Heo
  Cc: linux-kernel, Yinghai Lu

From: Jacob Shin <jacob.shin@amd.com>

The current logic finds enough space for the direct mapping page tables from 0
to end. Instead, we only need to find enough space to cover mr[0].start to
mr[nr_range].end -- the range that is actually being mapped by
init_memory_mapping().

This patch also reportedly fixes the suspend/resume issue reported in:

https://lkml.org/lkml/2012/8/11/83

-v2: update with calculate_table_space_size()
     clear max_pfn_mapped before init_all_memory_mapping to get right value
					  -Yinghai Lu
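
Sketch of the per-range accumulation this patch introduces (condensed from the
diff below):

	nr_range = split_mem_range(mr, 0, start, end);
	for (i = 0; i < nr_range; i++) {
		unsigned long range = mr[i].end - mr[i].start;

		puds += (range + PUD_SIZE - 1) >> PUD_SHIFT;
		/* pmds and ptes are added the same way, depending on
		 * mr[i].page_size_mask */
	}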

Signed-off-by: Jacob Shin <jacob.shin@amd.com>
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Reviewed-by: Pekka Enberg <penberg@kernel.org>
---
 arch/x86/mm/init.c |   51 ++++++++++++++++++++++++++++++---------------------
 1 files changed, 30 insertions(+), 21 deletions(-)

diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index 3237c9b..c12dfd5 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -191,39 +191,48 @@ static int __meminit split_mem_range(struct map_range *mr, int nr_range,
 	return nr_range;
 }
 
-static unsigned long __init calculate_table_space_size(unsigned long begin,
+static unsigned long __init calculate_table_space_size(unsigned long start,
 					  unsigned long end)
 {
-	unsigned long puds, pmds, ptes, tables;
+	unsigned long puds = 0, pmds = 0, ptes = 0, tables;
+	struct map_range mr[NR_RANGE_MR];
+	int nr_range, i;
 
-	puds = (end + PUD_SIZE - 1) >> PUD_SHIFT;
-	tables = roundup(puds * sizeof(pud_t), PAGE_SIZE);
+	pr_info("calculate_table_space_size: [mem %#010lx-%#010lx]\n",
+	       start, end - 1);
 
-	if (page_size_mask & (1 << PG_LEVEL_1G)) {
-		unsigned long extra;
+	memset(mr, 0, sizeof(mr));
+	nr_range = 0;
+	nr_range = split_mem_range(mr, nr_range, start, end);
 
-		extra = end - ((end>>PUD_SHIFT) << PUD_SHIFT);
-		pmds = (extra + PMD_SIZE - 1) >> PMD_SHIFT;
-	} else
-		pmds = (end + PMD_SIZE - 1) >> PMD_SHIFT;
+	for (i = 0; i < nr_range; i++) {
+		unsigned long range, extra;
 
-	tables += roundup(pmds * sizeof(pmd_t), PAGE_SIZE);
+		range = mr[i].end - mr[i].start;
+		puds += (range + PUD_SIZE - 1) >> PUD_SHIFT;
 
-	if (page_size_mask & (1 << PG_LEVEL_2M)) {
-		unsigned long extra;
+		if (mr[i].page_size_mask & (1 << PG_LEVEL_1G)) {
+			extra = range - ((range >> PUD_SHIFT) << PUD_SHIFT);
+			pmds += (extra + PMD_SIZE - 1) >> PMD_SHIFT;
+		} else
+			pmds += (range + PMD_SIZE - 1) >> PMD_SHIFT;
 
-		extra = end - ((end>>PMD_SHIFT) << PMD_SHIFT);
+		if (mr[i].page_size_mask & (1 << PG_LEVEL_2M)) {
+			extra = range - ((range >> PMD_SHIFT) << PMD_SHIFT);
 #ifdef CONFIG_X86_32
-		extra += PMD_SIZE;
+			extra += PMD_SIZE;
 #endif
-		/* The first 2/4M doesn't use large pages. */
-		if (begin < PMD_SIZE)
-			extra += (PMD_SIZE - begin) >> PAGE_SHIFT;
+			/* The first 2/4M doesn't use large pages. */
+			if (mr[i].start < PMD_SIZE)
+				extra += PMD_SIZE - mr[i].start;
 
-		ptes = (extra + PAGE_SIZE - 1) >> PAGE_SHIFT;
-	} else
-		ptes = (end + PAGE_SIZE - 1) >> PAGE_SHIFT;
+			ptes += (extra + PAGE_SIZE - 1) >> PAGE_SHIFT;
+		} else
+			ptes += (range + PAGE_SIZE - 1) >> PAGE_SHIFT;
+	}
 
+	tables = roundup(puds * sizeof(pud_t), PAGE_SIZE);
+	tables += roundup(pmds * sizeof(pmd_t), PAGE_SIZE);
 	tables += roundup(ptes * sizeof(pte_t), PAGE_SIZE);
 
 #ifdef CONFIG_X86_32
-- 
1.7.7



* [PATCH 13/13] x86, mm: Use func pointer to table size calculation and mapping
  2012-09-30  7:57 [PATCH -v4 00/13] x86, mm: init_memory_mapping cleanup Yinghai Lu
                   ` (11 preceding siblings ...)
  2012-09-30  7:57 ` [PATCH 12/13] x86/mm: calculate_table_space_size based on memory ranges that are being mapped Yinghai Lu
@ 2012-09-30  7:57 ` Yinghai Lu
  12 siblings, 0 replies; 57+ messages in thread
From: Yinghai Lu @ 2012-09-30  7:57 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Jacob Shin, Tejun Heo
  Cc: linux-kernel, Yinghai Lu

The table space size calculation and the mapping both need to go over the RAM
ranges in the same sequence, so add a shared function to reduce the duplicated
code.

-v2: Change to walk_ram_ranges() according to Pekka Enberg.
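
Usage then looks like this (sketch, condensed from the init_mem_mapping() hunk
below):

	walk_ram_ranges(size_work_fn, &tables);		/* pass 1: total page-table space  */
	find_early_table_space(0, good_end, tables);
	walk_ram_ranges(mapping_work_fn, NULL);		/* pass 2: map each E820_RAM range */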

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Reviewed-by: Pekka Enberg <penberg@kernel.org>
---
 arch/x86/mm/init.c |   64 ++++++++++++++++++---------------------------------
 1 files changed, 23 insertions(+), 41 deletions(-)

diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index c12dfd5..cf662ba 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -243,14 +243,15 @@ static unsigned long __init calculate_table_space_size(unsigned long start,
 	return tables;
 }
 
-static unsigned long __init calculate_all_table_space_size(void)
+static void __init walk_ram_ranges(
+			void (*work_fn)(unsigned long, unsigned long, void *),
+			void *data)
 {
 	unsigned long start_pfn, end_pfn;
-	unsigned long tables;
 	int i;
 
 	/* the ISA range is always mapped regardless of memory holes */
-	tables = calculate_table_space_size(0, ISA_END_ADDRESS);
+	work_fn(0, ISA_END_ADDRESS, data);
 
 	for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, NULL) {
 		u64 start = start_pfn << PAGE_SHIFT;
@@ -269,10 +270,15 @@ static unsigned long __init calculate_all_table_space_size(void)
 		if ((end >> PAGE_SHIFT) > max_low_pfn)
 			end = max_low_pfn << PAGE_SHIFT;
 #endif
-		tables += calculate_table_space_size(start, end);
+		work_fn(start, end, data);
 	}
+}
 
-	return tables;
+static void __init size_work_fn(unsigned long start, unsigned long end, void *data)
+{
+	unsigned long *size = data;
+
+	*size += calculate_table_space_size(start, end);
 }
 
 static void __init find_early_table_space(unsigned long start,
@@ -361,45 +367,15 @@ unsigned long __init_refok init_memory_mapping(unsigned long start,
  * Depending on the alignment of E820 ranges, this may possibly result in using
  * smaller size (i.e. 4K instead of 2M or 1G) page tables.
  */
-static void __init init_all_memory_mapping(void)
+static void __init mapping_work_fn(unsigned long start, unsigned long end,
+					 void *data)
 {
-	unsigned long start_pfn, end_pfn;
-	int i;
-
-	/* the ISA range is always mapped regardless of memory holes */
-	init_memory_mapping(0, ISA_END_ADDRESS);
-
-	for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, NULL) {
-		u64 start = start_pfn << PAGE_SHIFT;
-		u64 end = end_pfn << PAGE_SHIFT;
-
-		if (end <= ISA_END_ADDRESS)
-			continue;
-
-		if (start < ISA_END_ADDRESS)
-			start = ISA_END_ADDRESS;
-#ifdef CONFIG_X86_32
-		/* on 32 bit, we only map up to max_low_pfn */
-		if ((start >> PAGE_SHIFT) >= max_low_pfn)
-			continue;
-
-		if ((end >> PAGE_SHIFT) > max_low_pfn)
-			end = max_low_pfn << PAGE_SHIFT;
-#endif
-		init_memory_mapping(start, end);
-	}
-
-#ifdef CONFIG_X86_64
-	if (max_pfn > max_low_pfn) {
-		/* can we preseve max_low_pfn ?*/
-		max_low_pfn = max_pfn;
-	}
-#endif
+	init_memory_mapping(start, end);
 }
 
 void __init init_mem_mapping(void)
 {
-	unsigned long tables, good_end, end;
+	unsigned long tables = 0, good_end, end;
 
 	probe_page_size_mask();
 
@@ -417,15 +393,21 @@ void __init init_mem_mapping(void)
 	end = max_low_pfn << PAGE_SHIFT;
 	good_end = max_pfn_mapped << PAGE_SHIFT;
 #endif
-	tables = calculate_all_table_space_size();
+	walk_ram_ranges(size_work_fn, &tables);
 	find_early_table_space(0, good_end, tables);
 	printk(KERN_DEBUG "kernel direct mapping tables up to %#lx @ [mem %#010lx-%#010lx] prealloc\n",
 		end - 1, pgt_buf_start << PAGE_SHIFT,
 		(pgt_buf_top << PAGE_SHIFT) - 1);
 
 	max_pfn_mapped = 0; /* will get exact value next */
-	init_all_memory_mapping();
+	walk_ram_ranges(mapping_work_fn, NULL);
 
+#ifdef CONFIG_X86_64
+	if (max_pfn > max_low_pfn) {
+		/* can we preseve max_low_pfn ?*/
+		max_low_pfn = max_pfn;
+	}
+#endif
 	/*
 	 * Reserve the kernel pagetable pages we used (pgt_buf_start -
 	 * pgt_buf_end) and free the other ones (pgt_buf_end - pgt_buf_top)
-- 
1.7.7


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* Re: [PATCH 04/13] x86, mm: Revert back good_end setting for 64bit
  2012-09-30  7:57 ` [PATCH 04/13] x86, mm: Revert back good_end setting for 64bit Yinghai Lu
@ 2012-10-01 11:00   ` Stefano Stabellini
  2012-10-03 16:51     ` Jacob Shin
  2012-10-04 15:57     ` Yinghai Lu
  0 siblings, 2 replies; 57+ messages in thread
From: Stefano Stabellini @ 2012-10-01 11:00 UTC (permalink / raw)
  To: Yinghai Lu
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Jacob Shin,
	Tejun Heo, linux-kernel, Konrad Rzeszutek Wilk

On Sun, 30 Sep 2012, Yinghai Lu wrote:
> After
> 
> | commit 8548c84da2f47e71bbbe300f55edb768492575f7
> | Author: Takashi Iwai <tiwai@suse.de>
> | Date:   Sun Oct 23 23:19:12 2011 +0200
> |
> |    x86: Fix S4 regression
> |
> |    Commit 4b239f458 ("x86-64, mm: Put early page table high") causes a S4
> |    regression since 2.6.39, namely the machine reboots occasionally at S4
> |    resume.  It doesn't happen always, overall rate is about 1/20.  But,
> |    like other bugs, once when this happens, it continues to happen.
> |
> |    This patch fixes the problem by essentially reverting the memory
> |    assignment in the older way.
> 
> Have some page table around 512M again, that will prevent kdump to find 512M
> under 768M.
> 
> We need revert that reverting, so we could put page table high again for 64bit.
> 
> Takashi agreed that S4 regression could be something else.
> 
> 	https://lkml.org/lkml/2012/6/15/182
> 
> Signed-off-by: Yinghai Lu <yinghai@kernel.org>
> ---
>  arch/x86/mm/init.c |    2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
> 
> diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
> index 9f69180..aadb154 100644
> --- a/arch/x86/mm/init.c
> +++ b/arch/x86/mm/init.c
> @@ -76,8 +76,8 @@ static void __init find_early_table_space(struct map_range *mr,
>  #ifdef CONFIG_X86_32
>  	/* for fixmap */
>  	tables += roundup(__end_of_fixed_addresses * sizeof(pte_t), PAGE_SIZE);
> -#endif
>  	good_end = max_pfn_mapped << PAGE_SHIFT;
> +#endif
>  
>  	base = memblock_find_in_range(start, good_end, tables, PAGE_SIZE);
>  	if (!base)

Isn't this going to cause init_memory_mapping to allocate pagetable
pages from memory not yet mapped?
Last time I spoke with HPA and Thomas about this, they seemed to agree
that it isn't a very good idea.
Also, it has proven to cause a certain amount of headaches on Xen,
see commit d8aa5ec3382e6a545b8f25178d1e0992d4927f19.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 04/13] x86, mm: Revert back good_end setting for 64bit
  2012-10-01 11:00   ` Stefano Stabellini
@ 2012-10-03 16:51     ` Jacob Shin
  2012-10-03 18:34       ` H. Peter Anvin
                         ` (2 more replies)
  2012-10-04 15:57     ` Yinghai Lu
  1 sibling, 3 replies; 57+ messages in thread
From: Jacob Shin @ 2012-10-03 16:51 UTC (permalink / raw)
  To: Stefano Stabellini
  Cc: Yinghai Lu, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	Tejun Heo, linux-kernel, Konrad Rzeszutek Wilk

On Mon, Oct 01, 2012 at 12:00:26PM +0100, Stefano Stabellini wrote:
> On Sun, 30 Sep 2012, Yinghai Lu wrote:
> > After
> > 
> > | commit 8548c84da2f47e71bbbe300f55edb768492575f7
> > | Author: Takashi Iwai <tiwai@suse.de>
> > | Date:   Sun Oct 23 23:19:12 2011 +0200
> > |
> > |    x86: Fix S4 regression
> > |
> > |    Commit 4b239f458 ("x86-64, mm: Put early page table high") causes a S4
> > |    regression since 2.6.39, namely the machine reboots occasionally at S4
> > |    resume.  It doesn't happen always, overall rate is about 1/20.  But,
> > |    like other bugs, once when this happens, it continues to happen.
> > |
> > |    This patch fixes the problem by essentially reverting the memory
> > |    assignment in the older way.
> > 
> > Have some page table around 512M again, that will prevent kdump to find 512M
> > under 768M.
> > 
> > We need revert that reverting, so we could put page table high again for 64bit.
> > 
> > Takashi agreed that S4 regression could be something else.
> > 
> > 	https://lkml.org/lkml/2012/6/15/182
> > 
> > Signed-off-by: Yinghai Lu <yinghai@kernel.org>
> > ---
> >  arch/x86/mm/init.c |    2 +-
> >  1 files changed, 1 insertions(+), 1 deletions(-)
> > 
> > diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
> > index 9f69180..aadb154 100644
> > --- a/arch/x86/mm/init.c
> > +++ b/arch/x86/mm/init.c
> > @@ -76,8 +76,8 @@ static void __init find_early_table_space(struct map_range *mr,
> >  #ifdef CONFIG_X86_32
> >  	/* for fixmap */
> >  	tables += roundup(__end_of_fixed_addresses * sizeof(pte_t), PAGE_SIZE);
> > -#endif
> >  	good_end = max_pfn_mapped << PAGE_SHIFT;
> > +#endif
> >  
> >  	base = memblock_find_in_range(start, good_end, tables, PAGE_SIZE);
> >  	if (!base)
> 
> Isn't this going to cause init_memory_mapping to allocate pagetable
> pages from memory not yet mapped?
> Last time I spoke with HPA and Thomas about this, they seem to agree
> that it isn't a very good idea.
> Also, it is proven to cause a certain amount of headaches on Xen,
> see commit d8aa5ec3382e6a545b8f25178d1e0992d4927f19.
> 

Any comments, thoughts? hpa? Yinghai?

So it seems that during init_memory_mapping Xen needs to modify page table 
bits and the memory where the page tables live needs to be direct mapped at
that time.

Since we now call init_memory_mapping for every E820_RAM range sequentially,
the only way to satisfy Xen is to call find_early_page_table_space (good_end
needs to be within memory already mapped at the time) for every
init_memory_mapping call.

What do you think Yinghai?


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 04/13] x86, mm: Revert back good_end setting for 64bit
  2012-10-03 16:51     ` Jacob Shin
@ 2012-10-03 18:34       ` H. Peter Anvin
  2012-10-04 13:56       ` Konrad Rzeszutek Wilk
  2012-10-04 16:19       ` Yinghai Lu
  2 siblings, 0 replies; 57+ messages in thread
From: H. Peter Anvin @ 2012-10-03 18:34 UTC (permalink / raw)
  To: Jacob Shin
  Cc: Stefano Stabellini, Yinghai Lu, Thomas Gleixner, Ingo Molnar,
	Tejun Heo, linux-kernel, Konrad Rzeszutek Wilk

On 10/03/2012 09:51 AM, Jacob Shin wrote:
>
> Any comments, thoughts? hpa? Yinghai?
>
> So it seems that during init_memory_mapping Xen needs to modify page table
> bits and the memory where the page tables live needs to be direct mapped at
> that time.
>
> Since we now call init_memory_mapping for every E820_RAM range sequencially,
> the only way to satisfy Xen is to find_early_page_table_space (good_end needs
> to be within memory already mapped at the time) for every init_memory_mapping
> call.
>
> What do you think Yinghai?
>

I outlined the sane way to do this at Kernel Summit for Yinghai and 
several other people.  I need to write it up for people who weren't 
there, but I don't have time right at the moment.

	-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 04/13] x86, mm: Revert back good_end setting for 64bit
  2012-10-03 16:51     ` Jacob Shin
  2012-10-03 18:34       ` H. Peter Anvin
@ 2012-10-04 13:56       ` Konrad Rzeszutek Wilk
  2012-10-04 21:52         ` H. Peter Anvin
  2012-10-04 16:19       ` Yinghai Lu
  2 siblings, 1 reply; 57+ messages in thread
From: Konrad Rzeszutek Wilk @ 2012-10-04 13:56 UTC (permalink / raw)
  To: Jacob Shin
  Cc: Stefano Stabellini, Yinghai Lu, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, Tejun Heo, linux-kernel, Konrad Rzeszutek Wilk

On Wed, Oct 03, 2012 at 11:51:06AM -0500, Jacob Shin wrote:
> On Mon, Oct 01, 2012 at 12:00:26PM +0100, Stefano Stabellini wrote:
> > On Sun, 30 Sep 2012, Yinghai Lu wrote:
> > > After
> > > 
> > > | commit 8548c84da2f47e71bbbe300f55edb768492575f7
> > > | Author: Takashi Iwai <tiwai@suse.de>
> > > | Date:   Sun Oct 23 23:19:12 2011 +0200
> > > |
> > > |    x86: Fix S4 regression
> > > |
> > > |    Commit 4b239f458 ("x86-64, mm: Put early page table high") causes a S4
> > > |    regression since 2.6.39, namely the machine reboots occasionally at S4
> > > |    resume.  It doesn't happen always, overall rate is about 1/20.  But,
> > > |    like other bugs, once when this happens, it continues to happen.
> > > |
> > > |    This patch fixes the problem by essentially reverting the memory
> > > |    assignment in the older way.
> > > 
> > > Have some page table around 512M again, that will prevent kdump to find 512M
> > > under 768M.
> > > 
> > > We need revert that reverting, so we could put page table high again for 64bit.
> > > 
> > > Takashi agreed that S4 regression could be something else.
> > > 
> > > 	https://lkml.org/lkml/2012/6/15/182
> > > 
> > > Signed-off-by: Yinghai Lu <yinghai@kernel.org>
> > > ---
> > >  arch/x86/mm/init.c |    2 +-
> > >  1 files changed, 1 insertions(+), 1 deletions(-)
> > > 
> > > diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
> > > index 9f69180..aadb154 100644
> > > --- a/arch/x86/mm/init.c
> > > +++ b/arch/x86/mm/init.c
> > > @@ -76,8 +76,8 @@ static void __init find_early_table_space(struct map_range *mr,
> > >  #ifdef CONFIG_X86_32
> > >  	/* for fixmap */
> > >  	tables += roundup(__end_of_fixed_addresses * sizeof(pte_t), PAGE_SIZE);
> > > -#endif
> > >  	good_end = max_pfn_mapped << PAGE_SHIFT;
> > > +#endif
> > >  
> > >  	base = memblock_find_in_range(start, good_end, tables, PAGE_SIZE);
> > >  	if (!base)
> > 
> > Isn't this going to cause init_memory_mapping to allocate pagetable
> > pages from memory not yet mapped?
> > Last time I spoke with HPA and Thomas about this, they seem to agree
> > that it isn't a very good idea.
> > Also, it is proven to cause a certain amount of headaches on Xen,
> > see commit d8aa5ec3382e6a545b8f25178d1e0992d4927f19.
> > 
> 
> Any comments, thoughts? hpa? Yinghai?
> 
> So it seems that during init_memory_mapping Xen needs to modify page table 
> bits and the memory where the page tables live needs to be direct mapped at
> that time.

That is not exactly true. I am not sure if we are just using the wrong
words for it - so let me try to write up what the impediment is.

There is also this talk between Stefano and tglx that can help in
getting one's head around it: https://lkml.org/lkml/2012/8/24/335

The restriction that Xen places on Linux page-tables is that they MUST
be read-only while in use. Meaning if you are creating a PTE table (or
PMD, PUD, etc), you can write to it as long as you want - but the
moment you hook it up to a live page-table it must be marked RO (so the
PMD entry cannot have _PAGE_RW on it). Easy enough.

This means that if we are re-using a pagetable during
init_memory_mapping (so we iomap it), we need to iomap it with
!_PAGE_RW - and that is where xen_set_pte_init has a check for
is_early_ioremap_ptep. To add to the fun, the pagetables are expanding -
so as one is ioremapping/iounmapping, you have to check pgt_buf_end
to see whether the page table we are mapping is within:
 pgt_buf_start -> pgt_buf_end <- pgt_buf_top

(and pgt_buf_end can increment up to pgt_buf_top).

Now the next part of this that is hard to wrap your head around is when
you want to create the PTE entries for pgt_buf_start -> pgt_buf_end.
It's doubly fun, b/c your pgt_buf_end can increment as you are trying
to create those PTE entries - and you _MUST_ mark those PTE entries as
RO. This is b/c those pagetables (pgt_buf_start -> pgt_buf_end) are
live and only Xen can touch them.

This feels like operating on a live patient while said patient is
running a marathon. Only a duct-tape expert can apply for this
position.
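
To make that concrete, here is a minimal user-space sketch of the linear
bump allocation described above. Only the pgt_buf_{start,end,top} names
are taken from the kernel; everything else (alloc_low_pfn, the numbers)
is hypothetical illustration, not the real alloc_low_page:

#include <stdio.h>
#include <stdlib.h>

#define PAGE_SIZE 4096UL

static unsigned long pgt_buf_start, pgt_buf_end, pgt_buf_top;

/* bump-allocate one page frame out of the preallocated window */
static unsigned long alloc_low_pfn(void)
{
	if (pgt_buf_end >= pgt_buf_top) {
		fprintf(stderr, "alloc_low_pfn: ran out of memory\n");
		exit(1);
	}
	return pgt_buf_end++;
}

int main(void)
{
	int i;

	/* pretend memblock handed us a 16-page window at 512M */
	pgt_buf_start = (512UL << 20) / PAGE_SIZE;
	pgt_buf_end = pgt_buf_start;
	pgt_buf_top = pgt_buf_start + 16;

	for (i = 0; i < 4; i++) {
		unsigned long pfn = alloc_low_pfn();

		/* fill the page with PTEs here (RW is fine so far) ...
		 * ... then hook it into a live page table: from this
		 * moment Xen requires the mapping of pfn to be RO,
		 * even though pgt_buf_end keeps growing past it. */
		printf("page table page at pfn %#lx\n", pfn);
	}
	return 0;
}

The point is that pgt_buf_end only ever moves forward, while pages
behind it may already be live - and that window is exactly what Xen has
to see as read-only.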


What Peter had in mind is a nice system where we get rid of this
linear allocation of page-tables (so pgt_buf_start -> pgt_buf_end are
linearly allocated). His thinking (and Peter, if I mess up please
correct me) is that we can stick the various pagetables in different
spots in memory. Mainly that as we look at mapping a region (say
0GB->1GB), we look at it in chunks (2MB?) and allocate a page-table at
the _end_ of the newly mapped chunk if we have filled all entries in
said pagetable.

For simplicity, let's say we are just dealing with PTE tables and
we are mapping the region 0GB->1GB with 4KB pages.

First we stick a page-table (or reuse one if we find it there)
at the start of the region (so 0-2MB).

0MB.......................2MB
/-----\
|PTE_A|
\-----/

The PTE entries in it will cover 0->2MB (PTE table #A) and once it is
finished, it will stick a new pagetable at the end of the 2MB region:

0MB.......................2MB...........................4MB
/-----\                /-----\
|PTE_A|                |PTE_B|
\-----/                \-----/


The PTE_B page table will be used to map 2MB->4MB.

Once that is finished .. we repeat the cycle.

That should remove the utter duct-tape madness and make this a lot
easier.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 04/13] x86, mm: Revert back good_end setting for 64bit
  2012-10-01 11:00   ` Stefano Stabellini
  2012-10-03 16:51     ` Jacob Shin
@ 2012-10-04 15:57     ` Yinghai Lu
  2012-10-04 16:45       ` Konrad Rzeszutek Wilk
  2012-10-05 10:47       ` Stefano Stabellini
  1 sibling, 2 replies; 57+ messages in thread
From: Yinghai Lu @ 2012-10-04 15:57 UTC (permalink / raw)
  To: Stefano Stabellini
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Jacob Shin,
	Tejun Heo, linux-kernel, Konrad Rzeszutek Wilk

On Mon, Oct 1, 2012 at 4:00 AM, Stefano Stabellini
<stefano.stabellini@eu.citrix.com> wrote:
> On Sun, 30 Sep 2012, Yinghai Lu wrote:
>> After
>>
>> | commit 8548c84da2f47e71bbbe300f55edb768492575f7
>> | Author: Takashi Iwai <tiwai@suse.de>
>> | Date:   Sun Oct 23 23:19:12 2011 +0200
>> |
>> |    x86: Fix S4 regression
>> |
>> |    Commit 4b239f458 ("x86-64, mm: Put early page table high") causes a S4
>> |    regression since 2.6.39, namely the machine reboots occasionally at S4
>> |    resume.  It doesn't happen always, overall rate is about 1/20.  But,
>> |    like other bugs, once when this happens, it continues to happen.
>> |
>> |    This patch fixes the problem by essentially reverting the memory
>> |    assignment in the older way.
>>
>> Have some page table around 512M again, that will prevent kdump to find 512M
>> under 768M.
>>
>> We need revert that reverting, so we could put page table high again for 64bit.
>>
>> Takashi agreed that S4 regression could be something else.
>>
>>       https://lkml.org/lkml/2012/6/15/182
>>
>> Signed-off-by: Yinghai Lu <yinghai@kernel.org>
>> ---
>>  arch/x86/mm/init.c |    2 +-
>>  1 files changed, 1 insertions(+), 1 deletions(-)
>>
>> diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
>> index 9f69180..aadb154 100644
>> --- a/arch/x86/mm/init.c
>> +++ b/arch/x86/mm/init.c
>> @@ -76,8 +76,8 @@ static void __init find_early_table_space(struct map_range *mr,
>>  #ifdef CONFIG_X86_32
>>       /* for fixmap */
>>       tables += roundup(__end_of_fixed_addresses * sizeof(pte_t), PAGE_SIZE);
>> -#endif
>>       good_end = max_pfn_mapped << PAGE_SHIFT;
>> +#endif
>>
>>       base = memblock_find_in_range(start, good_end, tables, PAGE_SIZE);
>>       if (!base)
>
> Isn't this going to cause init_memory_mapping to allocate pagetable
> pages from memory not yet mapped?

but 64-bit is using ioremap to access that page table buffer.

> Last time I spoke with HPA and Thomas about this, they seem to agree
> that it isn't a very good idea.
> Also, it is proven to cause a certain amount of headaches on Xen,
> see commit d8aa5ec3382e6a545b8f25178d1e0992d4927f19.

this patchset will allocate the page table buffer one time only.
So we could use RAM under 1M to map that page table buffer at first.

So would that make Xen happy?

Thanks

Yinghai

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 04/13] x86, mm: Revert back good_end setting for 64bit
  2012-10-03 16:51     ` Jacob Shin
  2012-10-03 18:34       ` H. Peter Anvin
  2012-10-04 13:56       ` Konrad Rzeszutek Wilk
@ 2012-10-04 16:19       ` Yinghai Lu
  2012-10-04 16:46         ` Konrad Rzeszutek Wilk
  2 siblings, 1 reply; 57+ messages in thread
From: Yinghai Lu @ 2012-10-04 16:19 UTC (permalink / raw)
  To: Jacob Shin
  Cc: Stefano Stabellini, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	Tejun Heo, linux-kernel, Konrad Rzeszutek Wilk

On Wed, Oct 3, 2012 at 9:51 AM, Jacob Shin <jacob.shin@amd.com> wrote:
> Any comments, thoughts? hpa? Yinghai?
>
> So it seems that during init_memory_mapping Xen needs to modify page table
> bits and the memory where the page tables live needs to be direct mapped at
> that time.
>
> Since we now call init_memory_mapping for every E820_RAM range sequencially,
> the only way to satisfy Xen is to find_early_page_table_space (good_end needs
> to be within memory already mapped at the time) for every init_memory_mapping
> call.
>
> What do you think Yinghai?

that may put the page tables near the end of every RAM range, to be used
for the next memory range.

Then kdump may have problems getting a big range again.

Yinghai

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 04/13] x86, mm: Revert back good_end setting for 64bit
  2012-10-04 15:57     ` Yinghai Lu
@ 2012-10-04 16:45       ` Konrad Rzeszutek Wilk
  2012-10-04 21:21         ` Yinghai Lu
  2012-10-05 10:47       ` Stefano Stabellini
  1 sibling, 1 reply; 57+ messages in thread
From: Konrad Rzeszutek Wilk @ 2012-10-04 16:45 UTC (permalink / raw)
  To: Yinghai Lu
  Cc: Stefano Stabellini, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	Jacob Shin, Tejun Heo, linux-kernel

On Thu, Oct 04, 2012 at 08:57:55AM -0700, Yinghai Lu wrote:
> On Mon, Oct 1, 2012 at 4:00 AM, Stefano Stabellini
> <stefano.stabellini@eu.citrix.com> wrote:
> > On Sun, 30 Sep 2012, Yinghai Lu wrote:
> >> After
> >>
> >> | commit 8548c84da2f47e71bbbe300f55edb768492575f7
> >> | Author: Takashi Iwai <tiwai@suse.de>
> >> | Date:   Sun Oct 23 23:19:12 2011 +0200
> >> |
> >> |    x86: Fix S4 regression
> >> |
> >> |    Commit 4b239f458 ("x86-64, mm: Put early page table high") causes a S4
> >> |    regression since 2.6.39, namely the machine reboots occasionally at S4
> >> |    resume.  It doesn't happen always, overall rate is about 1/20.  But,
> >> |    like other bugs, once when this happens, it continues to happen.
> >> |
> >> |    This patch fixes the problem by essentially reverting the memory
> >> |    assignment in the older way.
> >>
> >> Have some page table around 512M again, that will prevent kdump to find 512M
> >> under 768M.
> >>
> >> We need revert that reverting, so we could put page table high again for 64bit.
> >>
> >> Takashi agreed that S4 regression could be something else.
> >>
> >>       https://lkml.org/lkml/2012/6/15/182
> >>
> >> Signed-off-by: Yinghai Lu <yinghai@kernel.org>
> >> ---
> >>  arch/x86/mm/init.c |    2 +-
> >>  1 files changed, 1 insertions(+), 1 deletions(-)
> >>
> >> diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
> >> index 9f69180..aadb154 100644
> >> --- a/arch/x86/mm/init.c
> >> +++ b/arch/x86/mm/init.c
> >> @@ -76,8 +76,8 @@ static void __init find_early_table_space(struct map_range *mr,
> >>  #ifdef CONFIG_X86_32
> >>       /* for fixmap */
> >>       tables += roundup(__end_of_fixed_addresses * sizeof(pte_t), PAGE_SIZE);
> >> -#endif
> >>       good_end = max_pfn_mapped << PAGE_SHIFT;
> >> +#endif
> >>
> >>       base = memblock_find_in_range(start, good_end, tables, PAGE_SIZE);
> >>       if (!base)
> >
> > Isn't this going to cause init_memory_mapping to allocate pagetable
> > pages from memory not yet mapped?
> 
> but 64bit is using ioremap to access those page table buf.
> 
> > Last time I spoke with HPA and Thomas about this, they seem to agree
> > that it isn't a very good idea.
> > Also, it is proven to cause a certain amount of headaches on Xen,
> > see commit d8aa5ec3382e6a545b8f25178d1e0992d4927f19.
> 
> this patchset will allocate page table buf one time only.

As in, if your machine has 8GB, it will allocate pagetables that
span 0->8GB at once?

> So could use ram under 1M to map that page table at first.

Could or does this patch do it? And why 1MB?
> 
> so that will make it xen happy ?

The issues that Xen faces are purely due to the fact that they
must be RO when they are in use. I believe (and without actually
checking it just to make sure) that it does not matter where
the page-tables are located. But with the current generic code
the location is quite linear: it starts with pgt_buf_start and
goes up to pgt_buf_top. So how would this patch move the location
of the page-table to be under 1MB?

Perhaps we are talking about separate topics?

My recollection of memblock_find_in_range is that it will try
the end of the range to find a suitable "chunk" that satisfies
the 'size' and 'alignment' parameters?
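
(For illustration only - a toy top-down search in plain C, not
memblock's actual implementation; range_is_free() and the numbers are
made up:)

#include <stdio.h>
#include <stdbool.h>

/* toy stand-in: pretend everything below 300M is already taken */
static bool range_is_free(unsigned long base, unsigned long size)
{
	return base >= (300UL << 20);
}

/* walk candidate bases from the top of [start, good_end) downward */
static unsigned long find_in_range_topdown(unsigned long start,
		unsigned long good_end, unsigned long size,
		unsigned long align)
{
	unsigned long base;

	for (base = (good_end - size) & ~(align - 1);
	     base >= start; base -= align)
		if (range_is_free(base, size))
			return base;
	return 0;
}

int main(void)
{
	/* e.g. 64K of tables, 4K aligned, below good_end = 896M */
	unsigned long base = find_in_range_topdown(1UL << 20,
			896UL << 20, 64UL << 10, 1UL << 12);

	printf("picked base %#lx\n", base);
	return 0;
}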

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 04/13] x86, mm: Revert back good_end setting for 64bit
  2012-10-04 16:19       ` Yinghai Lu
@ 2012-10-04 16:46         ` Konrad Rzeszutek Wilk
  2012-10-04 21:29           ` Yinghai Lu
  0 siblings, 1 reply; 57+ messages in thread
From: Konrad Rzeszutek Wilk @ 2012-10-04 16:46 UTC (permalink / raw)
  To: Yinghai Lu
  Cc: Jacob Shin, Stefano Stabellini, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, Tejun Heo, linux-kernel

On Thu, Oct 04, 2012 at 09:19:08AM -0700, Yinghai Lu wrote:
> On Wed, Oct 3, 2012 at 9:51 AM, Jacob Shin <jacob.shin@amd.com> wrote:
> > Any comments, thoughts? hpa? Yinghai?
> >
> > So it seems that during init_memory_mapping Xen needs to modify page table
> > bits and the memory where the page tables live needs to be direct mapped at
> > that time.
> >
> > Since we now call init_memory_mapping for every E820_RAM range sequencially,
> > the only way to satisfy Xen is to find_early_page_table_space (good_end needs
> > to be within memory already mapped at the time) for every init_memory_mapping
> > call.
> >
> > What do you think Yinghai?
> 
> that may put the page table on near end of every ram range for next
> memory range.
> 
> then kdump may have problem get big range again.

Is there a git commit that explains what the 'big range' problem is?

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 04/13] x86, mm: Revert back good_end setting for 64bit
  2012-10-04 16:45       ` Konrad Rzeszutek Wilk
@ 2012-10-04 21:21         ` Yinghai Lu
  2012-10-04 21:40           ` Yinghai Lu
  0 siblings, 1 reply; 57+ messages in thread
From: Yinghai Lu @ 2012-10-04 21:21 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: Stefano Stabellini, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	Jacob Shin, Tejun Heo, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 903 bytes --]

On Thu, Oct 4, 2012 at 9:45 AM, Konrad Rzeszutek Wilk
<konrad.wilk@oracle.com> wrote:
>> So could use ram under 1M to map that page table at first.
>
> Could or does this patch do it? And why 1MB?

can you or Stefano test the attached patch on Xen?

It will map the page table buffer that will be used.

That mapping goes under 1M, still with 4k pages there, so there will be
no page table around 512M.

>>
>> so that will make it xen happy ?
>
> The issues that Xen faces are purely due to the fact that they
> must be RO when they are in use. I believe (and without actually
> checking it just to make sure) that it does not matter where
> the page-tables are located. But with the current generic code
> the location is quite linear: it starts with pgt_buf_start and
> goes up to pgt_buf_top. So how would this patch move the location
> of the page-table to be under 1MB?

just the page tables that map the page table buffer itself.

Thanks

Yinghai

[-- Attachment #2: fix_xen_with_init_mapping.patch --]
[-- Type: application/octet-stream, Size: 2263 bytes --]

---
 arch/x86/mm/init.c |   47 +++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 47 insertions(+)

Index: linux-2.6/arch/x86/mm/init.c
===================================================================
--- linux-2.6.orig/arch/x86/mm/init.c
+++ linux-2.6/arch/x86/mm/init.c
@@ -359,6 +359,39 @@ unsigned long __init_refok init_memory_m
 	return ret >> PAGE_SHIFT;
 }
 
+static long __init early_init_memory_mapping(unsigned long start,
+					       unsigned long end)
+{
+	unsigned long ret;
+	unsigned long tables, good_end;
+
+	pr_info("init_memory_mapping: [mem %#010lx-%#010lx]\n",
+	       start, end - 1);
+
+	tables = calculate_table_space_size(start, end);
+	good_end = ISA_END_ADDRESS;
+
+	find_early_table_space(0, good_end, tables);
+
+	printk(KERN_DEBUG "kernel direct mapping tables from %#lx to %#lx @ [mem %#010lx-%#010lx] prealloc\n",
+		start, end - 1, pgt_buf_start << PAGE_SHIFT,
+		(pgt_buf_top << PAGE_SHIFT) - 1);
+
+	ret = init_memory_mapping(start, end);
+
+	if (pgt_buf_end > pgt_buf_start) {
+		printk(KERN_DEBUG "kernel direct mapping tables from %#lx to %#lx @ [mem %#010lx-%#010lx] final\n",
+			start, end - 1, pgt_buf_start << PAGE_SHIFT,
+			(pgt_buf_end << PAGE_SHIFT) - 1);
+		x86_init.mapping.pagetable_reserve(PFN_PHYS(pgt_buf_start),
+				PFN_PHYS(pgt_buf_end));
+	}
+
+	early_memtest(start, end);
+
+	return ret;
+}
+
 /*
  * Iterate through E820 memory map and create direct mappings for only E820_RAM
  * regions. We cannot simply create direct mappings for all pfns from
@@ -395,6 +428,20 @@ void __init init_mem_mapping(void)
 #endif
 	walk_ram_ranges(size_work_fn, &tables);
 	find_early_table_space(0, good_end, tables);
+
+	if (pgt_buf_top >= max_pfn_mapped) {
+		unsigned long old_pgt_buf_start = pgt_buf_start;
+		unsigned long old_pgt_buf_end = pgt_buf_end;
+		unsigned long old_pgt_buf_top = pgt_buf_top;
+
+		early_init_memory_mapping(old_pgt_buf_start << PAGE_SHIFT,
+					  old_pgt_buf_top << PAGE_SHIFT);
+
+		pgt_buf_start = old_pgt_buf_start;
+		pgt_buf_end = old_pgt_buf_end;
+		pgt_buf_top = old_pgt_buf_top;
+	}
+
 	printk(KERN_DEBUG "kernel direct mapping tables up to %#lx @ [mem %#010lx-%#010lx] prealloc\n",
 		end - 1, pgt_buf_start << PAGE_SHIFT,
 		(pgt_buf_top << PAGE_SHIFT) - 1);

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 04/13] x86, mm: Revert back good_end setting for 64bit
  2012-10-04 16:46         ` Konrad Rzeszutek Wilk
@ 2012-10-04 21:29           ` Yinghai Lu
  2012-10-05 21:04             ` Eric W. Biederman
  0 siblings, 1 reply; 57+ messages in thread
From: Yinghai Lu @ 2012-10-04 21:29 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: Jacob Shin, Stefano Stabellini, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, Tejun Heo, linux-kernel

On Thu, Oct 4, 2012 at 9:46 AM, Konrad Rzeszutek Wilk
<konrad.wilk@oracle.com> wrote:
>> then kdump may have problem get big range again.
>
> Is there a git commit that explains what the 'big range' problem is?

commit 7f8595bfacef279f06c82ec98d420ef54f2537e0
Author: H. Peter Anvin <hpa@linux.intel.com>
Date:   Thu Dec 16 19:20:41 2010 -0800

    x86, kexec: Limit the crashkernel address appropriately

    Keep the crash kernel address below 512 MiB for 32 bits and 896 MiB
    for 64 bits.  For 32 bits, this retains compatibility with earlier
    kernel releases, and makes it work even if the vmalloc= setting is
    adjusted.

    For 64 bits, we should be able to increase this substantially once a
    hard-coded limit in kexec-tools is fixed.

    Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
    Cc: Vivek Goyal <vgoyal@redhat.com>
    Cc: Stanislaw Gruszka <sgruszka@redhat.com>
    Cc: Yinghai Lu <yinghai@kernel.org>
    LKML-Reference: <20101217195035.GE14502@redhat.com>

diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 21c6746..c9089a1 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -501,7 +501,18 @@ static inline unsigned long long get_total_mem(void)
        return total << PAGE_SHIFT;
 }

-#define DEFAULT_BZIMAGE_ADDR_MAX 0x37FFFFFF
+/*
+ * Keep the crash kernel below this limit.  On 32 bits earlier kernels
+ * would limit the kernel to the low 512 MiB due to mapping restrictions.
+ * On 64 bits, kexec-tools currently limits us to 896 MiB; increase this
+ * limit once kexec-tools are fixed.
+ */
+#ifdef CONFIG_X86_32
+# define CRASH_KERNEL_ADDR_MAX (512 << 20)
+#else
+# define CRASH_KERNEL_ADDR_MAX (896 << 20)
+#endif
+
 static void __init reserve_crashkernel(void)
 {
        unsigned long long total_mem;
@@ -520,10 +531,10 @@ static void __init reserve_crashkernel(void)
                const unsigned long long alignment = 16<<20;    /* 16M */

                /*
-                *  kexec want bzImage is below DEFAULT_BZIMAGE_ADDR_MAX
+                *  kexec want bzImage is below CRASH_KERNEL_ADDR_MAX
                 */
                crash_base = memblock_find_in_range(alignment,
-                              DEFAULT_BZIMAGE_ADDR_MAX, crash_size, alignment);
+                              CRASH_KERNEL_ADDR_MAX, crash_size, alignment);

                if (crash_base == MEMBLOCK_ERROR) {
                        pr_info("crashkernel reservation failed - No suitable area found.\n");

^ permalink raw reply related	[flat|nested] 57+ messages in thread

* Re: [PATCH 04/13] x86, mm: Revert back good_end setting for 64bit
  2012-10-04 21:21         ` Yinghai Lu
@ 2012-10-04 21:40           ` Yinghai Lu
  2012-10-04 21:41             ` H. Peter Anvin
  0 siblings, 1 reply; 57+ messages in thread
From: Yinghai Lu @ 2012-10-04 21:40 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: Stefano Stabellini, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	Jacob Shin, Tejun Heo, linux-kernel

On Thu, Oct 4, 2012 at 2:21 PM, Yinghai Lu <yinghai@kernel.org> wrote:
> On Thu, Oct 4, 2012 at 9:45 AM, Konrad Rzeszutek Wilk
> <konrad.wilk@oracle.com> wrote:
>>> So could use ram under 1M to map that page table at first.
>>
>> Could or does this patch do it? And why 1MB?
>
> can you or stefano could test attached patch on xen ?
>
on top of

http://git.kernel.org/?p=linux/kernel/git/tip/tip.git;a=shortlog;h=refs/heads/x86/mm2

git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git  x86/mm2

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 04/13] x86, mm: Revert back good_end setting for 64bit
  2012-10-04 21:40           ` Yinghai Lu
@ 2012-10-04 21:41             ` H. Peter Anvin
  2012-10-04 21:46               ` Yinghai Lu
  0 siblings, 1 reply; 57+ messages in thread
From: H. Peter Anvin @ 2012-10-04 21:41 UTC (permalink / raw)
  To: Yinghai Lu
  Cc: Konrad Rzeszutek Wilk, Stefano Stabellini, Thomas Gleixner,
	Ingo Molnar, Jacob Shin, Tejun Heo, linux-kernel

On 10/04/2012 02:40 PM, Yinghai Lu wrote:
> On Thu, Oct 4, 2012 at 2:21 PM, Yinghai Lu <yinghai@kernel.org> wrote:
>> On Thu, Oct 4, 2012 at 9:45 AM, Konrad Rzeszutek Wilk
>> <konrad.wilk@oracle.com> wrote:
>>>> So could use ram under 1M to map that page table at first.
>>>
>>> Could or does this patch do it? And why 1MB?
>>
>> can you or stefano could test attached patch on xen ?
>>
> on top of
> 
> http://git.kernel.org/?p=linux/kernel/git/tip/tip.git;a=shortlog;h=refs/heads/x86/mm2
> 
> git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git  x86/mm2
> 

No, no, not yet another ad hoc hack... please.

	-hpa



^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 04/13] x86, mm: Revert back good_end setting for 64bit
  2012-10-04 21:41             ` H. Peter Anvin
@ 2012-10-04 21:46               ` Yinghai Lu
  2012-10-04 21:54                 ` H. Peter Anvin
  0 siblings, 1 reply; 57+ messages in thread
From: Yinghai Lu @ 2012-10-04 21:46 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Konrad Rzeszutek Wilk, Stefano Stabellini, Thomas Gleixner,
	Ingo Molnar, Jacob Shin, Tejun Heo, linux-kernel

On Thu, Oct 4, 2012 at 2:41 PM, H. Peter Anvin <hpa@zytor.com> wrote:
> On 10/04/2012 02:40 PM, Yinghai Lu wrote:
>> On Thu, Oct 4, 2012 at 2:21 PM, Yinghai Lu <yinghai@kernel.org> wrote:
>>> On Thu, Oct 4, 2012 at 9:45 AM, Konrad Rzeszutek Wilk
>>> <konrad.wilk@oracle.com> wrote:
>>>>> So could use ram under 1M to map that page table at first.
>>>>
>>>> Could or does this patch do it? And why 1MB?
>>>
>>> can you or stefano could test attached patch on xen ?
>>>
>> on top of
>>
>> http://git.kernel.org/?p=linux/kernel/git/tip/tip.git;a=shortlog;h=refs/heads/x86/mm2
>>
>> git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git  x86/mm2
>>
>
> No, no, not yet another ad hoc hack... please.
>

or let Xen map that page table by itself at first?

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 04/13] x86, mm: Revert back good_end setting for 64bit
  2012-10-04 13:56       ` Konrad Rzeszutek Wilk
@ 2012-10-04 21:52         ` H. Peter Anvin
  0 siblings, 0 replies; 57+ messages in thread
From: H. Peter Anvin @ 2012-10-04 21:52 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: Jacob Shin, Stefano Stabellini, Yinghai Lu, Thomas Gleixner,
	Ingo Molnar, Tejun Heo, linux-kernel, Konrad Rzeszutek Wilk

On 10/04/2012 06:56 AM, Konrad Rzeszutek Wilk wrote:
>
> What Peter had in mind is a nice system where we get rid of
> this linear allocation of page-tables (so pgt_buf_start -> pgt_buf
> _end are linearly allocated). His thinking (and Peter if I mess
> up please correct me), is that we can stick the various pagetables
> in different spots in memory. Mainly that as we look at mapping
> a region (say 0GB->1GB), we look at in chunks (2MB?) and allocate
> a page-table at the _end_ of the newly mapped chunk if we have
> filled all entries in said pagetable.
>
> For simplicity, lets say we are just dealing with PTE tables and
> we are mapping the region 0GB->1GB with 4KB pages.
>
> First we stick a page-table (or if there is a found one reuse it)
> at the start of the region (so 0-2MB).
>
> 0MB.......................2MB
> /-----\
> |PTE_A|
> \-----/
>
> The PTE entries in it will cover 0->2MB (PTE table #A) and once it is
> finished, it will stick a new pagetable at the end of the 2MB region:
>
> 0MB.......................2MB...........................4MB
> /-----\                /-----\
> |PTE_A|                |PTE_B|
> \-----/                \-----/
>
>
> The PTE_B page table will be used to map 2MB->4MB.
>
> Once that is finished .. we repeat the cycle.
>
> That should remove the utter duct-tape madness and make this a lot
> easier.
>

You got the basic idea right but the details slightly wrong.  Let me try 
to explain.

When we start up, we know we have a set of page tables which maps the 
kernel text, data, bss and brk.  This is set up by the startup code on 
native and by the domain builder on Xen.

We can reserve an arbitrary chunk of brk that is (a) big enough to map 
the kernel text+data+bss+brk itself plus (b) some arbitrary additional 
chunk of memory (perhaps we reserve another 256K of brk or so, enough to 
map 128 MB in the worst case of 4K PAE pages.)

Step 1:

- Create page table mappings for kernel text+data+bss+brk out of the
   brk region.

Step 2:

- Start creating mappings for the topmost memory region downward, until
   the brk reserved area is exhausted.

Step 3:

- Call a paravirt hook on the page tables created so far.  On native
   this does nothing, on Xen it can map it readonly and tell the
   hypervisor it is a page table.

Step 4:

- Switch to the newly created page table.  The bootup page table is now
   obsolete.

Step 5:

- Moving downward from the last address mapped, create new page tables
   for any additional unmapped memory region until either we run out of
   unmapped memory regions, or we run out of mapped memory for
   the memory regions to map.

Step 6:

- Call the paravirt hook for the new page tables, then add them to the
   page table tree.

Step 7:

- Repeat from step 5 until there are no more unmapped memory regions.


This:

a) removes any need to guesstimate how much memory the page tables are
    going to consume.  We simply construct them; they may not be
    contiguous but that's okay.

b) very cleanly solves the Xen problem of not wanting to status-flip
    pages any more than necessary.


The only reason for moving downward rather than upward is that we want 
the page tables as high as possible in memory, since memory at low 
addresses is precious (for stupid DMA devices, for things like 
kexec/kdump, and so on.)
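
For the record, a rough user-space sketch of that flow. The helper
names (map_chunk, pv_hook) are made up, the 128 MB chunk just reuses
the worst-case figure above, and this only models the bookkeeping, not
real page-table construction:

#include <stdio.h>

#define MB	(1ULL << 20)
#define CHUNK	(128 * MB)	/* what the brk reserve can map, worst case */

/* steps 3/6: no-op on native; on Xen, mark the new tables RO, etc. */
static void pv_hook(unsigned long long from, unsigned long long to)
{
	printf("  pv hook: tables mapping [%llx, %llx) are now live\n",
	       from, to);
}

/* pretend to build page tables for [start, end) out of memory that
 * is already mapped */
static void map_chunk(unsigned long long start, unsigned long long end)
{
	printf("  mapped [%llx, %llx)\n", start, end);
}

int main(void)
{
	/* one RAM region for simplicity: 16M .. 8G */
	unsigned long long ram_start = 16 * MB, ram_end = 8192 * MB;
	unsigned long long mapped_low;

	/* steps 1+2: the brk reserve is enough to map the topmost chunk */
	mapped_low = ram_end - CHUNK;
	map_chunk(mapped_low, ram_end);
	pv_hook(mapped_low, ram_end);	/* step 3, then switch (step 4) */

	/* steps 5-7: walk downward; each new chunk's tables come from
	 * memory mapped in the previous iteration */
	while (mapped_low > ram_start) {
		unsigned long long new_low =
			mapped_low - ram_start > CHUNK ?
			mapped_low - CHUNK : ram_start;

		map_chunk(new_low, mapped_low);
		pv_hook(new_low, mapped_low);
		mapped_low = new_low;
	}
	printf("all of [%llx, %llx) mapped\n", ram_start, ram_end);
	return 0;
}

Under these assumptions the only fixed cost is the small brk window;
everything else is carved out of memory that is already mapped, as high
up as possible.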

	-hpa





-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 04/13] x86, mm: Revert back good_end setting for 64bit
  2012-10-04 21:46               ` Yinghai Lu
@ 2012-10-04 21:54                 ` H. Peter Anvin
  2012-10-05  7:46                   ` Yinghai Lu
  0 siblings, 1 reply; 57+ messages in thread
From: H. Peter Anvin @ 2012-10-04 21:54 UTC (permalink / raw)
  To: Yinghai Lu
  Cc: Konrad Rzeszutek Wilk, Stefano Stabellini, Thomas Gleixner,
	Ingo Molnar, Jacob Shin, Tejun Heo, linux-kernel

On 10/04/2012 02:46 PM, Yinghai Lu wrote:
>
> or let xen map that page table by itself at first?
>

See my other post.  This is bringing up the Kernel Summit algorithm again.

	-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 04/13] x86, mm: Revert back good_end setting for 64bit
  2012-10-04 21:54                 ` H. Peter Anvin
@ 2012-10-05  7:46                   ` Yinghai Lu
  2012-10-05 11:27                     ` Stefano Stabellini
  0 siblings, 1 reply; 57+ messages in thread
From: Yinghai Lu @ 2012-10-05  7:46 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Konrad Rzeszutek Wilk, Stefano Stabellini, Thomas Gleixner,
	Ingo Molnar, Jacob Shin, Tejun Heo, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 238 bytes --]

On Thu, Oct 4, 2012 at 2:54 PM, H. Peter Anvin <hpa@zytor.com> wrote:
>
> See my other post.  This is bringing up the Kernel Summit algorithm again.
>

Sure. Please check if you are OK with the attached one on top of x86/mm2.

Thanks

Yinghai

[-- Attachment #2: fix_max_pfn_xx_11.patch --]
[-- Type: application/octet-stream, Size: 5011 bytes --]

Subject: [PATCH] x86: get early page table from BRK

Set up pgt_buf early from BRK, and use it to map the page tables at first.

Also use what is left of it first, then use the newly extended one.

Signed-off-by: Yinghai Lu <yinghai@kernel.org>

---
 arch/x86/include/asm/init.h    |    4 ++++
 arch/x86/include/asm/pgtable.h |    1 +
 arch/x86/kernel/setup.c        |    2 ++
 arch/x86/mm/init.c             |   23 +++++++++++++++++++++++
 arch/x86/mm/init_32.c          |    8 ++++++--
 arch/x86/mm/init_64.c          |    8 ++++++--
 6 files changed, 42 insertions(+), 4 deletions(-)

Index: linux-2.6/arch/x86/include/asm/pgtable.h
===================================================================
--- linux-2.6.orig/arch/x86/include/asm/pgtable.h
+++ linux-2.6/arch/x86/include/asm/pgtable.h
@@ -599,6 +599,7 @@ static inline int pgd_none(pgd_t pgd)
 
 extern int direct_gbpages;
 void init_mem_mapping(void);
+void early_alloc_pgt_buf(void);
 
 /* local pte updates need not use xchg for locking */
 static inline pte_t native_local_ptep_get_and_clear(pte_t *ptep)
Index: linux-2.6/arch/x86/kernel/setup.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/setup.c
+++ linux-2.6/arch/x86/kernel/setup.c
@@ -950,6 +950,8 @@ void __init setup_arch(char **cmdline_p)
 
 	reserve_ibft_region();
 
+	early_alloc_pgt_buf();
+
 	/*
 	 * Need to conclude brk, before memblock_x86_fill()
 	 *  it could use memblock_find_in_range, could overlap with
Index: linux-2.6/arch/x86/mm/init.c
===================================================================
--- linux-2.6.orig/arch/x86/mm/init.c
+++ linux-2.6/arch/x86/mm/init.c
@@ -21,6 +21,10 @@ unsigned long __initdata pgt_buf_start;
 unsigned long __meminitdata pgt_buf_end;
 unsigned long __meminitdata pgt_buf_top;
 
+unsigned long __initdata early_pgt_buf_start;
+unsigned long __meminitdata early_pgt_buf_end;
+unsigned long __meminitdata early_pgt_buf_top;
+
 int after_bootmem;
 
 int direct_gbpages
@@ -291,6 +295,11 @@ static void __init find_early_table_spac
 	if (!base)
 		panic("Cannot find space for the kernel page tables");
 
+	init_memory_mapping(base, base + tables);
+	printk(KERN_DEBUG "kernel direct mapping tables from %#llx to %#llx @ [mem %#010lx-%#010lx]\n",
+		base, base + tables - 1, early_pgt_buf_start << PAGE_SHIFT,
+		(early_pgt_buf_end << PAGE_SHIFT) - 1);
+
 	pgt_buf_start = base >> PAGE_SHIFT;
 	pgt_buf_end = pgt_buf_start;
 	pgt_buf_top = pgt_buf_start + (tables >> PAGE_SHIFT);
@@ -437,6 +446,20 @@ void __init init_mem_mapping(void)
 	early_memtest(0, max_pfn_mapped << PAGE_SHIFT);
 }
 
+RESERVE_BRK(early_pgt_alloc, 16384);
+
+void  __init early_alloc_pgt_buf(void)
+{
+	unsigned long tables = 13864;
+	phys_addr_t base;
+
+	base = __pa(extend_brk(tables, PAGE_SIZE));
+
+	early_pgt_buf_start = base >> PAGE_SHIFT;
+	early_pgt_buf_end = early_pgt_buf_start;
+	early_pgt_buf_top = early_pgt_buf_start + (tables >> PAGE_SHIFT);
+}
+
 /*
  * devmem_is_allowed() checks to see if /dev/mem access to a certain address
  * is valid. The argument is a physical page number.
Index: linux-2.6/arch/x86/mm/init_32.c
===================================================================
--- linux-2.6.orig/arch/x86/mm/init_32.c
+++ linux-2.6/arch/x86/mm/init_32.c
@@ -61,10 +61,14 @@ bool __read_mostly __vmalloc_start_set =
 
 static __init void *alloc_low_page(void)
 {
-	unsigned long pfn = pgt_buf_end++;
+	unsigned long pfn;
 	void *adr;
 
-	if (pfn >= pgt_buf_top)
+	if (early_pgt_buf_end < early_pgt_buf_top)
+		pfn = early_pgt_buf_end++;
+	else if (pgt_buf_end < pgt_buf_top)
+		pfn = pgt_buf_end++;
+	else
 		panic("alloc_low_page: ran out of memory");
 
 	adr = __va(pfn * PAGE_SIZE);
Index: linux-2.6/arch/x86/mm/init_64.c
===================================================================
--- linux-2.6.orig/arch/x86/mm/init_64.c
+++ linux-2.6/arch/x86/mm/init_64.c
@@ -318,7 +318,7 @@ void __init cleanup_highmap(void)
 
 static __ref void *alloc_low_page(unsigned long *phys)
 {
-	unsigned long pfn = pgt_buf_end++;
+	unsigned long pfn;
 	void *adr;
 
 	if (after_bootmem) {
@@ -328,7 +328,11 @@ static __ref void *alloc_low_page(unsign
 		return adr;
 	}
 
-	if (pfn >= pgt_buf_top)
+	if (early_pgt_buf_end < early_pgt_buf_top)
+		pfn = early_pgt_buf_end++;
+	else if (pgt_buf_end < pgt_buf_top)
+		pfn = pgt_buf_end++;
+	else
 		panic("alloc_low_page: ran out of memory");
 
 	adr = early_memremap(pfn * PAGE_SIZE, PAGE_SIZE);
Index: linux-2.6/arch/x86/include/asm/init.h
===================================================================
--- linux-2.6.orig/arch/x86/include/asm/init.h
+++ linux-2.6/arch/x86/include/asm/init.h
@@ -16,4 +16,8 @@ extern unsigned long __initdata pgt_buf_
 extern unsigned long __meminitdata pgt_buf_end;
 extern unsigned long __meminitdata pgt_buf_top;
 
+extern unsigned long __initdata early_pgt_buf_start;
+extern unsigned long __meminitdata early_pgt_buf_end;
+extern unsigned long __meminitdata early_pgt_buf_top;
+
 #endif /* _ASM_X86_INIT_32_H */

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 04/13] x86, mm: Revert back good_end setting for 64bit
  2012-10-04 15:57     ` Yinghai Lu
  2012-10-04 16:45       ` Konrad Rzeszutek Wilk
@ 2012-10-05 10:47       ` Stefano Stabellini
  1 sibling, 0 replies; 57+ messages in thread
From: Stefano Stabellini @ 2012-10-05 10:47 UTC (permalink / raw)
  To: Yinghai Lu
  Cc: Stefano Stabellini, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	Jacob Shin, Tejun Heo, linux-kernel, Konrad Rzeszutek Wilk

On Thu, 4 Oct 2012, Yinghai Lu wrote:
> On Mon, Oct 1, 2012 at 4:00 AM, Stefano Stabellini
> <stefano.stabellini@eu.citrix.com> wrote:
> > On Sun, 30 Sep 2012, Yinghai Lu wrote:
> >> After
> >>
> >> | commit 8548c84da2f47e71bbbe300f55edb768492575f7
> >> | Author: Takashi Iwai <tiwai@suse.de>
> >> | Date:   Sun Oct 23 23:19:12 2011 +0200
> >> |
> >> |    x86: Fix S4 regression
> >> |
> >> |    Commit 4b239f458 ("x86-64, mm: Put early page table high") causes a S4
> >> |    regression since 2.6.39, namely the machine reboots occasionally at S4
> >> |    resume.  It doesn't happen always, overall rate is about 1/20.  But,
> >> |    like other bugs, once when this happens, it continues to happen.
> >> |
> >> |    This patch fixes the problem by essentially reverting the memory
> >> |    assignment in the older way.
> >>
> >> Have some page table around 512M again, that will prevent kdump to find 512M
> >> under 768M.
> >>
> >> We need revert that reverting, so we could put page table high again for 64bit.
> >>
> >> Takashi agreed that S4 regression could be something else.
> >>
> >>       https://lkml.org/lkml/2012/6/15/182
> >>
> >> Signed-off-by: Yinghai Lu <yinghai@kernel.org>
> >> ---
> >>  arch/x86/mm/init.c |    2 +-
> >>  1 files changed, 1 insertions(+), 1 deletions(-)
> >>
> >> diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
> >> index 9f69180..aadb154 100644
> >> --- a/arch/x86/mm/init.c
> >> +++ b/arch/x86/mm/init.c
> >> @@ -76,8 +76,8 @@ static void __init find_early_table_space(struct map_range *mr,
> >>  #ifdef CONFIG_X86_32
> >>       /* for fixmap */
> >>       tables += roundup(__end_of_fixed_addresses * sizeof(pte_t), PAGE_SIZE);
> >> -#endif
> >>       good_end = max_pfn_mapped << PAGE_SHIFT;
> >> +#endif
> >>
> >>       base = memblock_find_in_range(start, good_end, tables, PAGE_SIZE);
> >>       if (!base)
> >
> > Isn't this going to cause init_memory_mapping to allocate pagetable
> > pages from memory not yet mapped?
> 
> but 64bit is using ioremap to access those page table buf.

Yes, but as Konrad explained, the mapping should be RO or RW on Xen
depending on whether the pagetable pages are already hooked into the
pagetable or not. ioremap could be called on not-yet-hooked pagetable
pages (or non-pagetable-pages) and already-hooked pagetable pages and
they need to be marked differently. Finally when you are going to map
the region in memory that contains the pagetable pages, the entire range
needs to be marked RO.

These issues could be avoided if the pagetable pages were allocated from
a memory region that is already mapped and we had a proper pvop to warn
the Xen subsystem that the pagetable pages are about to be hooked into
the live pagetable.
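
A compact sketch of the flow being asked for - all of the names here
(pagetable_prepare, table_page) are hypothetical, no such pvop exists
today:

#include <stdio.h>
#include <string.h>

#define PAGE_SIZE 4096

/* the page-table page comes from memory that is already mapped
 * (here a static buffer stands in for brk / low RAM) */
static unsigned char table_page[PAGE_SIZE];

/* hypothetical pvop: a no-op on native; on Xen it would remap the
 * page RO and tell the hypervisor it is about to become a pagetable */
static void pagetable_prepare(void *page)
{
	printf("pvop: %p is about to be hooked into the live pagetable\n",
	       page);
}

int main(void)
{
	memset(table_page, 0, sizeof(table_page)); /* fill PTEs, RW is fine */
	pagetable_prepare(table_page);             /* warn the hypervisor */
	/* ... only now install the PMD entry pointing at table_page ... */
	return 0;
}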


> > Last time I spoke with HPA and Thomas about this, they seem to agree
> > that it isn't a very good idea.
> > Also, it is proven to cause a certain amount of headaches on Xen,
> > see commit d8aa5ec3382e6a545b8f25178d1e0992d4927f19.
> 
> this patchset will allocate page table buf one time only.
> So could use ram under 1M to map that page table at first.

I don't have anything agaist the goal of this series, the problem is
just where these pagetable pages come from.


> so that will make it xen happy ?

It would greatly simplify our life if the pagetable pages were
allocated from memory already mapped.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 04/13] x86, mm: Revert back good_end setting for 64bit
  2012-10-05  7:46                   ` Yinghai Lu
@ 2012-10-05 11:27                     ` Stefano Stabellini
  2012-10-05 14:58                       ` Yinghai Lu
  0 siblings, 1 reply; 57+ messages in thread
From: Stefano Stabellini @ 2012-10-05 11:27 UTC (permalink / raw)
  To: Yinghai Lu
  Cc: H. Peter Anvin, Konrad Rzeszutek Wilk, Stefano Stabellini,
	Thomas Gleixner, Ingo Molnar, Jacob Shin, Tejun Heo,
	linux-kernel

On Fri, 5 Oct 2012, Yinghai Lu wrote:
> On Thu, Oct 4, 2012 at 2:54 PM, H. Peter Anvin <hpa@zytor.com> wrote:
> >
> > See my other post.  This is bringing up the Kernel Summit algorithm again.
> >
> 
> sure. please check if you are ok with attached one on top of x86/mm2
> 
> Subject: [PATCH] x86: get early page table from BRK
> 
> set pgt_buf early from BRK, and use it to map page table at first.
> 
> also use the left at first, then use new extend one.
> 
> Signed-off-by: Yinghai Lu <yinghai@kernel.org>

If I read the patch correctly, it wouldn't actually change the pagetable
allocation flow or implement Peter's suggestion.
However it would pre-map (pgt_buf_start-pgt_buf_top) using brk memory.

So, if that's correct, we could remove the early_memremap call from
alloc_low_page and map_low_page, right?

Also the patch introduces an additional range of pagetable pages
(early_pgt_buf_start-early_pgt_buf_top) and that would need to be
communicated somehow to Xen (that at the moment assumes that the range
is pgt_buf_start-pgt_buf_top. Pretty bad).

Overall I still prefer Peter's suggestion :)


> ---
>  arch/x86/include/asm/init.h    |    4 ++++
>  arch/x86/include/asm/pgtable.h |    1 +
>  arch/x86/kernel/setup.c        |    2 ++
>  arch/x86/mm/init.c             |   23 +++++++++++++++++++++++
>  arch/x86/mm/init_32.c          |    8 ++++++--
>  arch/x86/mm/init_64.c          |    8 ++++++--
>  6 files changed, 42 insertions(+), 4 deletions(-)
> 
> Index: linux-2.6/arch/x86/include/asm/pgtable.h
> ===================================================================
> --- linux-2.6.orig/arch/x86/include/asm/pgtable.h
> +++ linux-2.6/arch/x86/include/asm/pgtable.h
> @@ -599,6 +599,7 @@ static inline int pgd_none(pgd_t pgd)
>  
>  extern int direct_gbpages;
>  void init_mem_mapping(void);
> +void early_alloc_pgt_buf(void);
>  
>  /* local pte updates need not use xchg for locking */
>  static inline pte_t native_local_ptep_get_and_clear(pte_t *ptep)
> Index: linux-2.6/arch/x86/kernel/setup.c
> ===================================================================
> --- linux-2.6.orig/arch/x86/kernel/setup.c
> +++ linux-2.6/arch/x86/kernel/setup.c
> @@ -950,6 +950,8 @@ void __init setup_arch(char **cmdline_p)
>  
>  	reserve_ibft_region();
>  
> +	early_alloc_pgt_buf();
> +
>  	/*
>  	 * Need to conclude brk, before memblock_x86_fill()
>  	 *  it could use memblock_find_in_range, could overlap with
> Index: linux-2.6/arch/x86/mm/init.c
> ===================================================================
> --- linux-2.6.orig/arch/x86/mm/init.c
> +++ linux-2.6/arch/x86/mm/init.c
> @@ -21,6 +21,10 @@ unsigned long __initdata pgt_buf_start;
>  unsigned long __meminitdata pgt_buf_end;
>  unsigned long __meminitdata pgt_buf_top;
>  
> +unsigned long __initdata early_pgt_buf_start;
> +unsigned long __meminitdata early_pgt_buf_end;
> +unsigned long __meminitdata early_pgt_buf_top;
> +
>  int after_bootmem;
>  
>  int direct_gbpages
> @@ -291,6 +295,11 @@ static void __init find_early_table_spac
>  	if (!base)
>  		panic("Cannot find space for the kernel page tables");
>  
> +	init_memory_mapping(base, base + tables);
> +	printk(KERN_DEBUG "kernel direct mapping tables from %#llx to %#llx @ [mem %#010lx-%#010lx]\n",
> +		base, base + tables - 1, early_pgt_buf_start << PAGE_SHIFT,
> +		(early_pgt_buf_end << PAGE_SHIFT) - 1);
> +
>  	pgt_buf_start = base >> PAGE_SHIFT;
>  	pgt_buf_end = pgt_buf_start;
>  	pgt_buf_top = pgt_buf_start + (tables >> PAGE_SHIFT);
> @@ -437,6 +446,20 @@ void __init init_mem_mapping(void)
>  	early_memtest(0, max_pfn_mapped << PAGE_SHIFT);
>  }
>  
> +RESERVE_BRK(early_pgt_alloc, 16384);
> +
> +void  __init early_alloc_pgt_buf(void)
> +{
> +	unsigned long tables = 13864;
> +	phys_addr_t base;
> +
> +	base = __pa(extend_brk(tables, PAGE_SIZE));
> +
> +	early_pgt_buf_start = base >> PAGE_SHIFT;
> +	early_pgt_buf_end = early_pgt_buf_start;
> +	early_pgt_buf_top = early_pgt_buf_start + (tables >> PAGE_SHIFT);
> +}
> +
>  /*
>   * devmem_is_allowed() checks to see if /dev/mem access to a certain address
>   * is valid. The argument is a physical page number.
> Index: linux-2.6/arch/x86/mm/init_32.c
> ===================================================================
> --- linux-2.6.orig/arch/x86/mm/init_32.c
> +++ linux-2.6/arch/x86/mm/init_32.c
> @@ -61,10 +61,14 @@ bool __read_mostly __vmalloc_start_set =
>  
>  static __init void *alloc_low_page(void)
>  {
> -	unsigned long pfn = pgt_buf_end++;
> +	unsigned long pfn;
>  	void *adr;
>  
> -	if (pfn >= pgt_buf_top)
> +	if (early_pgt_buf_end < early_pgt_buf_top)
> +		pfn = early_pgt_buf_end++;
> +	else if (pgt_buf_end < pgt_buf_top)
> +		pfn = pgt_buf_end++;
> +	else
>  		panic("alloc_low_page: ran out of memory");
>  
>  	adr = __va(pfn * PAGE_SIZE);
> Index: linux-2.6/arch/x86/mm/init_64.c
> ===================================================================
> --- linux-2.6.orig/arch/x86/mm/init_64.c
> +++ linux-2.6/arch/x86/mm/init_64.c
> @@ -318,7 +318,7 @@ void __init cleanup_highmap(void)
>  
>  static __ref void *alloc_low_page(unsigned long *phys)
>  {
> -	unsigned long pfn = pgt_buf_end++;
> +	unsigned long pfn;
>  	void *adr;
>  
>  	if (after_bootmem) {
> @@ -328,7 +328,11 @@ static __ref void *alloc_low_page(unsign
>  		return adr;
>  	}
>  
> -	if (pfn >= pgt_buf_top)
> +	if (early_pgt_buf_end < early_pgt_buf_top)
> +		pfn = early_pgt_buf_end++;
> +	else if (pgt_buf_end < pgt_buf_top)
> +		pfn = pgt_buf_end++;
> +	else
>  		panic("alloc_low_page: ran out of memory");
>  
>  	adr = early_memremap(pfn * PAGE_SIZE, PAGE_SIZE);

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 04/13] x86, mm: Revert back good_end setting for 64bit
  2012-10-05 11:27                     ` Stefano Stabellini
@ 2012-10-05 14:58                       ` Yinghai Lu
  2012-10-06  7:44                         ` [PATCH 0/3] x86: pre mapping page table to make xen happy Yinghai Lu
  2012-10-08  6:36                         ` [PATCH 04/13] x86, mm: Revert back good_end setting for 64bit Yinghai Lu
  0 siblings, 2 replies; 57+ messages in thread
From: Yinghai Lu @ 2012-10-05 14:58 UTC (permalink / raw)
  To: Stefano Stabellini
  Cc: H. Peter Anvin, Konrad Rzeszutek Wilk, Thomas Gleixner,
	Ingo Molnar, Jacob Shin, Tejun Heo, linux-kernel

On Fri, Oct 5, 2012 at 4:27 AM, Stefano Stabellini
<stefano.stabellini@eu.citrix.com> wrote:
> On Fri, 5 Oct 2012, Yinghai Lu wrote:
>> On Thu, Oct 4, 2012 at 2:54 PM, H. Peter Anvin <hpa@zytor.com> wrote:
>> >
>> > See my other post.  This is bringing up the Kernel Summit algorithm again.
>> >
>>
>> sure. please check if you are ok with attached one on top of x86/mm2
>>
>> Subject: [PATCH] x86: get early page table from BRK
>>
>> set pgt_buf early from BRK, and use it to map page table at first.
>>
>> also use the left at first, then use new extend one.
>>
>> Signed-off-by: Yinghai Lu <yinghai@kernel.org>
>
> If I read the patch correctly, it wouldn't actually change the pagetable
> allocation flow or implement Peter's suggestion.
> However it would pre-map (pgt_buf_start-pgt_buf_top) using brk memory.
>
> So, if that's correct, we could remove the early_memremap call from
> alloc_low_page and map_low_page, right?
>
> Also the patch introduces an additional range of pagetable pages
> (early_pgt_buf_start-early_pgt_buf_top) and that would need to be
> communicated somehow to Xen (that at the moment assumes that the range
> is pgt_buf_start-pgt_buf_top. Pretty bad).

that will need two extra lines for Xen:
@@ -430,6 +439,8 @@ void __init init_mem_mapping(void)
 		x86_init.mapping.pagetable_reserve(PFN_PHYS(pgt_buf_start),
 				PFN_PHYS(pgt_buf_end));
 	}
+	x86_init.mapping.pagetable_reserve(PFN_PHYS(early_pgt_buf_start),
+					PFN_PHYS(early_pgt_buf_end));

 	/* stop the wrong using */
 	pgt_buf_top = 0;


>
> Overall I still prefer Peter's suggestion :)
>
>
>> ---
>>  arch/x86/include/asm/init.h    |    4 ++++
>>  arch/x86/include/asm/pgtable.h |    1 +
>>  arch/x86/kernel/setup.c        |    2 ++
>>  arch/x86/mm/init.c             |   23 +++++++++++++++++++++++
>>  arch/x86/mm/init_32.c          |    8 ++++++--
>>  arch/x86/mm/init_64.c          |    8 ++++++--
>>  6 files changed, 42 insertions(+), 4 deletions(-)
>>
>> Index: linux-2.6/arch/x86/include/asm/pgtable.h
>> ===================================================================
>> --- linux-2.6.orig/arch/x86/include/asm/pgtable.h
>> +++ linux-2.6/arch/x86/include/asm/pgtable.h
>> @@ -599,6 +599,7 @@ static inline int pgd_none(pgd_t pgd)
>>
>>  extern int direct_gbpages;
>>  void init_mem_mapping(void);
>> +void early_alloc_pgt_buf(void);
>>
>>  /* local pte updates need not use xchg for locking */
>>  static inline pte_t native_local_ptep_get_and_clear(pte_t *ptep)
>> Index: linux-2.6/arch/x86/kernel/setup.c
>> ===================================================================
>> --- linux-2.6.orig/arch/x86/kernel/setup.c
>> +++ linux-2.6/arch/x86/kernel/setup.c
>> @@ -950,6 +950,8 @@ void __init setup_arch(char **cmdline_p)
>>
>>       reserve_ibft_region();
>>
>> +     early_alloc_pgt_buf();
>> +
>>       /*
>>        * Need to conclude brk, before memblock_x86_fill()
>>        *  it could use memblock_find_in_range, could overlap with
>> Index: linux-2.6/arch/x86/mm/init.c
>> ===================================================================
>> --- linux-2.6.orig/arch/x86/mm/init.c
>> +++ linux-2.6/arch/x86/mm/init.c
>> @@ -21,6 +21,10 @@ unsigned long __initdata pgt_buf_start;
>>  unsigned long __meminitdata pgt_buf_end;
>>  unsigned long __meminitdata pgt_buf_top;
>>
>> +unsigned long __initdata early_pgt_buf_start;
>> +unsigned long __meminitdata early_pgt_buf_end;
>> +unsigned long __meminitdata early_pgt_buf_top;
>> +
>>  int after_bootmem;
>>
>>  int direct_gbpages
>> @@ -291,6 +295,11 @@ static void __init find_early_table_spac
>>       if (!base)
>>               panic("Cannot find space for the kernel page tables");
>>
>> +     init_memory_mapping(base, base + tables);
>> +     printk(KERN_DEBUG "kernel direct mapping tables from %#llx to %#llx @ [mem %#010lx-%#010lx]\n",
>> +             base, base + tables - 1, early_pgt_buf_start << PAGE_SHIFT,
>> +             (early_pgt_buf_end << PAGE_SHIFT) - 1);
>> +
>>       pgt_buf_start = base >> PAGE_SHIFT;
>>       pgt_buf_end = pgt_buf_start;
>>       pgt_buf_top = pgt_buf_start + (tables >> PAGE_SHIFT);
>> @@ -437,6 +446,20 @@ void __init init_mem_mapping(void)
>>       early_memtest(0, max_pfn_mapped << PAGE_SHIFT);
>>  }
>>
>> +RESERVE_BRK(early_pgt_alloc, 16384);
>> +
>> +void  __init early_alloc_pgt_buf(void)
>> +{
>> +     unsigned long tables = 13864;
>> +     phys_addr_t base;
>> +
>> +     base = __pa(extend_brk(tables, PAGE_SIZE));
>> +
>> +     early_pgt_buf_start = base >> PAGE_SHIFT;
>> +     early_pgt_buf_end = early_pgt_buf_start;
>> +     early_pgt_buf_top = early_pgt_buf_start + (tables >> PAGE_SHIFT);
>> +}
>> +
>>  /*
>>   * devmem_is_allowed() checks to see if /dev/mem access to a certain address
>>   * is valid. The argument is a physical page number.
>> Index: linux-2.6/arch/x86/mm/init_32.c
>> ===================================================================
>> --- linux-2.6.orig/arch/x86/mm/init_32.c
>> +++ linux-2.6/arch/x86/mm/init_32.c
>> @@ -61,10 +61,14 @@ bool __read_mostly __vmalloc_start_set =
>>
>>  static __init void *alloc_low_page(void)
>>  {
>> -     unsigned long pfn = pgt_buf_end++;
>> +     unsigned long pfn;
>>       void *adr;
>>
>> -     if (pfn >= pgt_buf_top)
>> +     if (early_pgt_buf_end < early_pgt_buf_top)
>> +             pfn = early_pgt_buf_end++;
>> +     else if (pgt_buf_end < pgt_buf_top)
>> +             pfn = pgt_buf_end++;
>> +     else
>>               panic("alloc_low_page: ran out of memory");
>>
>>       adr = __va(pfn * PAGE_SIZE);
>> Index: linux-2.6/arch/x86/mm/init_64.c
>> ===================================================================
>> --- linux-2.6.orig/arch/x86/mm/init_64.c
>> +++ linux-2.6/arch/x86/mm/init_64.c
>> @@ -318,7 +318,7 @@ void __init cleanup_highmap(void)
>>
>>  static __ref void *alloc_low_page(unsigned long *phys)
>>  {
>> -     unsigned long pfn = pgt_buf_end++;
>> +     unsigned long pfn;
>>       void *adr;
>>
>>       if (after_bootmem) {
>> @@ -328,7 +328,11 @@ static __ref void *alloc_low_page(unsign
>>               return adr;
>>       }
>>
>> -     if (pfn >= pgt_buf_top)
>> +     if (early_pgt_buf_end < early_pgt_buf_top)
>> +             pfn = early_pgt_buf_end++;
>> +     else if (pgt_buf_end < pgt_buf_top)
>> +             pfn = pgt_buf_end++;
>> +     else
>>               panic("alloc_low_page: ran out of memory");
>>
>>       adr = early_memremap(pfn * PAGE_SIZE, PAGE_SIZE);

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 04/13] x86, mm: Revert back good_end setting for 64bit
  2012-10-04 21:29           ` Yinghai Lu
@ 2012-10-05 21:04             ` Eric W. Biederman
  2012-10-05 21:19               ` Yinghai Lu
  0 siblings, 1 reply; 57+ messages in thread
From: Eric W. Biederman @ 2012-10-05 21:04 UTC (permalink / raw)
  To: Yinghai Lu
  Cc: Konrad Rzeszutek Wilk, Jacob Shin, Stefano Stabellini,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Tejun Heo,
	linux-kernel

Yinghai Lu <yinghai@kernel.org> writes:

> On Thu, Oct 4, 2012 at 9:46 AM, Konrad Rzeszutek Wilk
> <konrad.wilk@oracle.com> wrote:
>>> then kdump may have problem get big range again.
>>
>> Is there a git commit that explains what the 'big range' problem is?

At least on x86_64 this was recently tested and anywhere below 4G is
good, and there is a patch floating around somewhere to remove this
issue.

I don't know about x86_32.  

Eric

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 04/13] x86, mm: Revert back good_end setting for 64bit
  2012-10-05 21:04             ` Eric W. Biederman
@ 2012-10-05 21:19               ` Yinghai Lu
  2012-10-05 21:32                 ` Eric W. Biederman
  0 siblings, 1 reply; 57+ messages in thread
From: Yinghai Lu @ 2012-10-05 21:19 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Konrad Rzeszutek Wilk, Jacob Shin, Stefano Stabellini,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Tejun Heo,
	linux-kernel

On Fri, Oct 5, 2012 at 2:04 PM, Eric W. Biederman <ebiederm@xmission.com> wrote:
>>> Is there a git commit that explains what the 'big range' problem is?
>
> At least on x86_64 this was recently tested and anywhere below 4G is
> good, and there is a patch floating around somewhere to remove this
> issue.

patch for kernel or kexec-tools?

Yinghai

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 04/13] x86, mm: Revert back good_end setting for 64bit
  2012-10-05 21:19               ` Yinghai Lu
@ 2012-10-05 21:32                 ` Eric W. Biederman
  2012-10-05 21:37                   ` Yinghai Lu
  2012-10-06  0:17                   ` H. Peter Anvin
  0 siblings, 2 replies; 57+ messages in thread
From: Eric W. Biederman @ 2012-10-05 21:32 UTC (permalink / raw)
  To: Yinghai Lu
  Cc: Konrad Rzeszutek Wilk, Jacob Shin, Stefano Stabellini,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Tejun Heo,
	linux-kernel

Yinghai Lu <yinghai@kernel.org> writes:

> On Fri, Oct 5, 2012 at 2:04 PM, Eric W. Biederman <ebiederm@xmission.com> wrote:
>>>> Is there a git commit that explains what the 'big range' problem is?
>>
>> At least on x86_64 this was recently tested and anywhere below 4G is
>> good, and there is a patch floating around somewhere to remove this
>> issue.
>
> patch for kernel or kexec-tools?

kernel.

The SGI guys needed a kdump kernel with 1G of RAM to dump all of
the memory on one of their crazy large machines, and so they
investigated this.

Basically they found that a kdump kernel loaded anywhere < 4G worked;
the only change that was needed was to relax the 896M hard code.

In one test they had a kdump kernel loaded above 2G.

Eric


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 04/13] x86, mm: Revert back good_end setting for 64bit
  2012-10-05 21:32                 ` Eric W. Biederman
@ 2012-10-05 21:37                   ` Yinghai Lu
  2012-10-05 21:41                     ` Eric W. Biederman
  2012-10-06  0:17                   ` H. Peter Anvin
  1 sibling, 1 reply; 57+ messages in thread
From: Yinghai Lu @ 2012-10-05 21:37 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Konrad Rzeszutek Wilk, Jacob Shin, Stefano Stabellini,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Tejun Heo,
	linux-kernel

On Fri, Oct 5, 2012 at 2:32 PM, Eric W. Biederman <ebiederm@xmission.com> wrote:
> Yinghai Lu <yinghai@kernel.org> writes:
>
>> On Fri, Oct 5, 2012 at 2:04 PM, Eric W. Biederman <ebiederm@xmission.com> wrote:
>>>>> Is there a git commit that explains what the 'big range' problem is?
>>>
>>> At least on x86_64 this was recently tested and anywhere below 4G is
>>> good, and there is a patch floating around somewhere to remove this
>>> issue.
>>
>> patch for kernel or kexec-tools?
>
> kernel.
>
> The sgi guys needed a kdump kernel with 1G of ram to dump their all of
> the memory on one of their crazy large machines and so investigated
> this.
>
> Basically they found that a kdump kernel loaded anywhere < 4G worked,
> the only change that was needed was to relaxy the 896M hard code.
>
> In one test they had a kdump kernel loaded above 2G.

with bzImage or vmlinux?

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 04/13] x86, mm: Revert back good_end setting for 64bit
  2012-10-05 21:37                   ` Yinghai Lu
@ 2012-10-05 21:41                     ` Eric W. Biederman
  2012-10-05 21:43                       ` Yinghai Lu
  2012-10-06  0:18                       ` [PATCH 04/13] x86, mm: Revert back good_end setting for 64bit H. Peter Anvin
  0 siblings, 2 replies; 57+ messages in thread
From: Eric W. Biederman @ 2012-10-05 21:41 UTC (permalink / raw)
  To: Yinghai Lu
  Cc: Konrad Rzeszutek Wilk, Jacob Shin, Stefano Stabellini,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Tejun Heo,
	linux-kernel

Yinghai Lu <yinghai@kernel.org> writes:

> with bzImage or vmlinux?

bzImage I presume.  Certainly the bzImage has lost its 896M limit,
which is where the 896M limit ultimately came from.

Eric


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 04/13] x86, mm: Revert back good_end setting for 64bit
  2012-10-05 21:41                     ` Eric W. Biederman
@ 2012-10-05 21:43                       ` Yinghai Lu
  2012-10-05 22:01                         ` 896MB address limit (was: Re: [PATCH 04/13] x86, mm: Revert back good_end setting for 64bit) Eric W. Biederman
  2012-10-06  0:18                       ` [PATCH 04/13] x86, mm: Revert back good_end setting for 64bit H. Peter Anvin
  1 sibling, 1 reply; 57+ messages in thread
From: Yinghai Lu @ 2012-10-05 21:43 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Konrad Rzeszutek Wilk, Jacob Shin, Stefano Stabellini,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Tejun Heo,
	linux-kernel

On Fri, Oct 5, 2012 at 2:41 PM, Eric W. Biederman <ebiederm@xmission.com> wrote:
> Yinghai Lu <yinghai@kernel.org> writes:
>
>> with bzImage or vmlinux?
>
> bzImage I presume.  Certainly the bzImage has lost it's 896M limit,
> which is where ultimiately the 896M limite came from.

Are they using an updated kexec-tools?

Last time I checked the kexec-tools code, I found the 896M problem
came from its bzImage support.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* 896MB address limit (was: Re: [PATCH 04/13] x86, mm: Revert back good_end setting for 64bit)
  2012-10-05 21:43                       ` Yinghai Lu
@ 2012-10-05 22:01                         ` Eric W. Biederman
  0 siblings, 0 replies; 57+ messages in thread
From: Eric W. Biederman @ 2012-10-05 22:01 UTC (permalink / raw)
  To: Yinghai Lu
  Cc: Konrad Rzeszutek Wilk, Jacob Shin, Stefano Stabellini,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Tejun Heo,
	linux-kernel, kexec, Cliff Wickman


I am going to see about merging these two threads.

Yinghai Lu <yinghai@kernel.org> writes:

> On Fri, Oct 5, 2012 at 2:41 PM, Eric W. Biederman <ebiederm@xmission.com> wrote:
>> Yinghai Lu <yinghai@kernel.org> writes:
>>
>>> with bzImage or vmlinux?
>>
>> bzImage I presume.  Certainly the bzImage has lost it's 896M limit,
>> which is where ultimiately the 896M limite came from.
>
> they are using updated kexec-tools ?
>
> last time when i checked the code for kexec-tools
> found the 896M problem was from kexec-tools bzimage support.

Cliff Wickman was the guy at SGI running the tests.

To the best of my knowledge he was running an up-to-date kexec-tools and
was loading a bzImage.  Of course his initial reaction was where did the
896M limit come from, as he had just updated to a kernel with the limit
a few weeks ago.

YH please talk to Cliff directly.

Eric


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 04/13] x86, mm: Revert back good_end setting for 64bit
  2012-10-05 21:32                 ` Eric W. Biederman
  2012-10-05 21:37                   ` Yinghai Lu
@ 2012-10-06  0:17                   ` H. Peter Anvin
  2012-10-06  0:28                     ` Eric W. Biederman
  1 sibling, 1 reply; 57+ messages in thread
From: H. Peter Anvin @ 2012-10-06  0:17 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Yinghai Lu, Konrad Rzeszutek Wilk, Jacob Shin,
	Stefano Stabellini, Thomas Gleixner, Ingo Molnar, Tejun Heo,
	linux-kernel

On 10/05/2012 02:32 PM, Eric W. Biederman wrote:
> Yinghai Lu <yinghai@kernel.org> writes:
>
>> On Fri, Oct 5, 2012 at 2:04 PM, Eric W. Biederman <ebiederm@xmission.com> wrote:
>>>>> Is there a git commit that explains what the 'big range' problem is?
>>>
>>> At least on x86_64 this was recently tested and anywhere below 4G is
>>> good, and there is a patch floating around somewhere to remove this
>>> issue.
>>
>> patch for kernel or kexec-tools?
>
> kernel.
>
> The sgi guys needed a kdump kernel with 1G of ram to dump their all of
> the memory on one of their crazy large machines and so investigated
> this.
>
> Basically they found that a kdump kernel loaded anywhere < 4G worked,
> the only change that was needed was to relaxy the 896M hard code.
>
> In one test they had a kdump kernel loaded above 2G.
>

Seriously, any case where we can't load anywhere in physical ram on 
x86-64 is a bug.  i386 is another matter.

	-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 04/13] x86, mm: Revert back good_end setting for 64bit
  2012-10-05 21:41                     ` Eric W. Biederman
  2012-10-05 21:43                       ` Yinghai Lu
@ 2012-10-06  0:18                       ` H. Peter Anvin
  2012-10-06  0:45                         ` Eric W. Biederman
  1 sibling, 1 reply; 57+ messages in thread
From: H. Peter Anvin @ 2012-10-06  0:18 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Yinghai Lu, Konrad Rzeszutek Wilk, Jacob Shin,
	Stefano Stabellini, Thomas Gleixner, Ingo Molnar, Tejun Heo,
	linux-kernel

On 10/05/2012 02:41 PM, Eric W. Biederman wrote:
> Yinghai Lu <yinghai@kernel.org> writes:
>
>> with bzImage or vmlinux?
>
> bzImage I presume.  Certainly the bzImage has lost it's 896M limit,
> which is where ultimiately the 896M limite came from.
>

The ~896M actually comes from i386, not from bzImage...

	-hpa


-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 04/13] x86, mm: Revert back good_end setting for 64bit
  2012-10-06  0:17                   ` H. Peter Anvin
@ 2012-10-06  0:28                     ` Eric W. Biederman
  2012-10-06  0:36                       ` H. Peter Anvin
  0 siblings, 1 reply; 57+ messages in thread
From: Eric W. Biederman @ 2012-10-06  0:28 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Yinghai Lu, Konrad Rzeszutek Wilk, Jacob Shin,
	Stefano Stabellini, Thomas Gleixner, Ingo Molnar, Tejun Heo,
	linux-kernel

"H. Peter Anvin" <hpa@zytor.com> writes:

> On 10/05/2012 02:32 PM, Eric W. Biederman wrote:
>> Yinghai Lu <yinghai@kernel.org> writes:
>>
>>> On Fri, Oct 5, 2012 at 2:04 PM, Eric W. Biederman <ebiederm@xmission.com> wrote:
>>>>>> Is there a git commit that explains what the 'big range' problem is?
>>>>
>>>> At least on x86_64 this was recently tested and anywhere below 4G is
>>>> good, and there is a patch floating around somewhere to remove this
>>>> issue.
>>>
>>> patch for kernel or kexec-tools?
>>
>> kernel.
>>
>> The sgi guys needed a kdump kernel with 1G of ram to dump their all of
>> the memory on one of their crazy large machines and so investigated
>> this.
>>
>> Basically they found that a kdump kernel loaded anywhere < 4G worked,
>> the only change that was needed was to relaxy the 896M hard code.
>>
>> In one test they had a kdump kernel loaded above 2G.
>>
>
> Seriously, any case where we can't load anywhere in physical ram on x86-64 is a
> bug.  i386 is another matter.

As I recall there are data structures like the IDT that only have a
32bit base address.

According to the bzImage header we don't support ramdisks above 4G.
I think we also have a 32bit address for the kernel command line
in the bzImage header.
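
For reference, a minimal excerpt of the 32-bit fields being referred to,
as they appear in the x86 boot protocol's struct setup_header (see
Documentation/x86/boot.txt); this is illustrative only and not part of
the patches in this thread:

struct setup_header {
	...
	__u32	ramdisk_image;	/* 32-bit physical address of the initrd */
	__u32	ramdisk_size;
	...
	__u32	cmd_line_ptr;	/* 32-bit physical address of the command line */
	...
} __attribute__((packed));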

In the case of kdump in particular there is a need for DMAable
memory and in general that means memory below 4G.  So as long
as we only support one memory extent for kdump it makes sense
for that segment to be below 4G.

For a normal x86_64 kernel which gets to use most of the memory it
definitely should be loadable anywhere in memory.

Eric

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 04/13] x86, mm: Revert back good_end setting for 64bit
  2012-10-06  0:28                     ` Eric W. Biederman
@ 2012-10-06  0:36                       ` H. Peter Anvin
  0 siblings, 0 replies; 57+ messages in thread
From: H. Peter Anvin @ 2012-10-06  0:36 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Yinghai Lu, Konrad Rzeszutek Wilk, Jacob Shin,
	Stefano Stabellini, Thomas Gleixner, Ingo Molnar, Tejun Heo,
	linux-kernel

On 10/05/2012 05:28 PM, Eric W. Biederman wrote:
>>
>> Seriously, any case where we can't load anywhere in physical ram on x86-64 is a
>> bug.  i386 is another matter.
>
> As I recall there are data structures like the IDT that only have a
> 32bit base address.
>

Not true.  The only one I know of is memory for the trampoline which has 
to be below 1M.  The < 1M space is already handled specially for good 
reason.

> According to the bzImage header we don't support ramdisks above 4G.
> I think we also have a 32bit address for the kernel command line
> in the bzImage header.

There are pointers in the bzImage header, that is true.  We can fix that 
problem, though, at least for entry via the 64-bit entry point.

> In the case of kdump in particular there is a need for DMAable
> memory and in general that means memory below 4G.  So as long
> as we only support one memory extent for kdump it makes sense
> for that segment to be below 4G.

"In general" meaning "no iotlb"?  In that case you have some unknown 
address space restriction which may or may not be 4G...

> For a normal x86_64 kernel which gets to use most of the memory it
> definitely should be loadable anywhere in memory.

Yes.  We should fix problems, like the limitations in the boot_params 
structure.

	-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 04/13] x86, mm: Revert back good_end setting for 64bit
  2012-10-06  0:18                       ` [PATCH 04/13] x86, mm: Revert back good_end setting for 64bit H. Peter Anvin
@ 2012-10-06  0:45                         ` Eric W. Biederman
  2012-10-06  1:02                           ` H. Peter Anvin
  0 siblings, 1 reply; 57+ messages in thread
From: Eric W. Biederman @ 2012-10-06  0:45 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Yinghai Lu, Konrad Rzeszutek Wilk, Jacob Shin,
	Stefano Stabellini, Thomas Gleixner, Ingo Molnar, Tejun Heo,
	linux-kernel

"H. Peter Anvin" <hpa@zytor.com> writes:

> On 10/05/2012 02:41 PM, Eric W. Biederman wrote:
>> Yinghai Lu <yinghai@kernel.org> writes:
>>
>>> with bzImage or vmlinux?
>>
>> bzImage I presume.  Certainly the bzImage has lost it's 896M limit,
>> which is where ultimiately the 896M limite came from.
>>
>
> ~896M (actually comes from i386, not from bzImage...

Right, it was 1G - VMALLOC_SIZE.

At some point that affected the boot protocol as the maximum address at
which we could load ramdisks.

Eric

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 04/13] x86, mm: Revert back good_end setting for 64bit
  2012-10-06  0:45                         ` Eric W. Biederman
@ 2012-10-06  1:02                           ` H. Peter Anvin
  0 siblings, 0 replies; 57+ messages in thread
From: H. Peter Anvin @ 2012-10-06  1:02 UTC (permalink / raw)
  To: ebiederm
  Cc: Yinghai Lu, Konrad Rzeszutek Wilk, Jacob Shin,
	Stefano Stabellini, Thomas Gleixner, Ingo Molnar, Tejun Heo,
	linux-kernel

That disappeared 10 years ago...

ebiederm@xmission.com wrote:

>"H. Peter Anvin" <hpa@zytor.com> writes:
>
>> On 10/05/2012 02:41 PM, Eric W. Biederman wrote:
>>> Yinghai Lu <yinghai@kernel.org> writes:
>>>
>>>> with bzImage or vmlinux?
>>>
>>> bzImage I presume.  Certainly the bzImage has lost it's 896M limit,
>>> which is where ultimiately the 896M limite came from.
>>>
>>
>> ~896M (actually comes from i386, not from bzImage...
>
>Right it was 1G - VMALLLOC_SIZE.
>
>At some point that affected the boot protocol as the maximum address we
>could load ramdisks.
>
>Eric

-- 
Sent from my mobile phone. Please excuse brevity and lack of formatting.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [PATCH 0/3] x86: pre mapping page table to make xen happy.
  2012-10-05 14:58                       ` Yinghai Lu
@ 2012-10-06  7:44                         ` Yinghai Lu
  2012-10-06  7:44                           ` [PATCH 1/3] x86: get early page table from BRK Yinghai Lu
                                             ` (2 more replies)
  2012-10-08  6:36                         ` [PATCH 04/13] x86, mm: Revert back good_end setting for 64bit Yinghai Lu
  1 sibling, 3 replies; 57+ messages in thread
From: Yinghai Lu @ 2012-10-06  7:44 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Jacob Shin, Tejun Heo
  Cc: Stefano Stabellini, linux-kernel, Yinghai Lu

On top of tip/x86/mm2.

Also remove the early_ioremap use in page table accessing.


Yinghai Lu (3):
  x86: get early page table from BRK
  x86, mm: Don't clear page table if next range is ram
  x86, mm: Remove early_memremap workaround for page table accessing

 arch/x86/include/asm/init.h    |    4 ++
 arch/x86/include/asm/pgtable.h |    1 +
 arch/x86/kernel/setup.c        |    2 +
 arch/x86/mm/init.c             |   25 ++++++++++++
 arch/x86/mm/init_32.c          |    8 +++-
 arch/x86/mm/init_64.c          |   85 ++++++++++++++--------------------------
 6 files changed, 67 insertions(+), 58 deletions(-)

-- 
1.7.7


^ permalink raw reply	[flat|nested] 57+ messages in thread

* [PATCH 1/3] x86: get early page table from BRK
  2012-10-06  7:44                         ` [PATCH 0/3] x86: pre mapping page table to make xen happy Yinghai Lu
@ 2012-10-06  7:44                           ` Yinghai Lu
  2012-10-08 12:09                             ` Stefano Stabellini
  2012-10-06  7:44                           ` [PATCH 2/3] x86, mm: Don't clear page table if next range is ram Yinghai Lu
  2012-10-06  7:44                           ` [PATCH 3/3] x86, mm: Remove early_memremap workaround for page table accessing Yinghai Lu
  2 siblings, 1 reply; 57+ messages in thread
From: Yinghai Lu @ 2012-10-06  7:44 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Jacob Shin, Tejun Heo
  Cc: Stefano Stabellini, linux-kernel, Yinghai Lu

Set pgt_buf early from BRK, and use it to map page tables at first.

Also use what is left of that buffer first, then use the new, extended one.

-v2: extra xen call back for that new range.

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
---
 arch/x86/include/asm/init.h    |    4 ++++
 arch/x86/include/asm/pgtable.h |    1 +
 arch/x86/kernel/setup.c        |    2 ++
 arch/x86/mm/init.c             |   25 +++++++++++++++++++++++++
 arch/x86/mm/init_32.c          |    8 ++++++--
 arch/x86/mm/init_64.c          |    8 ++++++--
 6 files changed, 44 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/init.h b/arch/x86/include/asm/init.h
index 4f13998..2f32eea 100644
--- a/arch/x86/include/asm/init.h
+++ b/arch/x86/include/asm/init.h
@@ -16,4 +16,8 @@ extern unsigned long __initdata pgt_buf_start;
 extern unsigned long __meminitdata pgt_buf_end;
 extern unsigned long __meminitdata pgt_buf_top;
 
+extern unsigned long __initdata early_pgt_buf_start;
+extern unsigned long __meminitdata early_pgt_buf_end;
+extern unsigned long __meminitdata early_pgt_buf_top;
+
 #endif /* _ASM_X86_INIT_32_H */
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 52d40a1..25fa5bb 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -599,6 +599,7 @@ static inline int pgd_none(pgd_t pgd)
 
 extern int direct_gbpages;
 void init_mem_mapping(void);
+void early_alloc_pgt_buf(void);
 
 /* local pte updates need not use xchg for locking */
 static inline pte_t native_local_ptep_get_and_clear(pte_t *ptep)
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 4989f80..7eb6855 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -896,6 +896,8 @@ void __init setup_arch(char **cmdline_p)
 
 	reserve_ibft_region();
 
+	early_alloc_pgt_buf();
+
 	/*
 	 * Need to conclude brk, before memblock_x86_fill()
 	 *  it could use memblock_find_in_range, could overlap with
diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index cf662ba..c32eed1 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -21,6 +21,10 @@ unsigned long __initdata pgt_buf_start;
 unsigned long __meminitdata pgt_buf_end;
 unsigned long __meminitdata pgt_buf_top;
 
+unsigned long __initdata early_pgt_buf_start;
+unsigned long __meminitdata early_pgt_buf_end;
+unsigned long __meminitdata early_pgt_buf_top;
+
 int after_bootmem;
 
 int direct_gbpages
@@ -291,6 +295,11 @@ static void __init find_early_table_space(unsigned long start,
 	if (!base)
 		panic("Cannot find space for the kernel page tables");
 
+	init_memory_mapping(base, base + tables);
+	printk(KERN_DEBUG "kernel direct mapping tables from %#llx to %#llx @ [mem %#010lx-%#010lx]\n",
+		base, base + tables - 1, early_pgt_buf_start << PAGE_SHIFT,
+		(early_pgt_buf_end << PAGE_SHIFT) - 1);
+
 	pgt_buf_start = base >> PAGE_SHIFT;
 	pgt_buf_end = pgt_buf_start;
 	pgt_buf_top = pgt_buf_start + (tables >> PAGE_SHIFT);
@@ -430,6 +439,8 @@ void __init init_mem_mapping(void)
 		x86_init.mapping.pagetable_reserve(PFN_PHYS(pgt_buf_start),
 				PFN_PHYS(pgt_buf_end));
 	}
+	x86_init.mapping.pagetable_reserve(PFN_PHYS(early_pgt_buf_start),
+					PFN_PHYS(early_pgt_buf_end));
 
 	/* stop the wrong using */
 	pgt_buf_top = 0;
@@ -437,6 +448,20 @@ void __init init_mem_mapping(void)
 	early_memtest(0, max_pfn_mapped << PAGE_SHIFT);
 }
 
+RESERVE_BRK(early_pgt_alloc, 16384);
+
+void  __init early_alloc_pgt_buf(void)
+{
+	unsigned long tables = 16384;
+	phys_addr_t base;
+
+	base = __pa(extend_brk(tables, PAGE_SIZE));
+
+	early_pgt_buf_start = base >> PAGE_SHIFT;
+	early_pgt_buf_end = early_pgt_buf_start;
+	early_pgt_buf_top = early_pgt_buf_start + (tables >> PAGE_SHIFT);
+}
+
 /*
  * devmem_is_allowed() checks to see if /dev/mem access to a certain address
  * is valid. The argument is a physical page number.
diff --git a/arch/x86/mm/init_32.c b/arch/x86/mm/init_32.c
index 11a5800..92c0f12 100644
--- a/arch/x86/mm/init_32.c
+++ b/arch/x86/mm/init_32.c
@@ -61,10 +61,14 @@ bool __read_mostly __vmalloc_start_set = false;
 
 static __init void *alloc_low_page(void)
 {
-	unsigned long pfn = pgt_buf_end++;
+	unsigned long pfn;
 	void *adr;
 
-	if (pfn >= pgt_buf_top)
+	if (early_pgt_buf_end < early_pgt_buf_top)
+		pfn = early_pgt_buf_end++;
+	else if (pgt_buf_end < pgt_buf_top)
+		pfn = pgt_buf_end++;
+	else
 		panic("alloc_low_page: ran out of memory");
 
 	adr = __va(pfn * PAGE_SIZE);
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index ab558eb..5375cf0 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -316,7 +316,7 @@ void __init cleanup_highmap(void)
 
 static __ref void *alloc_low_page(unsigned long *phys)
 {
-	unsigned long pfn = pgt_buf_end++;
+	unsigned long pfn;
 	void *adr;
 
 	if (after_bootmem) {
@@ -326,7 +326,11 @@ static __ref void *alloc_low_page(unsigned long *phys)
 		return adr;
 	}
 
-	if (pfn >= pgt_buf_top)
+	if (early_pgt_buf_end < early_pgt_buf_top)
+		pfn = early_pgt_buf_end++;
+	else if (pgt_buf_end < pgt_buf_top)
+		pfn = pgt_buf_end++;
+	else
 		panic("alloc_low_page: ran out of memory");
 
 	adr = early_memremap(pfn * PAGE_SIZE, PAGE_SIZE);
-- 
1.7.7


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH 2/3] x86, mm: Don't clear page table if next range is ram
  2012-10-06  7:44                         ` [PATCH 0/3] x86: pre mapping page table to make xen happy Yinghai Lu
  2012-10-06  7:44                           ` [PATCH 1/3] x86: get early page table from BRK Yinghai Lu
@ 2012-10-06  7:44                           ` Yinghai Lu
  2012-10-09 15:46                             ` Konrad Rzeszutek Wilk
  2012-10-06  7:44                           ` [PATCH 3/3] x86, mm: Remove early_memremap workaround for page table accessing Yinghai Lu
  2 siblings, 1 reply; 57+ messages in thread
From: Yinghai Lu @ 2012-10-06  7:44 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Jacob Shin, Tejun Heo
  Cc: Stefano Stabellini, linux-kernel, Yinghai Lu

While adding the code that uses the BRK area to map the buffer for the
final page table, it looked safe to remove early_memremap for page
table accessing, but we got a panic after doing so.

It turns out we wrongly clear the initial page table for the next range
when ranges are separated by holes, and it only happens when we try to
map the ranges one by one.

Fix the problem by checking (with e820_any_mapped) before clearing the
related page table entries.

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
---
 arch/x86/mm/init_64.c |   39 +++++++++++++++++++--------------------
 1 files changed, 19 insertions(+), 20 deletions(-)

diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 5375cf0..0348a02 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -367,20 +367,21 @@ static unsigned long __meminit
 phys_pte_init(pte_t *pte_page, unsigned long addr, unsigned long end,
 	      pgprot_t prot)
 {
-	unsigned pages = 0;
+	unsigned long pages = 0, next;
 	unsigned long last_map_addr = end;
 	int i;
 
 	pte_t *pte = pte_page + pte_index(addr);
 
-	for(i = pte_index(addr); i < PTRS_PER_PTE; i++, addr += PAGE_SIZE, pte++) {
+	for (i = pte_index(addr); i < PTRS_PER_PTE; i++, addr = next, pte++) {
 
+		next = (addr & PAGE_MASK) + PAGE_SIZE;
 		if (addr >= end) {
-			if (!after_bootmem) {
-				for(; i < PTRS_PER_PTE; i++, pte++)
-					set_pte(pte, __pte(0));
-			}
-			break;
+			if (!after_bootmem &&
+			    addr < (2UL<<30) &&
+			    !e820_any_mapped(addr & PAGE_MASK, next, 0))
+				set_pte(pte, __pte(0));
+			continue;
 		}
 
 		/*
@@ -422,16 +423,15 @@ phys_pmd_init(pmd_t *pmd_page, unsigned long address, unsigned long end,
 		pte_t *pte;
 		pgprot_t new_prot = prot;
 
+		next = (address & PMD_MASK) + PMD_SIZE;
 		if (address >= end) {
-			if (!after_bootmem) {
-				for (; i < PTRS_PER_PMD; i++, pmd++)
-					set_pmd(pmd, __pmd(0));
-			}
-			break;
+			if (!after_bootmem &&
+			    address < (2UL<<30) &&
+			    !e820_any_mapped(address & PMD_MASK, next, 0))
+				set_pmd(pmd, __pmd(0));
+			continue;
 		}
 
-		next = (address & PMD_MASK) + PMD_SIZE;
-
 		if (pmd_val(*pmd)) {
 			if (!pmd_large(*pmd)) {
 				spin_lock(&init_mm.page_table_lock);
@@ -498,13 +498,12 @@ phys_pud_init(pud_t *pud_page, unsigned long addr, unsigned long end,
 		pmd_t *pmd;
 		pgprot_t prot = PAGE_KERNEL;
 
-		if (addr >= end)
-			break;
-
 		next = (addr & PUD_MASK) + PUD_SIZE;
-
-		if (!after_bootmem && !e820_any_mapped(addr, next, 0)) {
-			set_pud(pud, __pud(0));
+		if (addr >= end) {
+			if (!after_bootmem &&
+			    addr < (2UL<<30) &&
+			    !e820_any_mapped(addr & PUD_MASK, next, 0))
+				set_pud(pud, __pud(0));
 			continue;
 		}
 
-- 
1.7.7


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH 3/3] x86, mm: Remove early_memremap workaround for page table accessing
  2012-10-06  7:44                         ` [PATCH 0/3] x86: pre mapping page table to make xen happy Yinghai Lu
  2012-10-06  7:44                           ` [PATCH 1/3] x86: get early page table from BRK Yinghai Lu
  2012-10-06  7:44                           ` [PATCH 2/3] x86, mm: Don't clear page table if next range is ram Yinghai Lu
@ 2012-10-06  7:44                           ` Yinghai Lu
  2012-10-09 15:48                             ` Konrad Rzeszutek Wilk
  2 siblings, 1 reply; 57+ messages in thread
From: Yinghai Lu @ 2012-10-06  7:44 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Jacob Shin, Tejun Heo
  Cc: Stefano Stabellini, linux-kernel, Yinghai Lu

Not needed anymore after pre-mapping the page table buffer and no longer
clearing the initial page table wrongly.

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
---
 arch/x86/mm/init_64.c |   38 ++++----------------------------------
 1 files changed, 4 insertions(+), 34 deletions(-)

diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 0348a02..e59c94f 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -333,36 +333,12 @@ static __ref void *alloc_low_page(unsigned long *phys)
 	else
 		panic("alloc_low_page: ran out of memory");
 
-	adr = early_memremap(pfn * PAGE_SIZE, PAGE_SIZE);
+	adr = __va(pfn * PAGE_SIZE);
 	clear_page(adr);
 	*phys  = pfn * PAGE_SIZE;
 	return adr;
 }
 
-static __ref void *map_low_page(void *virt)
-{
-	void *adr;
-	unsigned long phys, left;
-
-	if (after_bootmem)
-		return virt;
-
-	phys = __pa(virt);
-	left = phys & (PAGE_SIZE - 1);
-	adr = early_memremap(phys & PAGE_MASK, PAGE_SIZE);
-	adr = (void *)(((unsigned long)adr) | left);
-
-	return adr;
-}
-
-static __ref void unmap_low_page(void *adr)
-{
-	if (after_bootmem)
-		return;
-
-	early_iounmap((void *)((unsigned long)adr & PAGE_MASK), PAGE_SIZE);
-}
-
 static unsigned long __meminit
 phys_pte_init(pte_t *pte_page, unsigned long addr, unsigned long end,
 	      pgprot_t prot)
@@ -435,10 +411,9 @@ phys_pmd_init(pmd_t *pmd_page, unsigned long address, unsigned long end,
 		if (pmd_val(*pmd)) {
 			if (!pmd_large(*pmd)) {
 				spin_lock(&init_mm.page_table_lock);
-				pte = map_low_page((pte_t *)pmd_page_vaddr(*pmd));
+				pte = (pte_t *)pmd_page_vaddr(*pmd);
 				last_map_addr = phys_pte_init(pte, address,
 								end, prot);
-				unmap_low_page(pte);
 				spin_unlock(&init_mm.page_table_lock);
 				continue;
 			}
@@ -474,7 +449,6 @@ phys_pmd_init(pmd_t *pmd_page, unsigned long address, unsigned long end,
 
 		pte = alloc_low_page(&pte_phys);
 		last_map_addr = phys_pte_init(pte, address, end, new_prot);
-		unmap_low_page(pte);
 
 		spin_lock(&init_mm.page_table_lock);
 		pmd_populate_kernel(&init_mm, pmd, __va(pte_phys));
@@ -509,10 +483,9 @@ phys_pud_init(pud_t *pud_page, unsigned long addr, unsigned long end,
 
 		if (pud_val(*pud)) {
 			if (!pud_large(*pud)) {
-				pmd = map_low_page(pmd_offset(pud, 0));
+				pmd = pmd_offset(pud, 0);
 				last_map_addr = phys_pmd_init(pmd, addr, end,
 							 page_size_mask, prot);
-				unmap_low_page(pmd);
 				__flush_tlb_all();
 				continue;
 			}
@@ -548,7 +521,6 @@ phys_pud_init(pud_t *pud_page, unsigned long addr, unsigned long end,
 		pmd = alloc_low_page(&pmd_phys);
 		last_map_addr = phys_pmd_init(pmd, addr, end, page_size_mask,
 					      prot);
-		unmap_low_page(pmd);
 
 		spin_lock(&init_mm.page_table_lock);
 		pud_populate(&init_mm, pud, __va(pmd_phys));
@@ -584,17 +556,15 @@ kernel_physical_mapping_init(unsigned long start,
 			next = end;
 
 		if (pgd_val(*pgd)) {
-			pud = map_low_page((pud_t *)pgd_page_vaddr(*pgd));
+			pud = (pud_t *)pgd_page_vaddr(*pgd);
 			last_map_addr = phys_pud_init(pud, __pa(start),
 						 __pa(end), page_size_mask);
-			unmap_low_page(pud);
 			continue;
 		}
 
 		pud = alloc_low_page(&pud_phys);
 		last_map_addr = phys_pud_init(pud, __pa(start), __pa(next),
 						 page_size_mask);
-		unmap_low_page(pud);
 
 		spin_lock(&init_mm.page_table_lock);
 		pgd_populate(&init_mm, pgd, __va(pud_phys));
-- 
1.7.7


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* Re: [PATCH 04/13] x86, mm: Revert back good_end setting for 64bit
  2012-10-05 14:58                       ` Yinghai Lu
  2012-10-06  7:44                         ` [PATCH 0/3] x86: pre mapping page table to make xen happy Yinghai Lu
@ 2012-10-08  6:36                         ` Yinghai Lu
  1 sibling, 0 replies; 57+ messages in thread
From: Yinghai Lu @ 2012-10-08  6:36 UTC (permalink / raw)
  To: Stefano Stabellini
  Cc: H. Peter Anvin, Konrad Rzeszutek Wilk, Thomas Gleixner,
	Ingo Molnar, Jacob Shin, Tejun Heo, linux-kernel

On Fri, Oct 5, 2012 at 7:58 AM, Yinghai Lu <yinghai@kernel.org> wrote:
> On Fri, Oct 5, 2012 at 4:27 AM, Stefano Stabellini
> <stefano.stabellini@eu.citrix.com> wrote:
>> On Fri, 5 Oct 2012, Yinghai Lu wrote:
>>> On Thu, Oct 4, 2012 at 2:54 PM, H. Peter Anvin <hpa@zytor.com> wrote:
>>> >
>>> > See my other post.  This is bringing up the Kernel Summit algorithm again.
>>> >
>>>
>>> sure. please check if you are ok with attached one on top of x86/mm2

I updated my for-x86-mm branch; it should be OK now with Xen.

Can you please check it?

	git://git.kernel.org/pub/scm/linux/kernel/git/yinghai/linux-yinghai.git
for-x86-mm

In addition to the patches in tip x86/mm2, there are 5 new patches added.

688d437: x86, mm: Use big page for small memory range
41e562f: x86, mm: only keep initial mapping for ram
24ac352: x86, mm: Remove early_memremap workaround for page table accessing
d9dd599: x86, mm: Don't clear page table if next range is ram
080acbe: x86, mm: get early page table from BRK

Thanks

Yinghai

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 1/3] x86: get early page table from BRK
  2012-10-06  7:44                           ` [PATCH 1/3] x86: get early page table from BRK Yinghai Lu
@ 2012-10-08 12:09                             ` Stefano Stabellini
  0 siblings, 0 replies; 57+ messages in thread
From: Stefano Stabellini @ 2012-10-08 12:09 UTC (permalink / raw)
  To: Yinghai Lu
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Jacob Shin,
	Tejun Heo, Stefano Stabellini, linux-kernel

On Sat, 6 Oct 2012, Yinghai Lu wrote:
> set pgt_buf early from BRK, and use it to map page table at first.
> 
> also use the left at first, then use new extend one.
> 
> -v2: extra xen call back for that new range.
> 
> Signed-off-by: Yinghai Lu <yinghai@kernel.org>
> ---
>  arch/x86/include/asm/init.h    |    4 ++++
>  arch/x86/include/asm/pgtable.h |    1 +
>  arch/x86/kernel/setup.c        |    2 ++
>  arch/x86/mm/init.c             |   25 +++++++++++++++++++++++++
>  arch/x86/mm/init_32.c          |    8 ++++++--
>  arch/x86/mm/init_64.c          |    8 ++++++--
>  6 files changed, 44 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/x86/include/asm/init.h b/arch/x86/include/asm/init.h
> index 4f13998..2f32eea 100644
> --- a/arch/x86/include/asm/init.h
> +++ b/arch/x86/include/asm/init.h
> @@ -16,4 +16,8 @@ extern unsigned long __initdata pgt_buf_start;
>  extern unsigned long __meminitdata pgt_buf_end;
>  extern unsigned long __meminitdata pgt_buf_top;
>  
> +extern unsigned long __initdata early_pgt_buf_start;
> +extern unsigned long __meminitdata early_pgt_buf_end;
> +extern unsigned long __meminitdata early_pgt_buf_top;
> +
>  #endif /* _ASM_X86_INIT_32_H */
> diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
> index 52d40a1..25fa5bb 100644
> --- a/arch/x86/include/asm/pgtable.h
> +++ b/arch/x86/include/asm/pgtable.h
> @@ -599,6 +599,7 @@ static inline int pgd_none(pgd_t pgd)
>  
>  extern int direct_gbpages;
>  void init_mem_mapping(void);
> +void early_alloc_pgt_buf(void);
>  
>  /* local pte updates need not use xchg for locking */
>  static inline pte_t native_local_ptep_get_and_clear(pte_t *ptep)
> diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
> index 4989f80..7eb6855 100644
> --- a/arch/x86/kernel/setup.c
> +++ b/arch/x86/kernel/setup.c
> @@ -896,6 +896,8 @@ void __init setup_arch(char **cmdline_p)
>  
>  	reserve_ibft_region();
>  
> +	early_alloc_pgt_buf();
> +
>  	/*
>  	 * Need to conclude brk, before memblock_x86_fill()
>  	 *  it could use memblock_find_in_range, could overlap with
> diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
> index cf662ba..c32eed1 100644
> --- a/arch/x86/mm/init.c
> +++ b/arch/x86/mm/init.c
> @@ -21,6 +21,10 @@ unsigned long __initdata pgt_buf_start;
>  unsigned long __meminitdata pgt_buf_end;
>  unsigned long __meminitdata pgt_buf_top;
>  
> +unsigned long __initdata early_pgt_buf_start;
> +unsigned long __meminitdata early_pgt_buf_end;
> +unsigned long __meminitdata early_pgt_buf_top;
> +
>  int after_bootmem;
>  
>  int direct_gbpages
> @@ -291,6 +295,11 @@ static void __init find_early_table_space(unsigned long start,
>  	if (!base)
>  		panic("Cannot find space for the kernel page tables");
>  
> +	init_memory_mapping(base, base + tables);
> +	printk(KERN_DEBUG "kernel direct mapping tables from %#llx to %#llx @ [mem %#010lx-%#010lx]\n",
> +		base, base + tables - 1, early_pgt_buf_start << PAGE_SHIFT,
> +		(early_pgt_buf_end << PAGE_SHIFT) - 1);
> +
>  	pgt_buf_start = base >> PAGE_SHIFT;
>  	pgt_buf_end = pgt_buf_start;
>  	pgt_buf_top = pgt_buf_start + (tables >> PAGE_SHIFT);
> @@ -430,6 +439,8 @@ void __init init_mem_mapping(void)
>  		x86_init.mapping.pagetable_reserve(PFN_PHYS(pgt_buf_start),
>  				PFN_PHYS(pgt_buf_end));
>  	}
> +	x86_init.mapping.pagetable_reserve(PFN_PHYS(early_pgt_buf_start),
> +					PFN_PHYS(early_pgt_buf_end));

pagetable_reserve is not the right hook: pagetable_reserve tells the
subsystem that the memory range you are passing is going to be used for
pagetable pages. It is used to reserve that range using
memblock_reserve. On Xen it is also used to mark RW any pages _outside_
that range that have been marked RO: implicitly we assume that the full
range is pgt_buf_start-pgt_buf_top and we mark it RO (see the Xen memory
constraints on pagetable pages, as described by Konrad).

Calling pagetable_reserve(real_start, real_end) reserves
real_start-real_end as pagetable pages and frees
pgt_buf_start-real_start and real_end-pgt_buf_top.

So the problem is that at the moment we don't have a hook to say: "the
range of pagetable pages is pgt_buf_start-pgt_buf_top". In fact, if you
take a look at arch/x86/xen/mmu.c you'll find a few references to
pgt_buf_start, pgt_buf_end and pgt_buf_top that shouldn't really be there.
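
As a rough illustration of the semantics described above, here is a
simplified sketch (not the actual tree code: the Xen-side function name
and body below are paraphrased from the description, and the native
hook is reproduced from memory for this era of the tree):

/* Native side: just reserve the pages actually used for page tables. */
void __init native_pagetable_reserve(u64 start, u64 end)
{
	memblock_reserve(start, end - start);
}

/*
 * Xen side, simplified: the whole pgt_buf_start-pgt_buf_top range was
 * marked RO up front, so once the range really used for page tables is
 * known, the unused tail has to be flipped back to RW.
 */
static void __init xen_pagetable_reserve_sketch(u64 start, u64 end)
{
	native_pagetable_reserve(start, end);

	/* mark the unused part of the pre-allocated buffer RW again */
	while (end < PFN_PHYS(pgt_buf_top)) {
		make_lowmem_page_readwrite(__va(end));
		end += PAGE_SIZE;
	}
}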

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 2/3] x86, mm: Don't clear page table if next range is ram
  2012-10-06  7:44                           ` [PATCH 2/3] x86, mm: Don't clear page table if next range is ram Yinghai Lu
@ 2012-10-09 15:46                             ` Konrad Rzeszutek Wilk
  2012-10-10  1:00                               ` Yinghai Lu
  0 siblings, 1 reply; 57+ messages in thread
From: Konrad Rzeszutek Wilk @ 2012-10-09 15:46 UTC (permalink / raw)
  To: Yinghai Lu
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Jacob Shin,
	Tejun Heo, Stefano Stabellini, linux-kernel

On Sat, Oct 06, 2012 at 12:44:28AM -0700, Yinghai Lu wrote:
> During adding code from BRK to map buffer for final page table,
> 
> It should be safe to remove early_memmap for page table accessing.
> 
> But get panic after that.
> 
> It turns out we clear the initial page table wrongly for next range
> that is separated by holes.

Where are the holes? Are these E820 holes?

> And it only happens when we are trying to map range one by one range.
> 
> After checking before clearing the related page table to fix the problem.

Huh?
> 
> Signed-off-by: Yinghai Lu <yinghai@kernel.org>
> ---
>  arch/x86/mm/init_64.c |   39 +++++++++++++++++++--------------------
>  1 files changed, 19 insertions(+), 20 deletions(-)
> 
> diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
> index 5375cf0..0348a02 100644
> --- a/arch/x86/mm/init_64.c
> +++ b/arch/x86/mm/init_64.c
> @@ -367,20 +367,21 @@ static unsigned long __meminit
>  phys_pte_init(pte_t *pte_page, unsigned long addr, unsigned long end,
>  	      pgprot_t prot)
>  {
> -	unsigned pages = 0;
> +	unsigned long pages = 0, next;
>  	unsigned long last_map_addr = end;
>  	int i;
>  
>  	pte_t *pte = pte_page + pte_index(addr);
>  
> -	for(i = pte_index(addr); i < PTRS_PER_PTE; i++, addr += PAGE_SIZE, pte++) {
> +	for (i = pte_index(addr); i < PTRS_PER_PTE; i++, addr = next, pte++) {
>  
> +		next = (addr & PAGE_MASK) + PAGE_SIZE;
>  		if (addr >= end) {
> -			if (!after_bootmem) {
> -				for(; i < PTRS_PER_PTE; i++, pte++)
> -					set_pte(pte, __pte(0));
> -			}
> -			break;
> +			if (!after_bootmem &&
> +			    addr < (2UL<<30) &&

Why 2G?
> +			    !e820_any_mapped(addr & PAGE_MASK, next, 0))

What is the 0 parameter for?
> +				set_pte(pte, __pte(0));
> +			continue;
>  		}
>  
>  		/*
> @@ -422,16 +423,15 @@ phys_pmd_init(pmd_t *pmd_page, unsigned long address, unsigned long end,
>  		pte_t *pte;
>  		pgprot_t new_prot = prot;
>  
> +		next = (address & PMD_MASK) + PMD_SIZE;
>  		if (address >= end) {
> -			if (!after_bootmem) {
> -				for (; i < PTRS_PER_PMD; i++, pmd++)
> -					set_pmd(pmd, __pmd(0));
> -			}
> -			break;
> +			if (!after_bootmem &&
> +			    address < (2UL<<30) &&
> +			    !e820_any_mapped(address & PMD_MASK, next, 0))
> +				set_pmd(pmd, __pmd(0));
> +			continue;
>  		}
>  
> -		next = (address & PMD_MASK) + PMD_SIZE;
> -
>  		if (pmd_val(*pmd)) {
>  			if (!pmd_large(*pmd)) {
>  				spin_lock(&init_mm.page_table_lock);
> @@ -498,13 +498,12 @@ phys_pud_init(pud_t *pud_page, unsigned long addr, unsigned long end,
>  		pmd_t *pmd;
>  		pgprot_t prot = PAGE_KERNEL;
>  
> -		if (addr >= end)
> -			break;

Why do you get rid of that?
> -
>  		next = (addr & PUD_MASK) + PUD_SIZE;
> -
> -		if (!after_bootmem && !e820_any_mapped(addr, next, 0)) {
> -			set_pud(pud, __pud(0));
> +		if (addr >= end) {
> +			if (!after_bootmem &&
> +			    addr < (2UL<<30) &&
> +			    !e820_any_mapped(addr & PUD_MASK, next, 0))
> +				set_pud(pud, __pud(0));
>  			continue;
>  		}
>  
> -- 
> 1.7.7
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
> 

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 3/3] x86, mm: Remove early_memremap workaround for page table accessing
  2012-10-06  7:44                           ` [PATCH 3/3] x86, mm: Remove early_memremap workaround for page table accessing Yinghai Lu
@ 2012-10-09 15:48                             ` Konrad Rzeszutek Wilk
  0 siblings, 0 replies; 57+ messages in thread
From: Konrad Rzeszutek Wilk @ 2012-10-09 15:48 UTC (permalink / raw)
  To: Yinghai Lu
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Jacob Shin,
	Tejun Heo, Stefano Stabellini, linux-kernel

On Sat, Oct 06, 2012 at 12:44:29AM -0700, Yinghai Lu wrote:
> Not needed anymore after premaping page table buf and not clear initial page
> table wrongly.

Your comment should include what patch made the iomap/iounmap part
unnecessary.

.. and also explain how this work-around is not required anymore.

> 
> Signed-off-by: Yinghai Lu <yinghai@kernel.org>
> ---
>  arch/x86/mm/init_64.c |   38 ++++----------------------------------
>  1 files changed, 4 insertions(+), 34 deletions(-)
> 
> diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
> index 0348a02..e59c94f 100644
> --- a/arch/x86/mm/init_64.c
> +++ b/arch/x86/mm/init_64.c
> @@ -333,36 +333,12 @@ static __ref void *alloc_low_page(unsigned long *phys)
>  	else
>  		panic("alloc_low_page: ran out of memory");
>  
> -	adr = early_memremap(pfn * PAGE_SIZE, PAGE_SIZE);
> +	adr = __va(pfn * PAGE_SIZE);
>  	clear_page(adr);
>  	*phys  = pfn * PAGE_SIZE;
>  	return adr;
>  }
>  
> -static __ref void *map_low_page(void *virt)
> -{
> -	void *adr;
> -	unsigned long phys, left;
> -
> -	if (after_bootmem)
> -		return virt;
> -
> -	phys = __pa(virt);
> -	left = phys & (PAGE_SIZE - 1);
> -	adr = early_memremap(phys & PAGE_MASK, PAGE_SIZE);
> -	adr = (void *)(((unsigned long)adr) | left);
> -
> -	return adr;
> -}
> -
> -static __ref void unmap_low_page(void *adr)
> -{
> -	if (after_bootmem)
> -		return;
> -
> -	early_iounmap((void *)((unsigned long)adr & PAGE_MASK), PAGE_SIZE);
> -}
> -
>  static unsigned long __meminit
>  phys_pte_init(pte_t *pte_page, unsigned long addr, unsigned long end,
>  	      pgprot_t prot)
> @@ -435,10 +411,9 @@ phys_pmd_init(pmd_t *pmd_page, unsigned long address, unsigned long end,
>  		if (pmd_val(*pmd)) {
>  			if (!pmd_large(*pmd)) {
>  				spin_lock(&init_mm.page_table_lock);
> -				pte = map_low_page((pte_t *)pmd_page_vaddr(*pmd));
> +				pte = (pte_t *)pmd_page_vaddr(*pmd);
>  				last_map_addr = phys_pte_init(pte, address,
>  								end, prot);
> -				unmap_low_page(pte);
>  				spin_unlock(&init_mm.page_table_lock);
>  				continue;
>  			}
> @@ -474,7 +449,6 @@ phys_pmd_init(pmd_t *pmd_page, unsigned long address, unsigned long end,
>  
>  		pte = alloc_low_page(&pte_phys);
>  		last_map_addr = phys_pte_init(pte, address, end, new_prot);
> -		unmap_low_page(pte);
>  
>  		spin_lock(&init_mm.page_table_lock);
>  		pmd_populate_kernel(&init_mm, pmd, __va(pte_phys));
> @@ -509,10 +483,9 @@ phys_pud_init(pud_t *pud_page, unsigned long addr, unsigned long end,
>  
>  		if (pud_val(*pud)) {
>  			if (!pud_large(*pud)) {
> -				pmd = map_low_page(pmd_offset(pud, 0));
> +				pmd = pmd_offset(pud, 0);
>  				last_map_addr = phys_pmd_init(pmd, addr, end,
>  							 page_size_mask, prot);
> -				unmap_low_page(pmd);
>  				__flush_tlb_all();
>  				continue;
>  			}
> @@ -548,7 +521,6 @@ phys_pud_init(pud_t *pud_page, unsigned long addr, unsigned long end,
>  		pmd = alloc_low_page(&pmd_phys);
>  		last_map_addr = phys_pmd_init(pmd, addr, end, page_size_mask,
>  					      prot);
> -		unmap_low_page(pmd);
>  
>  		spin_lock(&init_mm.page_table_lock);
>  		pud_populate(&init_mm, pud, __va(pmd_phys));
> @@ -584,17 +556,15 @@ kernel_physical_mapping_init(unsigned long start,
>  			next = end;
>  
>  		if (pgd_val(*pgd)) {
> -			pud = map_low_page((pud_t *)pgd_page_vaddr(*pgd));
> +			pud = (pud_t *)pgd_page_vaddr(*pgd);
>  			last_map_addr = phys_pud_init(pud, __pa(start),
>  						 __pa(end), page_size_mask);
> -			unmap_low_page(pud);
>  			continue;
>  		}
>  
>  		pud = alloc_low_page(&pud_phys);
>  		last_map_addr = phys_pud_init(pud, __pa(start), __pa(next),
>  						 page_size_mask);
> -		unmap_low_page(pud);
>  
>  		spin_lock(&init_mm.page_table_lock);
>  		pgd_populate(&init_mm, pgd, __va(pud_phys));
> -- 
> 1.7.7
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
> 

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 2/3] x86, mm: Don't clear page table if next range is ram
  2012-10-09 15:46                             ` Konrad Rzeszutek Wilk
@ 2012-10-10  1:00                               ` Yinghai Lu
  2012-10-10 13:41                                 ` Konrad Rzeszutek Wilk
  0 siblings, 1 reply; 57+ messages in thread
From: Yinghai Lu @ 2012-10-10  1:00 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Jacob Shin,
	Tejun Heo, Stefano Stabellini, linux-kernel

On Tue, Oct 9, 2012 at 8:46 AM, Konrad Rzeszutek Wilk <konrad@kernel.org> wrote:
>> +                         !e820_any_mapped(addr & PAGE_MASK, next, 0))
>
> What is the 0 parameter for?

Any type.

If type != 0, it will only check entries with the same type.

int
e820_any_mapped(u64 start, u64 end, unsigned type)
{
        int i;

        for (i = 0; i < e820.nr_map; i++) {
                struct e820entry *ei = &e820.map[i];

                if (type && ei->type != type)
                        continue;
                if (ei->addr >= end || ei->addr + ei->size <= start)
                        continue;
                return 1;
        }
        return 0;
}

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 2/3] x86, mm: Don't clear page table if next range is ram
  2012-10-10  1:00                               ` Yinghai Lu
@ 2012-10-10 13:41                                 ` Konrad Rzeszutek Wilk
  2012-10-10 14:43                                   ` Yinghai Lu
  0 siblings, 1 reply; 57+ messages in thread
From: Konrad Rzeszutek Wilk @ 2012-10-10 13:41 UTC (permalink / raw)
  To: Yinghai Lu
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Jacob Shin,
	Tejun Heo, Stefano Stabellini, linux-kernel

On Tue, Oct 09, 2012 at 06:00:12PM -0700, Yinghai Lu wrote:
> On Tue, Oct 9, 2012 at 8:46 AM, Konrad Rzeszutek Wilk <konrad@kernel.org> wrote:
> >> +                         !e820_any_mapped(addr & PAGE_MASK, next, 0))
> >
> > What is the 0 parameter for?
> 
> any type

OK, which means that it either should have a #define for it, or at least
a comment, like:

0 /* any type */

as this would make it clear at first glance what it is, without having
to dive into the e820_any_mapped function to determine that.

> 
> if type != 0, the will only check entries with same type.
> 
> int
> e820_any_mapped(u64 start, u64 end, unsigned type)
> {
>         int i;
> 
>         for (i = 0; i < e820.nr_map; i++) {
>                 struct e820entry *ei = &e820.map[i];
> 
>                 if (type && ei->type != type)
>                         continue;
>                 if (ei->addr >= end || ei->addr + ei->size <= start)
>                         continue;
>                 return 1;
>         }
>         return 0;
> }
> 

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 2/3] x86, mm: Don't clear page table if next range is ram
  2012-10-10 13:41                                 ` Konrad Rzeszutek Wilk
@ 2012-10-10 14:43                                   ` Yinghai Lu
  0 siblings, 0 replies; 57+ messages in thread
From: Yinghai Lu @ 2012-10-10 14:43 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Jacob Shin,
	Tejun Heo, Stefano Stabellini, linux-kernel

On Wed, Oct 10, 2012 at 6:41 AM, Konrad Rzeszutek Wilk
<konrad@kernel.org> wrote:
> On Tue, Oct 09, 2012 at 06:00:12PM -0700, Yinghai Lu wrote:
>> On Tue, Oct 9, 2012 at 8:46 AM, Konrad Rzeszutek Wilk <konrad@kernel.org> wrote:
>> >> +                         !e820_any_mapped(addr & PAGE_MASK, next, 0))
>> >
>> > What is the 0 parameter for?
>>
>> any type
>
> OK, which means that it either should have a #define for it, or at least
> a comment, like:
>
> 0 /* any type */
>
> as this would make it clear at first glance what it is - without having
> to dive in e820_any_mapped function to determine that.

Yes, we should add E820_ANY_TYPE and update e820_any_mapped to use it.

I will address that later in another patch.
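
Something along these lines, as a hypothetical sketch of that future
patch (E820_ANY_TYPE is the name proposed above; the rest mirrors the
e820_any_mapped code quoted earlier in this thread):

/* Matches any e820 entry type when passed to e820_any_mapped(). */
#define E820_ANY_TYPE	0

int e820_any_mapped(u64 start, u64 end, unsigned type)
{
	int i;

	for (i = 0; i < e820.nr_map; i++) {
		struct e820entry *ei = &e820.map[i];

		if (type != E820_ANY_TYPE && ei->type != type)
			continue;
		if (ei->addr >= end || ei->addr + ei->size <= start)
			continue;
		return 1;
	}
	return 0;
}

Callers that do not care about the type would then pass E820_ANY_TYPE
instead of a bare 0, e.g. e820_any_mapped(addr & PAGE_MASK, next,
E820_ANY_TYPE).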

^ permalink raw reply	[flat|nested] 57+ messages in thread

end of thread, other threads:[~2012-10-10 14:43 UTC | newest]

Thread overview: 57+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-09-30  7:57 [PATCH -v4 00/13] x86, mm: init_memory_mapping cleanup Yinghai Lu
2012-09-30  7:57 ` [PATCH 01/13] x86, mm: Add global page_size_mask and probe one time only Yinghai Lu
2012-09-30  7:57 ` [PATCH 02/13] x86, mm: Split out split_mem_range from init_memory_mapping Yinghai Lu
2012-09-30  7:57 ` [PATCH 03/13] x86, mm: Move init_memory_mapping calling out of setup.c Yinghai Lu
2012-09-30  7:57 ` [PATCH 04/13] x86, mm: Revert back good_end setting for 64bit Yinghai Lu
2012-10-01 11:00   ` Stefano Stabellini
2012-10-03 16:51     ` Jacob Shin
2012-10-03 18:34       ` H. Peter Anvin
2012-10-04 13:56       ` Konrad Rzeszutek Wilk
2012-10-04 21:52         ` H. Peter Anvin
2012-10-04 16:19       ` Yinghai Lu
2012-10-04 16:46         ` Konrad Rzeszutek Wilk
2012-10-04 21:29           ` Yinghai Lu
2012-10-05 21:04             ` Eric W. Biederman
2012-10-05 21:19               ` Yinghai Lu
2012-10-05 21:32                 ` Eric W. Biederman
2012-10-05 21:37                   ` Yinghai Lu
2012-10-05 21:41                     ` Eric W. Biederman
2012-10-05 21:43                       ` Yinghai Lu
2012-10-05 22:01                         ` 896MB address limit (was: Re: [PATCH 04/13] x86, mm: Revert back good_end setting for 64bit) Eric W. Biederman
2012-10-06  0:18                       ` [PATCH 04/13] x86, mm: Revert back good_end setting for 64bit H. Peter Anvin
2012-10-06  0:45                         ` Eric W. Biederman
2012-10-06  1:02                           ` H. Peter Anvin
2012-10-06  0:17                   ` H. Peter Anvin
2012-10-06  0:28                     ` Eric W. Biederman
2012-10-06  0:36                       ` H. Peter Anvin
2012-10-04 15:57     ` Yinghai Lu
2012-10-04 16:45       ` Konrad Rzeszutek Wilk
2012-10-04 21:21         ` Yinghai Lu
2012-10-04 21:40           ` Yinghai Lu
2012-10-04 21:41             ` H. Peter Anvin
2012-10-04 21:46               ` Yinghai Lu
2012-10-04 21:54                 ` H. Peter Anvin
2012-10-05  7:46                   ` Yinghai Lu
2012-10-05 11:27                     ` Stefano Stabellini
2012-10-05 14:58                       ` Yinghai Lu
2012-10-06  7:44                         ` [PATCH 0/3] x86: pre mapping page table to make xen happy Yinghai Lu
2012-10-06  7:44                           ` [PATCH 1/3] x86: get early page table from BRK Yinghai Lu
2012-10-08 12:09                             ` Stefano Stabellini
2012-10-06  7:44                           ` [PATCH 2/3] x86, mm: Don't clear page table if next range is ram Yinghai Lu
2012-10-09 15:46                             ` Konrad Rzeszutek Wilk
2012-10-10  1:00                               ` Yinghai Lu
2012-10-10 13:41                                 ` Konrad Rzeszutek Wilk
2012-10-10 14:43                                   ` Yinghai Lu
2012-10-06  7:44                           ` [PATCH 3/3] x86, mm: Remove early_memremap workaround for page table accessing Yinghai Lu
2012-10-09 15:48                             ` Konrad Rzeszutek Wilk
2012-10-08  6:36                         ` [PATCH 04/13] x86, mm: Revert back good_end setting for 64bit Yinghai Lu
2012-10-05 10:47       ` Stefano Stabellini
2012-09-30  7:57 ` [PATCH 05/13] x86, mm: Find early page table buffer altogether Yinghai Lu
2012-09-30  7:57 ` [PATCH 06/13] x86, mm: Separate out calculate_table_space_size() Yinghai Lu
2012-09-30  7:57 ` [PATCH 07/13] x86, mm: Move down two calculate_table_space_size down Yinghai Lu
2012-09-30  7:57 ` [PATCH 08/13] x86, mm: Set memblock initial limit to 1M Yinghai Lu
2012-09-30  7:57 ` [PATCH 09/13] x86: if kernel .text .data .bss are not marked as E820_RAM, complain and fix Yinghai Lu
2012-09-30  7:57 ` [PATCH 10/13] x86: Fixup code testing if a pfn is direct mapped Yinghai Lu
2012-09-30  7:57 ` [PATCH 11/13] x86: Only direct map addresses that are marked as E820_RAM Yinghai Lu
2012-09-30  7:57 ` [PATCH 12/13] x86/mm: calculate_table_space_size based on memory ranges that are being mapped Yinghai Lu
2012-09-30  7:57 ` [PATCH 13/13] x86, mm: Use func pointer to table size calculation and mapping Yinghai Lu
