linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] [0/8] Various kernel mapping bug fixes
@ 2008-02-11  9:34 Andi Kleen
  2008-02-11  9:34 ` [PATCH] [1/8] CPA: Fix gbpages support in try_preserve_lage_page Andi Kleen
                   ` (7 more replies)
  0 siblings, 8 replies; 37+ messages in thread
From: Andi Kleen @ 2008-02-11  9:34 UTC (permalink / raw)
  To: tglx, mingo, linux-kernel


Relative to git-x86#mm 05afd57b4438d4650d93e6f54f465b0cdd2d9ea8) 

The first bug fix is the most important one -- it fixes a boot hang
on gbpages systems -- the rest is for random issues I found while looking 
at the code. Several are related to EFI (see caveats in the descriptions), 
which looks very broken currently without these patches, but definitely 
needs more testing.

-Andi

^ permalink raw reply	[flat|nested] 37+ messages in thread

* [PATCH] [1/8] CPA: Fix gbpages support in try_preserve_lage_page
  2008-02-11  9:34 [PATCH] [0/8] Various kernel mapping bug fixes Andi Kleen
@ 2008-02-11  9:34 ` Andi Kleen
  2008-02-11  9:45   ` Thomas Gleixner
  2008-02-11  9:34 ` [PATCH] [2/8] CPA: Flush the caches when setting pages not present Andi Kleen
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 37+ messages in thread
From: Andi Kleen @ 2008-02-11  9:34 UTC (permalink / raw)
  To: tglx, mingo, linux-kernel


Use correct page sizes and masks for GB pages in try_preserve_large_page()

This prevents a boot hang on a GB capable system with CONFIG_DIRECT_GBPAGES
enabled.

Signed-off-by: Andi Kleen <ak@suse.de>

---
 arch/x86/mm/pageattr.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

Index: linux/arch/x86/mm/pageattr.c
===================================================================
--- linux.orig/arch/x86/mm/pageattr.c
+++ linux/arch/x86/mm/pageattr.c
@@ -278,8 +278,8 @@ try_preserve_large_page(pte_t *kpte, uns
 		break;
 #ifdef CONFIG_X86_64
 	case PG_LEVEL_1G:
-		psize = PMD_PAGE_SIZE;
-		pmask = PMD_PAGE_MASK;
+		psize = PUD_PAGE_SIZE;
+		pmask = PUD_PAGE_MASK;
 		break;
 #endif
 	default:

^ permalink raw reply	[flat|nested] 37+ messages in thread

* [PATCH] [2/8] CPA: Flush the caches when setting pages not present
  2008-02-11  9:34 [PATCH] [0/8] Various kernel mapping bug fixes Andi Kleen
  2008-02-11  9:34 ` [PATCH] [1/8] CPA: Fix gbpages support in try_preserve_lage_page Andi Kleen
@ 2008-02-11  9:34 ` Andi Kleen
  2008-02-11 11:00   ` Ingo Molnar
  2008-02-11  9:34 ` [PATCH] [3/8] CPA: Test the correct mapping alias on x86-64 Andi Kleen
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 37+ messages in thread
From: Andi Kleen @ 2008-02-11  9:34 UTC (permalink / raw)
  To: tglx, mingo, linux-kernel


e.g. the AMD64 pci-gart code sets pages not present to prevent potential 
cache coherency problems.  When doing this it is probably safer to flush the 
caches too. So consider clearing of the present bit as a cache flush indicator.

Note that debug pagealloc marks pages regularly not present either, but it won't
call this because it calls directly into a lower level function.

I have not actually seen failures from this, but it seems safer 
to do it this way.

Signed-off-by: Andi Kleen <ak@suse.de>

---
 arch/x86/mm/pageattr.c |   13 ++++++++++---
 1 file changed, 10 insertions(+), 3 deletions(-)

Index: linux/arch/x86/mm/pageattr.c
===================================================================
--- linux.orig/arch/x86/mm/pageattr.c
+++ linux/arch/x86/mm/pageattr.c
@@ -670,9 +670,16 @@ static int __change_page_attr_set_clr(st
 	return 0;
 }
 
-static inline int cache_attr(pgprot_t attr)
+static inline int cache_attr(pgprot_t set, pgprot_t clr)
 {
-	return pgprot_val(attr) &
+	/*
+	 * Clearing pages is usually done for cache cohereny reasons
+	 * (except for pagealloc debug, but that doesn't call this anyways)
+	 * It's safer to flush the caches in this case too.
+	 */
+	if (pgprot_val(clr) & _PAGE_PRESENT)
+		return 1;
+	return pgprot_val(set) &
 		(_PAGE_PAT | _PAGE_PAT_LARGE | _PAGE_PWT | _PAGE_PCD);
 }
 
@@ -709,7 +716,7 @@ static int change_page_attr_set_clr(unsi
 	 * No need to flush, when we did not set any of the caching
 	 * attributes:
 	 */
-	cache = cache_attr(mask_set);
+	cache = cache_attr(mask_set, mask_clr);
 
 	/*
 	 * On success we use clflush, when the CPU supports it to

^ permalink raw reply	[flat|nested] 37+ messages in thread

* [PATCH] [3/8] CPA: Test the correct mapping alias on x86-64
  2008-02-11  9:34 [PATCH] [0/8] Various kernel mapping bug fixes Andi Kleen
  2008-02-11  9:34 ` [PATCH] [1/8] CPA: Fix gbpages support in try_preserve_lage_page Andi Kleen
  2008-02-11  9:34 ` [PATCH] [2/8] CPA: Flush the caches when setting pages not present Andi Kleen
@ 2008-02-11  9:34 ` Andi Kleen
  2008-02-11 11:49   ` Ingo Molnar
  2008-02-11  9:34 ` [PATCH] [4/8] CPA: Fix set_memory_x for ioremap Andi Kleen
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 37+ messages in thread
From: Andi Kleen @ 2008-02-11  9:34 UTC (permalink / raw)
  To: tglx, mingo, linux-kernel


static_protections previously would test against the x86-64 kernel mapping 
twice. First against the unchanged symbol directly from the linker (which
always points into the kernel mapping) and then again it would manually
relocate the address into the kernel mapping and test again.

This patch reverses the second test instead to test against the direct mapping
(low) aliases virtual addresses which was probably intended in the first place.

Simply use __pa and __va for that.

This also allows to drop an ifdef because it will work both on 32bit and 64bit.

Signed-off-by: Andi Kleen <ak@suse.de>

---
 arch/x86/mm/pageattr.c |   22 +++++++---------------
 1 file changed, 7 insertions(+), 15 deletions(-)

Index: linux/arch/x86/mm/pageattr.c
===================================================================
--- linux.orig/arch/x86/mm/pageattr.c
+++ linux/arch/x86/mm/pageattr.c
@@ -128,17 +128,9 @@ static void cpa_flush_range(unsigned lon
 #define HIGH_MAP_START	__START_KERNEL_map
 #define HIGH_MAP_END	(__START_KERNEL_map + KERNEL_TEXT_SIZE)
 
-
-/*
- * Converts a virtual address to a X86-64 highmap address
- */
-static unsigned long virt_to_highmap(void *address)
+static unsigned long virt_to_direct(void *address)
 {
-#ifdef CONFIG_X86_64
-	return __pa((unsigned long)address) + HIGH_MAP_START - phys_base;
-#else
-	return (unsigned long)address;
-#endif
+	return (unsigned long)__va(__pa(address));
 }
 
 /*
@@ -165,9 +157,9 @@ static inline pgprot_t static_protection
 	if (within(address, (unsigned long)_text, (unsigned long)_etext))
 		pgprot_val(forbidden) |= _PAGE_NX;
 	/*
-	 * Do the same for the x86-64 high kernel mapping
+	 * Do the same for the direct map alias of the x86-64 text mapping
 	 */
-	if (within(address, virt_to_highmap(_text), virt_to_highmap(_etext)))
+	if (within(address, virt_to_direct(_text), virt_to_direct(_etext)))
 		pgprot_val(forbidden) |= _PAGE_NX;
 
 	/* The .rodata section needs to be read-only */
@@ -175,10 +167,10 @@ static inline pgprot_t static_protection
 				(unsigned long)__end_rodata))
 		pgprot_val(forbidden) |= _PAGE_RW;
 	/*
-	 * Do the same for the x86-64 high kernel mapping
+	 * Do the same for the direct map alias of the x86-64 text mapping
 	 */
-	if (within(address, virt_to_highmap(__start_rodata),
-				virt_to_highmap(__end_rodata)))
+	if (within(address, virt_to_direct(__start_rodata),
+				virt_to_direct(__end_rodata)))
 		pgprot_val(forbidden) |= _PAGE_RW;
 
 	prot = __pgprot(pgprot_val(prot) & ~pgprot_val(forbidden));

^ permalink raw reply	[flat|nested] 37+ messages in thread

* [PATCH] [4/8] CPA: Fix set_memory_x for ioremap
  2008-02-11  9:34 [PATCH] [0/8] Various kernel mapping bug fixes Andi Kleen
                   ` (2 preceding siblings ...)
  2008-02-11  9:34 ` [PATCH] [3/8] CPA: Test the correct mapping alias on x86-64 Andi Kleen
@ 2008-02-11  9:34 ` Andi Kleen
  2008-02-11 12:27   ` Ingo Molnar
  2008-02-11  9:34 ` [PATCH] [5/8] Fix logic error in 64bit memory hotadd Andi Kleen
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 37+ messages in thread
From: Andi Kleen @ 2008-02-11  9:34 UTC (permalink / raw)
  To: ying.huang, tglx, mingo, linux-kernel


EFI currently calls set_memory_x() on potentially ioremapped addresses.

This is problematic for several reasons: 

- The cpa code internally calls __pa on it which does not work for remapped
addresses and will give some random result.
- cpa will try to change all potential aliases (like the kernel mapping
on x86-64), but that is not needed for NX because the caller does only needs its 
specific virtual address executable. There is no requirement in the x86 architecture 
for nx bits to be coherent between mapping aliases. Also with the
previous problem of __pa returning a wrong address it would likely try to change 
some random other page if you're unlucky and the random result would
match the kernel text range.

There would be several possible ways to fix this:
- Simply don't set the NX bit in the original ioremap and drop
set_memory_x and add a ioremap_exec(). That would be my preferred solution, 
but unfortunately has been dismissed before
- Drop all __pas and always use the physical address derived 
from the looked up PTE. This would need some significant restructuring
and would only fix the first problem above, not the second.
- Special case NX clear to change any aliases. I chose this one
because it happens to fix both problems, so is both a fix
and a optimization.

This implies that it's still not safe calling set_memory_(not x) on
any ioremaped/vmalloced/module addresses. 

I don't have a EFI system so this is untested.

Cc: ying.huang@intel.com

Signed-off-by: Andi Kleen <ak@suse.de>

---
 arch/x86/mm/pageattr.c |    7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

Index: linux/arch/x86/mm/pageattr.c
===================================================================
--- linux.orig/arch/x86/mm/pageattr.c
+++ linux/arch/x86/mm/pageattr.c
@@ -28,6 +28,7 @@ struct cpa_data {
 	pgprot_t	mask_clr;
 	int		numpages;
 	int		flushtlb;
+	int		addronly;
 };
 
 static inline int
@@ -602,7 +603,7 @@ static int change_page_attr_addr(struct 
 	 * fixup the low mapping first. __va() returns the virtual
 	 * address in the linear mapping:
 	 */
-	if (within(address, HIGH_MAP_START, HIGH_MAP_END))
+	if (within(address, HIGH_MAP_START, HIGH_MAP_END) && !cpa->addronly)
 		address = (unsigned long) __va(phys_addr);
 #endif
 
@@ -615,7 +616,7 @@ static int change_page_attr_addr(struct 
 	 * If the physical address is inside the kernel map, we need
 	 * to touch the high mapped kernel as well:
 	 */
-	if (within(phys_addr, 0, KERNEL_TEXT_SIZE)) {
+	if (!cpa->addronly && within(phys_addr, 0, KERNEL_TEXT_SIZE)) {
 		/*
 		 * Calc the high mapping address. See __phys_addr()
 		 * for the non obvious details.
@@ -695,6 +696,8 @@ static int change_page_attr_set_clr(unsi
 	cpa.mask_set = mask_set;
 	cpa.mask_clr = mask_clr;
 	cpa.flushtlb = 0;
+	cpa.addronly = !pgprot_val(mask_set) &&
+			pgprot_val(mask_clr) == _PAGE_NX;
 
 	ret = __change_page_attr_set_clr(&cpa);
 

^ permalink raw reply	[flat|nested] 37+ messages in thread

* [PATCH] [5/8] Fix logic error in 64bit memory hotadd
  2008-02-11  9:34 [PATCH] [0/8] Various kernel mapping bug fixes Andi Kleen
                   ` (3 preceding siblings ...)
  2008-02-11  9:34 ` [PATCH] [4/8] CPA: Fix set_memory_x for ioremap Andi Kleen
@ 2008-02-11  9:34 ` Andi Kleen
  2008-02-11 12:46   ` Ingo Molnar
  2008-02-11  9:34 ` [PATCH] [6/8] Account overlapped mappings in end_pfn_map Andi Kleen
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 37+ messages in thread
From: Andi Kleen @ 2008-02-11  9:34 UTC (permalink / raw)
  To: tglx, mingo, linux-kernel


The memory hotadd code assumes that the pud always exists already, but 
that might be not true. Allocate it if it isn't there.

Signed-off-by: Andi Kleen <ak@suse.de>

---
 arch/x86/mm/init_64.c |   13 +++++++++----
 1 file changed, 9 insertions(+), 4 deletions(-)

Index: linux/arch/x86/mm/init_64.c
===================================================================
--- linux.orig/arch/x86/mm/init_64.c
+++ linux/arch/x86/mm/init_64.c
@@ -432,17 +432,22 @@ void __init_refok init_memory_mapping(un
 		unsigned long pud_phys;
 		pud_t *pud;
 
-		if (after_bootmem)
+		if (after_bootmem) {
 			pud = pud_offset(pgd, start & PGDIR_MASK);
-		else
+			if (!pud_present(*pud)) {
+				pud = (pud_t *)get_zeroed_page(GFP_ATOMIC);
+				if (!pud)
+					panic("Out of memory");
+			}
+			pud_phys = __pa(pud);
+		} else
 			pud = alloc_low_page(&pud_phys);
 
 		next = start + PGDIR_SIZE;
 		if (next > end)
 			next = end;
 		phys_pud_init(pud, __pa(start), __pa(next));
-		if (!after_bootmem)
-			set_pgd(pgd_offset_k(start), mk_kernel_pgd(pud_phys));
+		set_pgd(pgd_offset_k(start), mk_kernel_pgd(pud_phys));
 		unmap_low_page(pud);
 	}
 

^ permalink raw reply	[flat|nested] 37+ messages in thread

* [PATCH] [6/8] Account overlapped mappings in end_pfn_map
  2008-02-11  9:34 [PATCH] [0/8] Various kernel mapping bug fixes Andi Kleen
                   ` (4 preceding siblings ...)
  2008-02-11  9:34 ` [PATCH] [5/8] Fix logic error in 64bit memory hotadd Andi Kleen
@ 2008-02-11  9:34 ` Andi Kleen
  2008-02-11 13:08   ` Ingo Molnar
  2008-02-11 15:12   ` Arjan van de Ven
  2008-02-11  9:34 ` [PATCH] [7/8] Implement true end_pfn_mapped for 32bit Andi Kleen
  2008-02-11  9:34 ` [PATCH] [8/8] RFC: Fix some EFI problems Andi Kleen
  7 siblings, 2 replies; 37+ messages in thread
From: Andi Kleen @ 2008-02-11  9:34 UTC (permalink / raw)
  To: ying.huang, tglx, mingo, linux-kernel


When end_pfn is not aligned to 2MB (or 1GB) then the kernel might
map more memory than end_pfn. Account this in end_pfn_mapped.

This is needed for other code that needs to know the true
mapping alias state (in this case EFI).

But it's also more correct in general

Cc: ying.huang@intel.com

Signed-off-by: Andi Kleen <ak@suse.de>

---
 arch/x86/kernel/setup_64.c |    4 +++-
 arch/x86/mm/init_64.c      |   33 +++++++++++++++++++++++----------
 include/asm-x86/proto.h    |    2 +-
 3 files changed, 27 insertions(+), 12 deletions(-)

Index: linux/arch/x86/mm/init_64.c
===================================================================
--- linux.orig/arch/x86/mm/init_64.c
+++ linux/arch/x86/mm/init_64.c
@@ -287,7 +287,7 @@ __meminit void early_iounmap(void *addr,
 	__flush_tlb_all();
 }
 
-static void __meminit
+static unsigned long __meminit
 phys_pmd_init(pmd_t *pmd_page, unsigned long address, unsigned long end)
 {
 	int i = pmd_index(address);
@@ -309,21 +309,25 @@ phys_pmd_init(pmd_t *pmd_page, unsigned 
 		set_pte((pte_t *)pmd,
 			pfn_pte(address >> PAGE_SHIFT, PAGE_KERNEL_LARGE));
 	}
+	return address;
 }
 
-static void __meminit
+static unsigned long __meminit
 phys_pmd_update(pud_t *pud, unsigned long address, unsigned long end)
 {
+	unsigned long r;
 	pmd_t *pmd = pmd_offset(pud, 0);
 	spin_lock(&init_mm.page_table_lock);
-	phys_pmd_init(pmd, address, end);
+	r = phys_pmd_init(pmd, address, end);
 	spin_unlock(&init_mm.page_table_lock);
 	__flush_tlb_all();
+	return r;
 }
 
-static void __meminit
+static unsigned long __meminit
 phys_pud_init(pud_t *pud_page, unsigned long addr, unsigned long end)
 {
+	unsigned long true_end = end;
 	int i = pud_index(addr);
 
 	for (; i < PTRS_PER_PUD; i++, addr = (addr & PUD_MASK) + PUD_SIZE) {
@@ -342,13 +346,14 @@ phys_pud_init(pud_t *pud_page, unsigned 
 
 		if (pud_val(*pud)) {
 			if (!pud_large(*pud))
-				phys_pmd_update(pud, addr, end);
+				true_end = phys_pmd_update(pud, addr, end);
 			continue;
 		}
 
 		if (direct_gbpages) {
 			set_pte((pte_t *)pud,
 				pfn_pte(addr >> PAGE_SHIFT, PAGE_KERNEL_LARGE));
+			true_end = (addr & PUD_MASK) + PUD_SIZE;
 			continue;
 		}
 
@@ -356,12 +361,14 @@ phys_pud_init(pud_t *pud_page, unsigned 
 
 		spin_lock(&init_mm.page_table_lock);
 		set_pud(pud, __pud(pmd_phys | _KERNPG_TABLE));
-		phys_pmd_init(pmd, addr, end);
+		true_end = phys_pmd_init(pmd, addr, end);
 		spin_unlock(&init_mm.page_table_lock);
 
 		unmap_low_page(pmd);
 	}
 	__flush_tlb_all();
+
+	return true_end >> PAGE_SHIFT;
 }
 
 static void __init find_early_table_space(unsigned long end)
@@ -406,9 +413,10 @@ static void __init init_gbpages(void)
  * This runs before bootmem is initialized and gets pages directly from
  * the physical memory. To access them they are temporarily mapped.
  */
-void __init_refok init_memory_mapping(unsigned long start, unsigned long end)
+unsigned long __init_refok
+init_memory_mapping(unsigned long start, unsigned long end)
 {
-	unsigned long next;
+	unsigned long next, true_end = end;
 
 	pr_debug("init_memory_mapping\n");
 
@@ -446,7 +454,7 @@ void __init_refok init_memory_mapping(un
 		next = start + PGDIR_SIZE;
 		if (next > end)
 			next = end;
-		phys_pud_init(pud, __pa(start), __pa(next));
+		true_end = phys_pud_init(pud, __pa(start), __pa(next));
 		set_pgd(pgd_offset_k(start), mk_kernel_pgd(pud_phys));
 		unmap_low_page(pud);
 	}
@@ -458,6 +466,8 @@ void __init_refok init_memory_mapping(un
 	if (!after_bootmem)
 		reserve_early(table_start << PAGE_SHIFT,
 				 table_end << PAGE_SHIFT, "PGTABLE");
+
+	return true_end;
 }
 
 #ifndef CONFIG_NUMA
@@ -499,9 +509,12 @@ int arch_add_memory(int nid, u64 start, 
 	struct zone *zone = pgdat->node_zones + ZONE_NORMAL;
 	unsigned long start_pfn = start >> PAGE_SHIFT;
 	unsigned long nr_pages = size >> PAGE_SHIFT;
+	unsigned long r;
 	int ret;
 
-	init_memory_mapping(start, start + size-1);
+	r = init_memory_mapping(start, start + size-1);
+	if (r > end_pfn_map)
+		end_pfn_map = r;
 
 	ret = __add_pages(zone, start_pfn, nr_pages);
 	WARN_ON(1);
Index: linux/include/asm-x86/proto.h
===================================================================
--- linux.orig/include/asm-x86/proto.h
+++ linux/include/asm-x86/proto.h
@@ -7,7 +7,8 @@
 
 extern void early_idt_handler(void);
 
-extern void init_memory_mapping(unsigned long start, unsigned long end);
+extern unsigned long init_memory_mapping(unsigned long start,
+					 unsigned long end);
 
 extern void system_call(void);
 extern void syscall_init(void);
Index: linux/arch/x86/kernel/setup_64.c
===================================================================
--- linux.orig/arch/x86/kernel/setup_64.c
+++ linux/arch/x86/kernel/setup_64.c
@@ -341,7 +341,7 @@ void __init setup_arch(char **cmdline_p)
 
 	check_efer();
 
-	init_memory_mapping(0, (end_pfn_map << PAGE_SHIFT));
+	end_pfn_map = init_memory_mapping(0, (end_pfn_map << PAGE_SHIFT));
 	if (efi_enabled)
 		efi_init();
 

^ permalink raw reply	[flat|nested] 37+ messages in thread

* [PATCH] [7/8] Implement true end_pfn_mapped for 32bit
  2008-02-11  9:34 [PATCH] [0/8] Various kernel mapping bug fixes Andi Kleen
                   ` (5 preceding siblings ...)
  2008-02-11  9:34 ` [PATCH] [6/8] Account overlapped mappings in end_pfn_map Andi Kleen
@ 2008-02-11  9:34 ` Andi Kleen
  2008-02-12 19:39   ` Thomas Gleixner
  2008-02-11  9:34 ` [PATCH] [8/8] RFC: Fix some EFI problems Andi Kleen
  7 siblings, 1 reply; 37+ messages in thread
From: Andi Kleen @ 2008-02-11  9:34 UTC (permalink / raw)
  To: ying.huang, tglx, mingo, linux-kernel


Even on 32bit 2MB pages can map more memory than is in the true
max_low_pfn if end_pfn is not highmem and not aligned to 2MB. 
Add a end_pfn_map similar to x86-64 that accounts for this 
fact. This is important for code that really needs to know about
all mapping aliases. Needed for followup patches (in this case EFI)

Cc: ying.huang@intel.com

Signed-off-by: Andi Kleen <ak@suse.de>

---
 arch/x86/mm/init_32.c     |    5 +++++
 include/asm-x86/page.h    |    4 +++-
 include/asm-x86/page_64.h |    1 -
 3 files changed, 8 insertions(+), 2 deletions(-)

Index: linux/arch/x86/mm/init_32.c
===================================================================
--- linux.orig/arch/x86/mm/init_32.c
+++ linux/arch/x86/mm/init_32.c
@@ -50,6 +50,8 @@
 
 unsigned int __VMALLOC_RESERVE = 128 << 20;
 
+unsigned long end_pfn_map;
+
 DEFINE_PER_CPU(struct mmu_gather, mmu_gathers);
 unsigned long highstart_pfn, highend_pfn;
 
@@ -193,6 +195,7 @@ static void __init kernel_physical_mappi
 				set_pmd(pmd, pfn_pmd(pfn, prot));
 
 				pfn += PTRS_PER_PTE;
+				end_pfn_map = pfn;
 				continue;
 			}
 			pte = one_page_table_init(pmd);
@@ -207,6 +210,7 @@ static void __init kernel_physical_mappi
 
 				set_pte(pte, pfn_pte(pfn, prot));
 			}
+			end_pfn_map = pfn;
 		}
 	}
 }
Index: linux/include/asm-x86/page.h
===================================================================
--- linux.orig/include/asm-x86/page.h
+++ linux/include/asm-x86/page.h
@@ -36,7 +36,7 @@
 #define max_pfn_mapped		end_pfn_map
 #else
 #include <asm/page_32.h>
-#define max_pfn_mapped		max_low_pfn
+#define max_pfn_mapped		end_pfn_map
 #endif	/* CONFIG_X86_64 */
 
 #define PAGE_OFFSET		((unsigned long)__PAGE_OFFSET)
@@ -51,6 +51,8 @@
 extern int page_is_ram(unsigned long pagenr);
 extern int devmem_is_allowed(unsigned long pagenr);
 
+extern unsigned long end_pfn_map;
+
 struct page;
 
 static void inline clear_user_page(void *page, unsigned long vaddr,
Index: linux/include/asm-x86/page_64.h
===================================================================
--- linux.orig/include/asm-x86/page_64.h
+++ linux/include/asm-x86/page_64.h
@@ -55,7 +55,6 @@ void clear_page(void *page);
 void copy_page(void *to, void *from);
 
 extern unsigned long end_pfn;
-extern unsigned long end_pfn_map;
 extern unsigned long phys_base;
 
 extern unsigned long __phys_addr(unsigned long);

^ permalink raw reply	[flat|nested] 37+ messages in thread

* [PATCH] [8/8] RFC: Fix some EFI problems
  2008-02-11  9:34 [PATCH] [0/8] Various kernel mapping bug fixes Andi Kleen
                   ` (6 preceding siblings ...)
  2008-02-11  9:34 ` [PATCH] [7/8] Implement true end_pfn_mapped for 32bit Andi Kleen
@ 2008-02-11  9:34 ` Andi Kleen
  2008-02-12 20:04   ` Thomas Gleixner
  7 siblings, 1 reply; 37+ messages in thread
From: Andi Kleen @ 2008-02-11  9:34 UTC (permalink / raw)
  To: ying.huang, tglx, mingo, linux-kernel


>From code review the EFI memory map handling has a couple of problems:

- The test for _WB memory was reversed so it would set cache able memory
to uncached
- It would always set a wrong uninitialized zero address to uncached
(so I suspect it always set the first few pages in phys memory to uncached,
that is why it may have gone unnoticed) 
- It would call set_memory_x() on a fixmap address that it doesn't
handle correct.
- Some other problems I commented in the code (but was unable to solve
for now) 

I changed the ioremaps to set the correct caching attributes
and also corrected the ordering so it looks roughly correct now.

This is an RFC, because I don't have a EFI system to test.

Cc: ying.huang@intel.com

Signed-off-by: Andi Kleen <ak@suse.de>

---
 arch/x86/kernel/efi.c    |   14 ++++++++------
 arch/x86/kernel/efi_64.c |    6 ++++--
 include/asm-x86/efi.h    |    5 +++--
 3 files changed, 15 insertions(+), 10 deletions(-)

Index: linux/arch/x86/kernel/efi.c
===================================================================
--- linux.orig/arch/x86/kernel/efi.c
+++ linux/arch/x86/kernel/efi.c
@@ -423,13 +423,15 @@ void __init efi_enter_virtual_mode(void)
 		size = md->num_pages << EFI_PAGE_SHIFT;
 		end = md->phys_addr + size;
 
-		if ((end >> PAGE_SHIFT) <= max_pfn_mapped)
+		/* RED-PEN does not handle overlapped areas */
+		if ((end >> PAGE_SHIFT) <= max_pfn_mapped) {
 			va = __va(md->phys_addr);
-		else
-			va = efi_ioremap(md->phys_addr, size);
-
-		if (md->attribute & EFI_MEMORY_WB)
-			set_memory_uc(md->virt_addr, size);
+			/* RED-PEN spec and ia64 have a lot more flags */
+			if (!(md->attribute & EFI_MEMORY_WB))
+				set_memory_uc(md->virt_addr, size);
+		} else
+			va = efi_ioremap(md->phys_addr, size,
+				!!(md->attribute & EFI_MEMORY_WB));
 
 		md->virt_addr = (u64) (unsigned long) va;
 
Index: linux/arch/x86/kernel/efi_64.c
===================================================================
--- linux.orig/arch/x86/kernel/efi_64.c
+++ linux/arch/x86/kernel/efi_64.c
@@ -109,7 +109,8 @@ void __init efi_reserve_bootmem(void)
 				memmap.nr_map * memmap.desc_size);
 }
 
-void __iomem * __init efi_ioremap(unsigned long phys_addr, unsigned long size)
+void __iomem * __init efi_ioremap(unsigned long phys_addr, unsigned long size,
+				  int cache)
 {
 	static unsigned pages_mapped;
 	unsigned i, pages;
@@ -124,7 +125,8 @@ void __iomem * __init efi_ioremap(unsign
 
 	for (i = 0; i < pages; i++) {
 		__set_fixmap(FIX_EFI_IO_MAP_FIRST_PAGE - pages_mapped,
-			     phys_addr, PAGE_KERNEL);
+			     phys_addr,
+			     cache ? PAGE_KERNEL : PAGE_KERNEL_NOCACHE);
 		phys_addr += PAGE_SIZE;
 		pages_mapped++;
 	}
Index: linux/include/asm-x86/efi.h
===================================================================
--- linux.orig/include/asm-x86/efi.h
+++ linux/include/asm-x86/efi.h
@@ -33,7 +33,8 @@ extern unsigned long asmlinkage efi_call
 #define efi_call_virt6(f, a1, a2, a3, a4, a5, a6)	\
 	efi_call_virt(f, a1, a2, a3, a4, a5, a6)
 
-#define efi_ioremap(addr, size)			ioremap_cache(addr, size)
+#define efi_ioremap(addr, size, cache)	\
+	(cache ? ioremap_cache(addr, size) : ioremap_nocache(addr, size))
 
 #else /* !CONFIG_X86_32 */
 
@@ -86,7 +87,7 @@ extern u64 efi_call6(void *fp, u64 arg1,
 	efi_call6((void *)(efi.systab->runtime->f), (u64)(a1), (u64)(a2), \
 		  (u64)(a3), (u64)(a4), (u64)(a5), (u64)(a6))
 
-extern void *efi_ioremap(unsigned long addr, unsigned long size);
+extern void *efi_ioremap(unsigned long addr, unsigned long size, int cache);
 
 #endif /* CONFIG_X86_32 */
 

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] [1/8] CPA: Fix gbpages support in try_preserve_lage_page
  2008-02-11  9:34 ` [PATCH] [1/8] CPA: Fix gbpages support in try_preserve_lage_page Andi Kleen
@ 2008-02-11  9:45   ` Thomas Gleixner
  2008-02-11 10:12     ` Ingo Molnar
  0 siblings, 1 reply; 37+ messages in thread
From: Thomas Gleixner @ 2008-02-11  9:45 UTC (permalink / raw)
  To: Andi Kleen; +Cc: mingo, linux-kernel

On Mon, 11 Feb 2008, Andi Kleen wrote:

> 
> Use correct page sizes and masks for GB pages in try_preserve_large_page()
> 
> This prevents a boot hang on a GB capable system with CONFIG_DIRECT_GBPAGES
> enabled.

Doh, yes. Applied.

Thanks,
	tglx

> Signed-off-by: Andi Kleen <ak@suse.de>
> 
> ---
>  arch/x86/mm/pageattr.c |    4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> Index: linux/arch/x86/mm/pageattr.c
> ===================================================================
> --- linux.orig/arch/x86/mm/pageattr.c
> +++ linux/arch/x86/mm/pageattr.c
> @@ -278,8 +278,8 @@ try_preserve_large_page(pte_t *kpte, uns
>  		break;
>  #ifdef CONFIG_X86_64
>  	case PG_LEVEL_1G:
> -		psize = PMD_PAGE_SIZE;
> -		pmask = PMD_PAGE_MASK;
> +		psize = PUD_PAGE_SIZE;
> +		pmask = PUD_PAGE_MASK;
>  		break;
>  #endif
>  	default:
> 

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] [1/8] CPA: Fix gbpages support in try_preserve_lage_page
  2008-02-11  9:45   ` Thomas Gleixner
@ 2008-02-11 10:12     ` Ingo Molnar
  2008-02-11 11:01       ` Andi Kleen
  0 siblings, 1 reply; 37+ messages in thread
From: Ingo Molnar @ 2008-02-11 10:12 UTC (permalink / raw)
  To: Thomas Gleixner; +Cc: Andi Kleen, linux-kernel


* Thomas Gleixner <tglx@linutronix.de> wrote:

> On Mon, 11 Feb 2008, Andi Kleen wrote:
> 
> > Use correct page sizes and masks for GB pages in 
> > try_preserve_large_page()
> > 
> > This prevents a boot hang on a GB capable system with 
> > CONFIG_DIRECT_GBPAGES enabled.
> 
> Doh, yes. Applied.

btw., v2.6.25-rc1 is not affected by this. Gbpages support is a "still 
cooking" feature only available in x86.git#mm, destined for .26, and 
even there it's default disabled and only available on Barcelona class 
hardware. (Some of these bits are upstream already to make maintenance 
easier, but they are inactive code.)

	Ingo

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] [2/8] CPA: Flush the caches when setting pages not present
  2008-02-11  9:34 ` [PATCH] [2/8] CPA: Flush the caches when setting pages not present Andi Kleen
@ 2008-02-11 11:00   ` Ingo Molnar
  2008-02-11 12:26     ` Andi Kleen
  0 siblings, 1 reply; 37+ messages in thread
From: Ingo Molnar @ 2008-02-11 11:00 UTC (permalink / raw)
  To: Andi Kleen; +Cc: tglx, linux-kernel


* Andi Kleen <ak@suse.de> wrote:

> e.g. the AMD64 pci-gart code sets pages not present to prevent 
> potential cache coherency problems.  When doing this it is probably 
> safer to flush the caches too. So consider clearing of the present bit 
> as a cache flush indicator.

uhm, but why? "Probably safer" is not the right kind of technical 
argument ;-)

The PCI-GART quirk which we use on some systems unmaps pages _not_ 
because of coherency problems (any cache coherency problems would be 
triggerable before we unmap them anyway), but due to _page aliasing_ 
problems: these pages might be in our general e820 map so we must 
disable them.

( in any case, the CPA code does not deal with cache attribute coherency 
  - that is topic of the upcoming PAT feature. We still largely rely on 
  MTRRs to get cache coherency right. )

furthermore, your patch does not even do what you claim it does, because
it is an elaborate NOP on most modern CPUs:

> +static inline int cache_attr(pgprot_t set, pgprot_t clr)
>  {
> -	return pgprot_val(attr) &
> +	/*
> +	 * Clearing pages is usually done for cache cohereny reasons
> +	 * (except for pagealloc debug, but that doesn't call this anyways)
> +	 * It's safer to flush the caches in this case too.
> +	 */
> +	if (pgprot_val(clr) & _PAGE_PRESENT)
> +		return 1;

as clflush does not work on not present pages...

	Ingo

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] [1/8] CPA: Fix gbpages support in try_preserve_lage_page
  2008-02-11 10:12     ` Ingo Molnar
@ 2008-02-11 11:01       ` Andi Kleen
  0 siblings, 0 replies; 37+ messages in thread
From: Andi Kleen @ 2008-02-11 11:01 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Thomas Gleixner, linux-kernel

Ingo Molnar <mingo@elte.hu> writes:
> 
> btw., v2.6.25-rc1 is not affected by this. Gbpages support is a "still 
> cooking" feature only available in x86.git#mm, destined for .26, and 
> even there it's default disabled and only available on Barcelona class 
> hardware. (Some of these bits are upstream already to make maintenance 
> easier, but they are inactive code.)

This means it will be tested for ~6 months. Don't you think that's a little
excessive for such a simple patch? 

% git show 6dc6ffff706db6c70a98253e7dc7011e4343df6c | diffstat 
 init_64.c |   29 ++++++++++++++++++++++++-----
 1 files changed, 24 insertions(+), 5 deletions(-)

One option would be also to merge it to .25 and mark the
CONFIG option EXPERIMENTAL.

-Andi

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] [3/8] CPA: Test the correct mapping alias on x86-64
  2008-02-11  9:34 ` [PATCH] [3/8] CPA: Test the correct mapping alias on x86-64 Andi Kleen
@ 2008-02-11 11:49   ` Ingo Molnar
  0 siblings, 0 replies; 37+ messages in thread
From: Ingo Molnar @ 2008-02-11 11:49 UTC (permalink / raw)
  To: Andi Kleen; +Cc: tglx, linux-kernel, Arjan van de Ven


* Andi Kleen <ak@suse.de> wrote:

> static_protections previously would test against the x86-64 kernel 
> mapping twice. First against the unchanged symbol directly from the 
> linker (which always points into the kernel mapping) and then again it 
> would manually relocate the address into the kernel mapping and test 
> again.
> 
> This patch reverses the second test instead to test against the direct 
> mapping (low) aliases virtual addresses which was probably intended in 
> the first place.
> 
> Simply use __pa and __va for that.

thanks, applied.

( the practical implications of this are low because we do not utilize 
  the low direct aliases for execution. It needs to be fixed 
  nevertheless (will be needed for PAT later on anyway) and your cleanup 
  and #ifdef reduction is nice to have as well. )

	Ingo

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] [2/8] CPA: Flush the caches when setting pages not present
  2008-02-11 11:00   ` Ingo Molnar
@ 2008-02-11 12:26     ` Andi Kleen
  0 siblings, 0 replies; 37+ messages in thread
From: Andi Kleen @ 2008-02-11 12:26 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: tglx, linux-kernel

On Monday 11 February 2008 12:00:25 Ingo Molnar wrote:
> 
> * Andi Kleen <ak@suse.de> wrote:
> 
> > e.g. the AMD64 pci-gart code sets pages not present to prevent 
> > potential cache coherency problems.  When doing this it is probably 
> > safer to flush the caches too. So consider clearing of the present bit 
> > as a cache flush indicator.
> 
> uhm, but why?  "Probably safer" is not the right kind of technical  
> argument ;-)

It is safer in my opinion, but I have not seen a concrete bug triggered by
it so I qualified it.

Also it is quite common to qualify technical arguments this way BTW.
 
> The PCI-GART quirk which we use on some systems unmaps pages _not_ 
> because of coherency problems (any cache coherency problems would be 
> triggerable before we unmap them anyway),

The GART is unmapped to avoid a cache coherency problem. AGP 
normally requires cache flushes and changing to uncached to map  
pages into the GART. Otherwise you could have cache incoherence
between the original page and the remapped page.
Since it would be very expensive to call cpa on each pci_map_* 
to turn the pages uncached the  IOMMU part of the GART is unmapped instead 
so that the CPU only ever sees the original memory; never the remapped
part.

> but due to _page aliasing_  
> problems: these pages might be in our general e820 map so we must 
> disable them.
> 
> ( in any case, the CPA code does not deal with cache attribute coherency 
>   - that is topic of the upcoming PAT feature. We still largely rely on 
>   MTRRs to get cache coherency right. )

If it wouldn't deal with cache coherency surely it wouldn't need to flush
caches ...

I started pageattr.c originally to deal with a concrete cache coherency
bugs (AGP cache corruptions between GART and original page on Athlon K7).
This is so that when a page is remapped in the GART it is changed in the 
direct mapping to uncached. Without that there was a reproducible cache corruption
on some systems which was very hard to track down and debug.

Even with all your changes it still does that.

All the other uses like DEBUG_RODATA or DEBUG_PAGEALLOC only came later.

 
> furthermore, your patch does not even do what you claim it does, because
> it is an elaborate NOP on most modern CPUs:
> 
> > +static inline int cache_attr(pgprot_t set, pgprot_t clr)
> >  {
> > -	return pgprot_val(attr) &
> > +	/*
> > +	 * Clearing pages is usually done for cache cohereny reasons
> > +	 * (except for pagealloc debug, but that doesn't call this anyways)
> > +	 * It's safer to flush the caches in this case too.
> > +	 */
> > +	if (pgprot_val(clr) & _PAGE_PRESENT)
> > +		return 1;
> 
> as clflush does not work on not present pages...

Hmm, that's true. It needs to force WBINVD for this case. 
I'll revise. Thanks.

-Andi

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] [4/8] CPA: Fix set_memory_x for ioremap
  2008-02-11  9:34 ` [PATCH] [4/8] CPA: Fix set_memory_x for ioremap Andi Kleen
@ 2008-02-11 12:27   ` Ingo Molnar
  2008-02-11 12:45     ` Andi Kleen
  0 siblings, 1 reply; 37+ messages in thread
From: Ingo Molnar @ 2008-02-11 12:27 UTC (permalink / raw)
  To: Andi Kleen; +Cc: ying.huang, tglx, linux-kernel


* Andi Kleen <ak@suse.de> wrote:

> EFI currently calls set_memory_x() on potentially ioremapped 
> addresses.
> 
> This is problematic for several reasons:
> 
> - The cpa code internally calls __pa on it which does not work for 
> remapped addresses and will give some random result.

Wrong. We do call __pa() on vmalloc ranges (which is a known 
uncleanliness that we intend to fix), but contrary to your claim the 
result is not "random result". On 64-bit it's guaranteed to have a value 
above ~66 TB on 64-bit and hence fails all the filters later on so it 
has zero practical relevance at the moment. On 32-bit we transform it 
down to somewhere around 1GB - where we check it against the BIOS range 
filters - which again cannot trigger. But I do agree that it's unclean 
and needs fixing up.

Detailed analysis on 64-bit: we call __pa() here:

static int change_page_attr_addr(struct cpa_data *cpa)
...
        unsigned long phys_addr = __pa(address);

which for vmalloc area virtual addresses will indeed yield some really 
high (and invalid) physical address. That address will never trigger 
this check:

        if (within(address, HIGH_MAP_START, HIGH_MAP_END))
                address = (unsigned long) __va(phys_addr);

or this check:

        if (within(phys_addr, 0, KERNEL_TEXT_SIZE)) {

so we'll never actuall _use_ that phys_addr.

> - cpa will try to change all potential aliases (like the kernel 
> mapping on x86-64), but that is not needed for NX because the caller 
> does only needs its specific virtual address executable. There is no 
> requirement in the x86 architecture for nx bits to be coherent between 
> mapping aliases. Also with the previous problem of __pa returning a 
> wrong address it would likely try to change some random other page if 
> you're unlucky and the random result would match the kernel text 
> range.

wrong. That "random other page" is guaranteed to be above 66 TB 
physical.

Anyway, i agree that it's ugly and unintuitive and it's on our clean-up 
list. But your patch is not a good cleanup because it just hides the 
underlying weakness.

	Ingo

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] [4/8] CPA: Fix set_memory_x for ioremap
  2008-02-11 12:27   ` Ingo Molnar
@ 2008-02-11 12:45     ` Andi Kleen
  0 siblings, 0 replies; 37+ messages in thread
From: Andi Kleen @ 2008-02-11 12:45 UTC (permalink / raw)
  To: Ingo Molnar, ying.huang; +Cc: tglx, linux-kernel


> Wrong. We do call __pa() on vmalloc ranges (which is a known 
> uncleanliness that we intend to fix),

AFAIK nobody does actually currently. Although I expect sooner
or later someone will try since __ioremap() lost its pgprot argument
that made it so powerful. Best would be probably to stick
in some bugs just to catch that.

> but contrary to your claim the  
> result is not "random result". On 64-bit it's guaranteed to have a value 
> above ~66 TB on 64-bit and hence fails all the filters later on so it 
> has zero practical relevance at the moment.

Note that 64bit EFI passes in a fixmap address (they just call
it efi_ioremap). Fixmaps are in the kernel mapping which __pa() handles
and then this gives a low number likely somewhere in memory
and might well trigger.

> On 32-bit we transform it  
> down to somewhere around 1GB - where we check it against the BIOS range 
> filters - which again cannot trigger. But I do agree that it's unclean 
> and needs fixing up.

Are you sure about this for all possible __PAGE_OFFSET values? e.g. consider
1:3 split. Also there is always relocated kernels where kernels might be loaded 
quite high.

> 
> static int change_page_attr_addr(struct cpa_data *cpa)
> ...
>         unsigned long phys_addr = __pa(address);
> 
> which for vmalloc area virtual addresses will indeed yield some really 
> high (and invalid) physical address. That address will never trigger 
> this check:
> 
>         if (within(address, HIGH_MAP_START, HIGH_MAP_END))
>                 address = (unsigned long) __va(phys_addr);

That doesn't check phys_addr at all?
 
> or this check:
> 
>         if (within(phys_addr, 0, KERNEL_TEXT_SIZE)) {
> 
> so we'll never actuall _use_ that phys_addr.



> and it's on our clean-up  
> list. But your patch is not a good cleanup because it just hides the 
> underlying weakness.

I never claimed it was a cleanup. It's a fix and a optimization 
(don't do unnecessary coherency between direct mapping and other 
mappings for clearing X -- this means some innocent pages in the
direct mapping won't get split) 

Anyways even if you don't want to fix this clear bug I would ask
to still consider the optimization independently.

-Andi



^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] [5/8] Fix logic error in 64bit memory hotadd
  2008-02-11  9:34 ` [PATCH] [5/8] Fix logic error in 64bit memory hotadd Andi Kleen
@ 2008-02-11 12:46   ` Ingo Molnar
  2008-02-11 13:05     ` Andi Kleen
  0 siblings, 1 reply; 37+ messages in thread
From: Ingo Molnar @ 2008-02-11 12:46 UTC (permalink / raw)
  To: Andi Kleen; +Cc: tglx, linux-kernel


* Andi Kleen <ak@suse.de> wrote:

> The memory hotadd code assumes that the pud always exists already, but 
> that might be not true. Allocate it if it isn't there.

ok, this seems an like an ancient memory-hotplug bug. Does anyone even 
use memory hotplug currently? Did you find this bug via review, or did 
it trigger in practice?

Also, your fix, while it solves a real bug we want to fix, is not quite 
right for upstream integration yet. I can see 3 immediate problems with 
it:

> +			if (!pud_present(*pud)) {
> +				pud = (pud_t *)get_zeroed_page(GFP_ATOMIC);

the GFP_ATOMIC here can fail.

The proper solution is to instead extend init_memory_mapping() with a 
gfp_t parameter and pass in GFP_ATOMIC from the early init code (where 
we must not schedule and where GFP_ATOMIC will succeed anyway), but do a 
GFP_KERNEL from arch_add_memory().

> +				if (!pud)
> +					panic("Out of memory");

the panic() here is very rude to the user in the hotplug usecase.

The proper solution is to extend init_memory_mapping() with a return 
value, and to check in the caller. arch_add_memory() obviously does not 
want to panic(), it wants to return -ENOMEM to mm/memory_hotplug.c.

and a small style nit while changing this code:

> +		} else
>  			pud = alloc_low_page(&pud_phys);

please add curly braces to the 'else' branch too. (we generally prefer 
symmetrical curly braces) Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] [5/8] Fix logic error in 64bit memory hotadd
  2008-02-11 12:46   ` Ingo Molnar
@ 2008-02-11 13:05     ` Andi Kleen
  2008-02-11 13:33       ` Ingo Molnar
  2008-02-12 10:35       ` Yasunori Goto
  0 siblings, 2 replies; 37+ messages in thread
From: Andi Kleen @ 2008-02-11 13:05 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: tglx, linux-kernel

On Monday 11 February 2008 13:46:25 Ingo Molnar wrote:
> 
> * Andi Kleen <ak@suse.de> wrote:
> 
> > The memory hotadd code assumes that the pud always exists already, but 
> > that might be not true. Allocate it if it isn't there.
> 
> ok, this seems an like an ancient memory-hotplug bug.

Yes.

> Does anyone even  
> use memory hotplug currently? 

I don't know.

> Did you find this bug via review, or did  
> it trigger in practice?

Review.

> 
> Also, your fix, while it solves a real bug we want to fix, is not quite 
> right for upstream integration yet. I can see 3 immediate problems with 
> it:
> 
> > +			if (!pud_present(*pud)) {
> > +				pud = (pud_t *)get_zeroed_page(GFP_ATOMIC);
> 
> the GFP_ATOMIC here can fail.

The memory hotplug code already uses GFP_ATOMIC elsewhere (spp_getpage) 
 
> The proper solution is to instead extend init_memory_mapping() with a 
> gfp_t parameter and pass in GFP_ATOMIC from the early init code (where 
> we must not schedule and where GFP_ATOMIC will succeed anyway), but do a 
> GFP_KERNEL from arch_add_memory().

The existing code already does GFP_ATOMIC. I admit I haven't double
checked why it does that (didn't read the complete path) but I assume
it takes a spin lock somewhere.

If there is no lock doing a general clean up of all of them would probably
make sense.  But it would be orthogonal to my patch and I don't think
it's needed to fix this concrete bug.

The gfp argument is not needed though because this
case can be already distingushed by checking after_bootmem.

> The proper solution is to extend init_memory_mapping() with a return 
> value, and to check in the caller. arch_add_memory() obviously does not 
> want to panic(), it wants to return -ENOMEM to mm/memory_hotplug.c.

The existing code already panics elsewhere (spp_getpage); i just copied
that.

So in summary the panic&GFP_ATOMIC use are not good (I agree), but it's
not worse than what was in there before.

-Andi

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] [6/8] Account overlapped mappings in end_pfn_map
  2008-02-11  9:34 ` [PATCH] [6/8] Account overlapped mappings in end_pfn_map Andi Kleen
@ 2008-02-11 13:08   ` Ingo Molnar
  2008-02-11 13:27     ` Andi Kleen
  2008-02-11 15:12   ` Arjan van de Ven
  1 sibling, 1 reply; 37+ messages in thread
From: Ingo Molnar @ 2008-02-11 13:08 UTC (permalink / raw)
  To: Andi Kleen; +Cc: ying.huang, tglx, linux-kernel


* Andi Kleen <ak@suse.de> wrote:

> When end_pfn is not aligned to 2MB (or 1GB) then the kernel might map 
> more memory than end_pfn. Account this in end_pfn_mapped.

can you see any practical relevance? Your patch description only deals 
with the mechanical details of the change instead of analyzing the most 
important thing: relevance and impact analysis. That makes it harder to 
add your patches and easier to miss a good fix accidentally. It also 
makes it quite a bit harder to trust your patches.

at a quick glance the relevance is to EFI only, in 
efi_enter_virtual_mode(). max_pfn_mapped is used as a differentiator 
between __va() and efi_ioremap(). AFAICS EFI will typically have its 
runtime code not right after the end of physical memory.

Nevertheless, i do agree that the max_pfn_mapped/end_pfn_map limit needs 
to be sharpened to reflect reality (for later PAT support).

your patch is also a bit unclean:

> -static void __meminit
> +static unsigned long __meminit
>  phys_pmd_update(pud_t *pud, unsigned long address, unsigned long end)
>  {
> +	unsigned long r;
>  	pmd_t *pmd = pmd_offset(pud, 0);
>  	spin_lock(&init_mm.page_table_lock);
> -	phys_pmd_init(pmd, address, end);
> +	r = phys_pmd_init(pmd, address, end);
>  	spin_unlock(&init_mm.page_table_lock);
>  	__flush_tlb_all();
> +	return r;
>  }

a meaningless "r" parameter. Use "true_end" instead.

and here's another type of variable naming problem:

> +unsigned long __init_refok
> +init_memory_mapping(unsigned long start, unsigned long end)
>  {
> -	unsigned long next;
> +	unsigned long next, true_end = end;

that "true_end" is already a _pfn_, so name it accordingly: 
true_end_pfn. Because we switch between pfn and addr types it's 
important to point out where it's pfn and where it's a linear/physical 
address.

>  	struct zone *zone = pgdat->node_zones + ZONE_NORMAL;
>  	unsigned long start_pfn = start >> PAGE_SHIFT;
>  	unsigned long nr_pages = size >> PAGE_SHIFT;
> +	unsigned long r;
>  	int ret;
>  
> -	init_memory_mapping(start, start + size-1);
> +	r = init_memory_mapping(start, start + size-1);
> +	if (r > end_pfn_map)
> +		end_pfn_map = r;

rename "r" to "true_end_pfn".

	Ingo

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] [6/8] Account overlapped mappings in end_pfn_map
  2008-02-11 13:08   ` Ingo Molnar
@ 2008-02-11 13:27     ` Andi Kleen
  2008-02-11 13:55       ` Ingo Molnar
  2008-02-11 14:16       ` Peter Zijlstra
  0 siblings, 2 replies; 37+ messages in thread
From: Andi Kleen @ 2008-02-11 13:27 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: ying.huang, tglx, linux-kernel

On Monday 11 February 2008 14:08:43 Ingo Molnar wrote:
> 
> * Andi Kleen <ak@suse.de> wrote:
> 
> > When end_pfn is not aligned to 2MB (or 1GB) then the kernel might map 
> > more memory than end_pfn. Account this in end_pfn_mapped.
> 
> can you see any practical relevance? 

Yes EFI needs to know this to decide if it should ioremap or not.

> Your patch description only deals  
> with the mechanical details of the change instead of analyzing the most 
> important thing: relevance and impact analysis. 

I repeat:

"This is needed for other code that needs to know the true
mapping alias state (in this case EFI)."

and

"This is important for code that really needs to know about
all mapping aliases. Needed for followup patches (in this case EFI)"

> That makes it harder to  
> add your patches and easier to miss a good fix accidentally. It also 
> makes it quite a bit harder to trust your patches.
> 
> at a quick glance the relevance is to EFI only, in 
> efi_enter_virtual_mode(). max_pfn_mapped is used as a differentiator 
> between __va() and efi_ioremap(). AFAICS EFI will typically have its 
> runtime code not right after the end of physical memory.
> 
> Nevertheless, i do agree that the max_pfn_mapped/end_pfn_map limit needs 
> to be sharpened to reflect reality (for later PAT support).
> 
> your patch is also a bit unclean:

Ok patch with hungarized variables appended.

-Andi

---

Account overlapped mappings in end_pfn_map

When end_pfn is not aligned to 2MB (or 1GB) then the kernel might
map more memory than end_pfn. Account this in end_pfn_mapped.

This is needed for other code that needs to know the true
mapping alias state (in this case EFI).

But it's also more correct in general

Cc: ying.huang@intel.com

Signed-off-by: Andi Kleen <ak@suse.de>

---
 arch/x86/kernel/setup_64.c |    4 +++-
 arch/x86/mm/init_64.c      |   33 +++++++++++++++++++++++----------
 include/asm-x86/proto.h    |    2 +-
 3 files changed, 27 insertions(+), 12 deletions(-)

Index: linux/arch/x86/mm/init_64.c
===================================================================
--- linux.orig/arch/x86/mm/init_64.c
+++ linux/arch/x86/mm/init_64.c
@@ -287,7 +287,7 @@ __meminit void early_iounmap(void *addr,
 	__flush_tlb_all();
 }
 
-static void __meminit
+static unsigned long __meminit
 phys_pmd_init(pmd_t *pmd_page, unsigned long address, unsigned long end)
 {
 	int i = pmd_index(address);
@@ -309,21 +309,25 @@ phys_pmd_init(pmd_t *pmd_page, unsigned 
 		set_pte((pte_t *)pmd,
 			pfn_pte(address >> PAGE_SHIFT, PAGE_KERNEL_LARGE));
 	}
+	return address;
 }
 
-static void __meminit
+static unsigned long __meminit
 phys_pmd_update(pud_t *pud, unsigned long address, unsigned long end)
 {
+	unsigned long true_end;
 	pmd_t *pmd = pmd_offset(pud, 0);
 	spin_lock(&init_mm.page_table_lock);
-	phys_pmd_init(pmd, address, end);
+	true_end = phys_pmd_init(pmd, address, end);
 	spin_unlock(&init_mm.page_table_lock);
 	__flush_tlb_all();
+	return true_end;
 }
 
-static void __meminit
+static unsigned long __meminit
 phys_pud_init(pud_t *pud_page, unsigned long addr, unsigned long end)
 {
+	unsigned long true_end = end;
 	int i = pud_index(addr);
 
 	for (; i < PTRS_PER_PUD; i++, addr = (addr & PUD_MASK) + PUD_SIZE) {
@@ -342,13 +346,14 @@ phys_pud_init(pud_t *pud_page, unsigned 
 
 		if (pud_val(*pud)) {
 			if (!pud_large(*pud))
-				phys_pmd_update(pud, addr, end);
+				true_end = phys_pmd_update(pud, addr, end);
 			continue;
 		}
 
 		if (direct_gbpages) {
 			set_pte((pte_t *)pud,
 				pfn_pte(addr >> PAGE_SHIFT, PAGE_KERNEL_LARGE));
+			true_end = (addr & PUD_MASK) + PUD_SIZE;
 			continue;
 		}
 
@@ -356,12 +361,14 @@ phys_pud_init(pud_t *pud_page, unsigned 
 
 		spin_lock(&init_mm.page_table_lock);
 		set_pud(pud, __pud(pmd_phys | _KERNPG_TABLE));
-		phys_pmd_init(pmd, addr, end);
+		true_end = phys_pmd_init(pmd, addr, end);
 		spin_unlock(&init_mm.page_table_lock);
 
 		unmap_low_page(pmd);
 	}
 	__flush_tlb_all();
+
+	return true_end >> PAGE_SHIFT;
 }
 
 static void __init find_early_table_space(unsigned long end)
@@ -406,9 +413,10 @@ static void __init init_gbpages(void)
  * This runs before bootmem is initialized and gets pages directly from
  * the physical memory. To access them they are temporarily mapped.
  */
-void __init_refok init_memory_mapping(unsigned long start, unsigned long end)
+unsigned long __init_refok
+init_memory_mapping(unsigned long start, unsigned long end)
 {
-	unsigned long next;
+	unsigned long next, true_end = end;
 
 	pr_debug("init_memory_mapping\n");
 
@@ -446,7 +454,7 @@ void __init_refok init_memory_mapping(un
 		next = start + PGDIR_SIZE;
 		if (next > end)
 			next = end;
-		phys_pud_init(pud, __pa(start), __pa(next));
+		true_end = phys_pud_init(pud, __pa(start), __pa(next));
 		set_pgd(pgd_offset_k(start), mk_kernel_pgd(pud_phys));
 		unmap_low_page(pud);
 	}
@@ -458,6 +466,8 @@ void __init_refok init_memory_mapping(un
 	if (!after_bootmem)
 		reserve_early(table_start << PAGE_SHIFT,
 				 table_end << PAGE_SHIFT, "PGTABLE");
+
+	return true_end;
 }
 
 #ifndef CONFIG_NUMA
@@ -499,9 +509,12 @@ int arch_add_memory(int nid, u64 start, 
 	struct zone *zone = pgdat->node_zones + ZONE_NORMAL;
 	unsigned long start_pfn = start >> PAGE_SHIFT;
 	unsigned long nr_pages = size >> PAGE_SHIFT;
+	unsigned long true_end_pfn;
 	int ret;
 
-	init_memory_mapping(start, start + size-1);
+	true_end_pfn = init_memory_mapping(start, start + size-1);
+	if (true_end_pfn > end_pfn_map)
+		end_pfn_map = true_end_pfn;
 
 	ret = __add_pages(zone, start_pfn, nr_pages);
 	WARN_ON(1);
Index: linux/include/asm-x86/proto.h
===================================================================
--- linux.orig/include/asm-x86/proto.h
+++ linux/include/asm-x86/proto.h
@@ -7,7 +7,8 @@
 
 extern void early_idt_handler(void);
 
-extern void init_memory_mapping(unsigned long start, unsigned long end);
+extern unsigned long init_memory_mapping(unsigned long start,
+					 unsigned long end);
 
 extern void system_call(void);
 extern void syscall_init(void);
Index: linux/arch/x86/kernel/setup_64.c
===================================================================
--- linux.orig/arch/x86/kernel/setup_64.c
+++ linux/arch/x86/kernel/setup_64.c
@@ -341,7 +341,7 @@ void __init setup_arch(char **cmdline_p)
 
 	check_efer();
 
-	init_memory_mapping(0, (end_pfn_map << PAGE_SHIFT));
+	end_pfn_map = init_memory_mapping(0, (end_pfn_map << PAGE_SHIFT));
 	if (efi_enabled)
 		efi_init();
 


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] [5/8] Fix logic error in 64bit memory hotadd
  2008-02-11 13:05     ` Andi Kleen
@ 2008-02-11 13:33       ` Ingo Molnar
  2008-02-11 13:44         ` Andi Kleen
  2008-02-12 10:35       ` Yasunori Goto
  1 sibling, 1 reply; 37+ messages in thread
From: Ingo Molnar @ 2008-02-11 13:33 UTC (permalink / raw)
  To: Andi Kleen; +Cc: tglx, linux-kernel


* Andi Kleen <ak@suse.de> wrote:

> > Also, your fix, while it solves a real bug we want to fix, is not quite 
> > right for upstream integration yet. I can see 3 immediate problems with 
> > it:
> > 
> > > +			if (!pud_present(*pud)) {
> > > +				pud = (pud_t *)get_zeroed_page(GFP_ATOMIC);
> > 
> > the GFP_ATOMIC here can fail.
> 
> The memory hotplug code already uses GFP_ATOMIC elsewhere 
> (spp_getpage)

wrong. The _x86_ memory hotplug code uses GFP_ATOMIC elsewhere.
The generic memory hotplug code does not.

and the x86 memory hotplug code uses GFP_ATOMIC and panic() elsewhere 
because:

> The existing code already panics elsewhere (spp_getpage); i just 
> copied that.

and you had nothing to do with that "existing code"? git-log reveals 
that the GFP_ATOMIC and panic()-ing patch was added 2 years ago and was 
signed off by you:

  commit 44df75e629106efcada087cead6c3f33ed6bcc60
  Author: Matt Tolentino <metolent@cs.vt.edu>
  Date:   Tue Jan 17 07:03:41 2006 +0100

    [PATCH] x86_64: add x86-64 support for memory hot-add

  [...]
  Signed-off-by: Andi Kleen <ak@suse.de>

We (like most upstream kernel subsystems) generally do not accept 
patches into arch/x86 that spreads a buggy implementation detail 
further. Please submit a patch that cleans up the mess. Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] [5/8] Fix logic error in 64bit memory hotadd
  2008-02-11 13:33       ` Ingo Molnar
@ 2008-02-11 13:44         ` Andi Kleen
  0 siblings, 0 replies; 37+ messages in thread
From: Andi Kleen @ 2008-02-11 13:44 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: tglx, linux-kernel

On Monday 11 February 2008 14:33:33 Ingo Molnar wrote:
> 
> * Andi Kleen <ak@suse.de> wrote:
> 
> > > Also, your fix, while it solves a real bug we want to fix, is not quite 
> > > right for upstream integration yet. I can see 3 immediate problems with 
> > > it:
> > > 
> > > > +			if (!pud_present(*pud)) {
> > > > +				pud = (pud_t *)get_zeroed_page(GFP_ATOMIC);
> > > 
> > > the GFP_ATOMIC here can fail.
> > 
> > The memory hotplug code already uses GFP_ATOMIC elsewhere 
> > (spp_getpage)
> 
> wrong. The _x86_ memory hotplug code uses GFP_ATOMIC elsewhere.
> The generic memory hotplug code does not.

To be honest I'm a little tired now how you attempt to misinterpret every
word I write. Was it not clear from the context which code
was meant?

> 
> and the x86 memory hotplug code uses GFP_ATOMIC and panic() elsewhere 
> because:

I see it's all my fault. 

> 
> > The existing code already panics elsewhere (spp_getpage); i just 
> > copied that.
> 
> and you had nothing to do with that "existing code"? git-log reveals 
> that the GFP_ATOMIC and panic()-ing patch was added 2 years ago and was 
> signed off by you:

Should I point out all unclean and buggy code you ever signed
off?  @) Just alone in .25-rc1 there is enough of that.

> 
>   commit 44df75e629106efcada087cead6c3f33ed6bcc60
>   Author: Matt Tolentino <metolent@cs.vt.edu>
>   Date:   Tue Jan 17 07:03:41 2006 +0100
> 
>     [PATCH] x86_64: add x86-64 support for memory hot-add
> 
>   [...]
>   Signed-off-by: Andi Kleen <ak@suse.de>
> 
> We (like most upstream kernel subsystems) generally do not accept 
> patches into arch/x86 that spreads a buggy implementation detail 
> further. Please submit a patch that cleans up the mess. Thanks,

Ok I withdraw the patch under these circumstances. I'm not your coding
slave and I don't feel strongly enough about the hotplug case to 
put much more work into this.

-Andi

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] [6/8] Account overlapped mappings in end_pfn_map
  2008-02-11 13:27     ` Andi Kleen
@ 2008-02-11 13:55       ` Ingo Molnar
  2008-02-11 14:16       ` Peter Zijlstra
  1 sibling, 0 replies; 37+ messages in thread
From: Ingo Molnar @ 2008-02-11 13:55 UTC (permalink / raw)
  To: Andi Kleen; +Cc: ying.huang, tglx, linux-kernel


* Andi Kleen <ak@suse.de> wrote:

> > your patch is also a bit unclean:
> 
> Ok patch with hungarized variables appended.

My comments have nothing to do with "hungarized variables". You used 
clearly unclean and ambigious coding constructs like:

+static unsigned long __meminit
 phys_pmd_update(pud_t *pud, unsigned long address, unsigned long end)
 {
+       unsigned long r;
        pmd_t *pmd = pmd_offset(pud, 0);
        spin_lock(&init_mm.page_table_lock);
-       phys_pmd_init(pmd, address, end);
+       r = phys_pmd_init(pmd, address, end);
        spin_unlock(&init_mm.page_table_lock);
        __flush_tlb_all();
+       return r;
 }

the "r" name is totally unintuitive and the reader of the code is given 
absolutely no idea what happens here.

The cleanliness rules about descriptive variable names are obvious, 
necessary and well respected throughout the kernel, by all other kernel 
contributors i know. For years you were allowed to merge such patches. 
You should really (re-)learn and accept the rules that all other kernel 
contributors have been following for years.

	Ingo

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] [6/8] Account overlapped mappings in end_pfn_map
  2008-02-11 13:27     ` Andi Kleen
  2008-02-11 13:55       ` Ingo Molnar
@ 2008-02-11 14:16       ` Peter Zijlstra
  2008-02-11 14:24         ` Andi Kleen
  1 sibling, 1 reply; 37+ messages in thread
From: Peter Zijlstra @ 2008-02-11 14:16 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Ingo Molnar, ying.huang, tglx, linux-kernel


On Mon, 2008-02-11 at 14:27 +0100, Andi Kleen wrote:

> Ok patch with hungarized variables appended.

> -static void __meminit
> +static unsigned long __meminit
>  phys_pmd_update(pud_t *pud, unsigned long address, unsigned long end)
>  {
> +	unsigned long true_end;
>  	pmd_t *pmd = pmd_offset(pud, 0);
>  	spin_lock(&init_mm.page_table_lock);
> -	phys_pmd_init(pmd, address, end);
> +	true_end = phys_pmd_init(pmd, address, end);
>  	spin_unlock(&init_mm.page_table_lock);
>  	__flush_tlb_all();
> +	return true_end;
>  }

Just for the record, Hungarian notation would have it like:

  ulTrueEnd

http://en.wikipedia.org/wiki/Hungarian_notation

And the kernel doesn't do that, to wit (from Documentation/CodingStyle):

 Linus Torvalds (against systems Hungarian): Encoding the type of a
function into the name (so-called Hungarian notation) is brain damaged -
the compiler knows the types anyway and can check those, and it only
confuses the programmer.




^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] [6/8] Account overlapped mappings in end_pfn_map
  2008-02-11 14:16       ` Peter Zijlstra
@ 2008-02-11 14:24         ` Andi Kleen
  2008-02-11 14:41           ` Sam Ravnborg
  0 siblings, 1 reply; 37+ messages in thread
From: Andi Kleen @ 2008-02-11 14:24 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Ingo Molnar, ying.huang, tglx, linux-kernel

On Monday 11 February 2008 15:16:31 Peter Zijlstra wrote:
> 
> On Mon, 2008-02-11 at 14:27 +0100, Andi Kleen wrote:
> 
> > Ok patch with hungarized variables appended.
> 
> > -static void __meminit
> > +static unsigned long __meminit
> >  phys_pmd_update(pud_t *pud, unsigned long address, unsigned long end)
> >  {
> > +	unsigned long true_end;
> >  	pmd_t *pmd = pmd_offset(pud, 0);
> >  	spin_lock(&init_mm.page_table_lock);
> > -	phys_pmd_init(pmd, address, end);
> > +	true_end = phys_pmd_init(pmd, address, end);
> >  	spin_unlock(&init_mm.page_table_lock);
> >  	__flush_tlb_all();
> > +	return true_end;
> >  }
> 
> Just for the record, Hungarian notation would have it like:
> 
>   ulTrueEnd

_pfn is variant of hungarian notation (just postfix instead of prefix); 
that is why I referred to it. 

Admittedly it was a little unfair pun with Ingo, but he really 
asked for it  in this case :-)


> http://en.wikipedia.org/wiki/Hungarian_notation
> 
> And the kernel doesn't do that, to wit (from Documentation/CodingStyle):
> 
>  Linus Torvalds (against systems Hungarian): Encoding the type of a
> function into the name (so-called Hungarian notation) is brain damaged -
> the compiler knows the types anyway and can check those, and it only
> confuses the programmer.

xxx_pfn is exactly that. 

BTW Coding style also recommends to use short variable names inside 
functions. Ingo asked me actually to violate that:

"
LOCAL variable names should be short, and to the point.  If you have
some random integer loop counter, it should probably be called "i".
Calling it "loop_counter" is non-productive, if there is no chance of it
being mis-understood.  Similarly, "tmp" can be just about any type of
variable that is used to hold a temporary value.
"

I used r for result in this case which is 100% conform to coding style.

-Andi (trying to exit this thread) 

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] [6/8] Account overlapped mappings in end_pfn_map
  2008-02-11 14:24         ` Andi Kleen
@ 2008-02-11 14:41           ` Sam Ravnborg
  0 siblings, 0 replies; 37+ messages in thread
From: Sam Ravnborg @ 2008-02-11 14:41 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Peter Zijlstra, Ingo Molnar, ying.huang, tglx, linux-kernel

> 
> _pfn is variant of hungarian notation (just postfix instead of prefix); 
> that is why I referred to it. 
> 
> Admittedly it was a little unfair pun with Ingo, but he really 
> asked for it  in this case :-)

I thought it was a good joke! It made me smile.

	Sam

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] [6/8] Account overlapped mappings in end_pfn_map
  2008-02-11  9:34 ` [PATCH] [6/8] Account overlapped mappings in end_pfn_map Andi Kleen
  2008-02-11 13:08   ` Ingo Molnar
@ 2008-02-11 15:12   ` Arjan van de Ven
  1 sibling, 0 replies; 37+ messages in thread
From: Arjan van de Ven @ 2008-02-11 15:12 UTC (permalink / raw)
  To: Andi Kleen; +Cc: ying.huang, tglx, mingo, linux-kernel

On Mon, 11 Feb 2008 10:34:34 +0100 (CET)
Andi Kleen <ak@suse.de> wrote:

> 
> When end_pfn is not aligned to 2MB (or 1GB) then the kernel might
> map more memory than end_pfn. Account this in end_pfn_mapped.

we need to fix this btw; mapping memory that doesn't really exist
is a rather nasty trap with cpu prefetching and the like.


-- 
If you want to reach me at my work email, use arjan@linux.intel.com
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] [5/8] Fix logic error in 64bit memory hotadd
  2008-02-11 13:05     ` Andi Kleen
  2008-02-11 13:33       ` Ingo Molnar
@ 2008-02-12 10:35       ` Yasunori Goto
  2008-02-12 11:20         ` Andi Kleen
  1 sibling, 1 reply; 37+ messages in thread
From: Yasunori Goto @ 2008-02-12 10:35 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: tglx, linux-kernel, Andi Kleen

Hi Ingo-san.

> > Does anyone even  
> > use memory hotplug currently? 
> 
> I don't know.

IBM's powerpc box can memory hot-add/remove by dynamic partitioning.
And our fujitsu server has memory hot-add feature (Ia-64).
So, they are concrete user of memory hotplug.

In x86, E8500 chipset has the feature of memory-hotplug.
(I searched a data-sheet from intel site.)
http://download.intel.com/design/chipsets/e8500/datashts/30674501.pdf
(6.3.8 IMI Hot-Plug)

So, it depends on how many server uses it, I think.

Thanks.

-- 
Yasunori Goto 



^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] [5/8] Fix logic error in 64bit memory hotadd
  2008-02-12 10:35       ` Yasunori Goto
@ 2008-02-12 11:20         ` Andi Kleen
  0 siblings, 0 replies; 37+ messages in thread
From: Andi Kleen @ 2008-02-12 11:20 UTC (permalink / raw)
  To: Yasunori Goto; +Cc: Ingo Molnar, tglx, linux-kernel

On Tuesday 12 February 2008 11:35:22 Yasunori Goto wrote:
> Hi Ingo-san.
> 
> > > Does anyone even  
> > > use memory hotplug currently? 
> > 
> > I don't know.
> 
> IBM's powerpc box can memory hot-add/remove by dynamic partitioning.
> And our fujitsu server has memory hot-add feature (Ia-64).
> So, they are concrete user of memory hotplug.
> 
> In x86, E8500 chipset has the feature of memory-hotplug.
> (I searched a data-sheet from intel site.)
> http://download.intel.com/design/chipsets/e8500/datashts/30674501.pdf
> (6.3.8 IMI Hot-Plug)
> 
> So, it depends on how many server uses it, I think.

With the logic pud error i found and that seems currently will stay
for the forseeable future in the tree I don't think x86-64 64bit hotadds for more
than 1GB have ever worked with the sparsemem method.

Older trees (before 2.6.24) had also a different method that did
preallocate everything based on the SRAT. That one should have worked 
for this case.

-Andi

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] [7/8] Implement true end_pfn_mapped for 32bit
  2008-02-11  9:34 ` [PATCH] [7/8] Implement true end_pfn_mapped for 32bit Andi Kleen
@ 2008-02-12 19:39   ` Thomas Gleixner
  2008-02-12 19:49     ` Andi Kleen
  0 siblings, 1 reply; 37+ messages in thread
From: Thomas Gleixner @ 2008-02-12 19:39 UTC (permalink / raw)
  To: Andi Kleen; +Cc: ying.huang, mingo, linux-kernel

On Mon, 11 Feb 2008, Andi Kleen wrote:
> Even on 32bit 2MB pages can map more memory than is in the true
> max_low_pfn if end_pfn is not highmem and not aligned to 2MB. 
> Add a end_pfn_map similar to x86-64 that accounts for this 
> fact. This is important for code that really needs to know about
> all mapping aliases. Needed for followup patches (in this case EFI)

It's not only necessary for followup patches, it is a question of
general correctness.

> @@ -36,7 +36,7 @@
>  #define max_pfn_mapped		end_pfn_map
>  #else
>  #include <asm/page_32.h>
> -#define max_pfn_mapped		max_low_pfn
> +#define max_pfn_mapped		end_pfn_map

We can nuke either max_pfn_mapped or end_pfn_map completely. I don't
care about which one, but keeping both makes no sense at all.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] [7/8] Implement true end_pfn_mapped for 32bit
  2008-02-12 19:39   ` Thomas Gleixner
@ 2008-02-12 19:49     ` Andi Kleen
  2008-02-12 20:25       ` Thomas Gleixner
  0 siblings, 1 reply; 37+ messages in thread
From: Andi Kleen @ 2008-02-12 19:49 UTC (permalink / raw)
  To: Thomas Gleixner; +Cc: ying.huang, mingo, linux-kernel

On Tuesday 12 February 2008 20:39:48 Thomas Gleixner wrote:
> On Mon, 11 Feb 2008, Andi Kleen wrote:
> > Even on 32bit 2MB pages can map more memory than is in the true
> > max_low_pfn if end_pfn is not highmem and not aligned to 2MB. 
> > Add a end_pfn_map similar to x86-64 that accounts for this 
> > fact. This is important for code that really needs to know about
> > all mapping aliases. Needed for followup patches (in this case EFI)
> 
> It's not only necessary for followup patches, it is a question of
> general correctness.

True. ioremap already requires it.

> > @@ -36,7 +36,7 @@
> >  #define max_pfn_mapped		end_pfn_map
> >  #else
> >  #include <asm/page_32.h>
> > -#define max_pfn_mapped		max_low_pfn
> > +#define max_pfn_mapped		end_pfn_map
> 
> We can nuke either max_pfn_mapped or end_pfn_map completely. I don't
> care about which one, but keeping both makes no sense at all.

I didn't want to bundle such a clean up into the bug fix
because my experience is that you usually reject that categorically.

I can send the removal of max_pfn_mapped as a follow up patch
if you apply this one.

-Andi


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] [8/8] RFC: Fix some EFI problems
  2008-02-11  9:34 ` [PATCH] [8/8] RFC: Fix some EFI problems Andi Kleen
@ 2008-02-12 20:04   ` Thomas Gleixner
  2008-02-12 20:23     ` Andi Kleen
  0 siblings, 1 reply; 37+ messages in thread
From: Thomas Gleixner @ 2008-02-12 20:04 UTC (permalink / raw)
  To: Andi Kleen; +Cc: ying.huang, mingo, linux-kernel

On Mon, 11 Feb 2008, Andi Kleen wrote:

> >From code review the EFI memory map handling has a couple of problems:
> 
> - The test for _WB memory was reversed so it would set cache able memory
> to uncached
> - It would always set a wrong uninitialized zero address to uncached
> (so I suspect it always set the first few pages in phys memory to uncached,
> that is why it may have gone unnoticed) 
> - It would call set_memory_x() on a fixmap address that it doesn't
> handle correct.
> - Some other problems I commented in the code (but was unable to solve
> for now) 
> 
> I changed the ioremaps to set the correct caching attributes
> and also corrected the ordering so it looks roughly correct now.

The only effective change is:

-               if (md->attribute & EFI_MEMORY_WB)
+               if (!(md->attribute & EFI_MEMORY_WB))

I appreciate that you noticed the reverse logic, which I messed up
when I fixed up rejects.

I pulled this out as it is a real fix. The rest of this patch is just
turning code in circles for nothing, simply because it is functionally
completely irrelevant whether does simply:

        if ((end >> PAGE_SHIFT) <= max_pfn_mapped)
                va = __va(md->phys_addr);
        else
                va = efi_ioremap(md->phys_addr, size);

       if (!(md->attribute & EFI_MEMORY_WB))
                set_memory_uc(md->virt_addr, size);
or

       if ((end >> PAGE_SHIFT) <= max_pfn_mapped) {
                va = __va(md->phys_addr);

                if (!(md->attribute & EFI_MEMORY_WB))
                        set_memory_uc(md->virt_addr, size);
       } else
                va = efi_ioremap(md->phys_addr, size,
                                 !!(md->attribute & EFI_MEMORY_WB));

And you just copied the real bug in that logic as well:

          set_memory_uc(md->virt_addr, size);
------------------------^^^^^^^^

which is initialized a couple of lines down.

	md->virt_addr = (u64) (unsigned long) va;

The reordering/optimizing needs to be a separate patch.

Please keep bugfixes and other changes separate.
 
> +		/* RED-PEN does not handle overlapped areas */

Can you please use CHECKME/FIXME which is used everywhere else. No need to
invent an extra marker.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] [8/8] RFC: Fix some EFI problems
  2008-02-12 20:04   ` Thomas Gleixner
@ 2008-02-12 20:23     ` Andi Kleen
  2008-02-12 20:48       ` Thomas Gleixner
  0 siblings, 1 reply; 37+ messages in thread
From: Andi Kleen @ 2008-02-12 20:23 UTC (permalink / raw)
  To: Thomas Gleixner; +Cc: ying.huang, mingo, linux-kernel

On Tuesday 12 February 2008 21:04:06 Thomas Gleixner wrote:

> 
> And you just copied the real bug in that logic as well:
> 
>           set_memory_uc(md->virt_addr, size);

Oops you're right. I wanted to fix that, but didn't. Ok I'll put up
my brown paper back tonight when I go out.
 
> ------------------------^^^^^^^^
> 
> which is initialized a couple of lines down.
> 
> 	md->virt_addr = (u64) (unsigned long) va;
> 
> The reordering/optimizing needs to be a separate patch.

What optimizing? It wasn't intended to be an optimization.
It fixes a bug.

Not doing set_memory_uc on efi_ioremap output is needed because 
set_memory_uc doesn't work on fixmap which is what efi_ioremap
returns. 

(see previous mails on that topic -- i fixed the 'x' case,
but fixing "uc" is too hard imho) 

So I fixed efi_ioremap instead to set the correct caching
mode directly. That is ok because there can be no overlap 
with the direct mapping, so no aliases to fix up.


> Please keep bugfixes and other changes separate.
>  
> > +		/* RED-PEN does not handle overlapped areas */
> 
> Can you please use CHECKME/FIXME which is used everywhere else. No need to
> invent an extra marker.

I've always used RED-PEN

% grep -r RED-PEN arch/x86/* | wc -l
12
%

It comes originally from network code I hacked a long time ago, although
most of those got lost over time (only 2 left, sniff) 

Sorry I don't want to change this now and I doubt that will really cause
a problem for anybody.

I'll send an updated patch with the va thing fixed.

-Andi


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] [7/8] Implement true end_pfn_mapped for 32bit
  2008-02-12 19:49     ` Andi Kleen
@ 2008-02-12 20:25       ` Thomas Gleixner
  0 siblings, 0 replies; 37+ messages in thread
From: Thomas Gleixner @ 2008-02-12 20:25 UTC (permalink / raw)
  To: Andi Kleen; +Cc: ying.huang, mingo, linux-kernel

On Tue, 12 Feb 2008, Andi Kleen wrote:
> > > @@ -36,7 +36,7 @@
> > >  #define max_pfn_mapped		end_pfn_map
> > >  #else
> > >  #include <asm/page_32.h>
> > > -#define max_pfn_mapped		max_low_pfn
> > > +#define max_pfn_mapped		end_pfn_map
> > 
> > We can nuke either max_pfn_mapped or end_pfn_map completely. I don't
> > care about which one, but keeping both makes no sense at all.
> 
> I didn't want to bundle such a clean up into the bug fix
> because my experience is that you usually reject that categorically.

True, but I apply two separate patches.

> I can send the removal of max_pfn_mapped as a follow up patch
> if you apply this one.

Too late. It was so obvious that it screamed to be fixed. I applied
the inital patch already and cleaned it up myself. :)

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] [8/8] RFC: Fix some EFI problems
  2008-02-12 20:23     ` Andi Kleen
@ 2008-02-12 20:48       ` Thomas Gleixner
  2008-02-13 11:05         ` Andi Kleen
  0 siblings, 1 reply; 37+ messages in thread
From: Thomas Gleixner @ 2008-02-12 20:48 UTC (permalink / raw)
  To: Andi Kleen; +Cc: ying.huang, mingo, linux-kernel

On Tue, 12 Feb 2008, Andi Kleen wrote:

> On Tuesday 12 February 2008 21:04:06 Thomas Gleixner wrote:
> 
> > 
> > And you just copied the real bug in that logic as well:
> > 
> >           set_memory_uc(md->virt_addr, size);
> 
> Oops you're right. I wanted to fix that, but didn't. Ok I'll put up
> my brown paper back tonight when I go out.
>  
> > ------------------------^^^^^^^^
> > 
> > which is initialized a couple of lines down.
> > 
> > 	md->virt_addr = (u64) (unsigned long) va;
> > 
> > The reordering/optimizing needs to be a separate patch.
> 
> What optimizing? It wasn't intended to be an optimization.
> It fixes a bug.

No, it does not. Please go back and read my mail.
 
The code had exactly two bugs:

1) the logic of checking EFI_MEMORY_WB was wrong
2) the uninitialized variable

The fix is:

 arch/x86/kernel/efi.c |    6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

Index: linux-2.6/arch/x86/kernel/efi.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/efi.c
+++ linux-2.6/arch/x86/kernel/efi.c
@@ -428,9 +428,6 @@ void __init efi_enter_virtual_mode(void)
 		else
 			va = efi_ioremap(md->phys_addr, size);
 
-		if (md->attribute & EFI_MEMORY_WB)
-			set_memory_uc(md->virt_addr, size);
-
 		md->virt_addr = (u64) (unsigned long) va;
 
 		if (!va) {
@@ -439,6 +436,9 @@ void __init efi_enter_virtual_mode(void)
 			continue;
 		}
 
+		if (!(md->attribute & EFI_MEMORY_WB))
+			set_memory_uc(md->virt_addr, size);
+
 		systab = (u64) (unsigned long) efi_phys.systab;
 		if (md->phys_addr <= systab && systab < end) {
 			systab += md->virt_addr - md->phys_addr;

The reordering of code is completely irrelevant. It can be done, but
in a separate patch.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] [8/8] RFC: Fix some EFI problems
  2008-02-12 20:48       ` Thomas Gleixner
@ 2008-02-13 11:05         ` Andi Kleen
  0 siblings, 0 replies; 37+ messages in thread
From: Andi Kleen @ 2008-02-13 11:05 UTC (permalink / raw)
  To: Thomas Gleixner; +Cc: ying.huang, mingo, linux-kernel

Thomas Gleixner wrote:
> On Tue, 12 Feb 2008, Andi Kleen wrote:
> 
>> On Tuesday 12 February 2008 21:04:06 Thomas Gleixner wrote:
>>
>>> And you just copied the real bug in that logic as well:
>>>
>>>           set_memory_uc(md->virt_addr, size);
>> Oops you're right. I wanted to fix that, but didn't. Ok I'll put up
>> my brown paper back tonight when I go out.
>>  
>>> ------------------------^^^^^^^^
>>>
>>> which is initialized a couple of lines down.
>>>
>>> 	md->virt_addr = (u64) (unsigned long) va;
>>>
>>> The reordering/optimizing needs to be a separate patch.
>> What optimizing? It wasn't intended to be an optimization.
>> It fixes a bug.
> 
> No, it does not. Please go back and read my mail.

I describe the problem again:

- efi_ioremap on 64bit returns a fixmap address:
void __iomem * __init efi_ioremap(unsigned long phys_addr, unsigned long
size)
{
       ...
        return (void __iomem *)__fix_to_virt(FIX_EFI_IO_MAP_FIRST_PAGE -
                                             (pages_mapped - pages));

}
- __fix_to_virt is:
 (FIXADDR_TOP - ((x) << PAGE_SHIFT)) and x is a small integer <30 or so.
- Fixmap is
#define VSYSCALL_END (-2UL << 20)
#define FIXADDR_TOP (VSYSCALL_END-PAGE_SIZE)
that gives e.g. 0xffffffdfffff for the top fixmap; the efi fixmap
is only slightly pages below.
- You pass that into set_memory_uc()
- That eventually calls __pa() on it several times
(in static_protections and in change_page_attr_addr for 64bit to
check for the kernel mapping)
- __pa calls __phys_addr which does
unsigned long __phys_addr(unsigned long x)
{
        if (x >= __START_KERNEL_map)
                return x - __START_KERNEL_map + phys_base;
        return x - PAGE_OFFSET;
}
- Now __START_KERNEL_map is 0xffffffff80000000.
- That ends up with

x = 0xffffffdfffff - smallnumber*PAGE_SIZE

if (x >= 0xffffffff80000000)    (evaluates to true)
	return x - 0xffffffff80000000 + phys_addr
- This ends up with some fictional number in cpa (but likely one
looking like a valid pa address) that has nothing
to do with the address that is mapped below the fixmap
- cpa() does weird things to random unrelated memory then or
might clear rw if you're really unlucky.
- I think on 32bit with a real ioremap it's also not completely kosher
with the right __PAGE_OFFSET (but I have not double checked
that step by step)

This is why I avoided calling set_memory_uc for the fixmap
and instead changed the code to set the correct PAT
attribute into the fixmap directly to avoid this.

I believe the full original change or some Thomasized variant of it
is still needed.

-Andi

^ permalink raw reply	[flat|nested] 37+ messages in thread

end of thread, other threads:[~2008-02-13 11:04 UTC | newest]

Thread overview: 37+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2008-02-11  9:34 [PATCH] [0/8] Various kernel mapping bug fixes Andi Kleen
2008-02-11  9:34 ` [PATCH] [1/8] CPA: Fix gbpages support in try_preserve_lage_page Andi Kleen
2008-02-11  9:45   ` Thomas Gleixner
2008-02-11 10:12     ` Ingo Molnar
2008-02-11 11:01       ` Andi Kleen
2008-02-11  9:34 ` [PATCH] [2/8] CPA: Flush the caches when setting pages not present Andi Kleen
2008-02-11 11:00   ` Ingo Molnar
2008-02-11 12:26     ` Andi Kleen
2008-02-11  9:34 ` [PATCH] [3/8] CPA: Test the correct mapping alias on x86-64 Andi Kleen
2008-02-11 11:49   ` Ingo Molnar
2008-02-11  9:34 ` [PATCH] [4/8] CPA: Fix set_memory_x for ioremap Andi Kleen
2008-02-11 12:27   ` Ingo Molnar
2008-02-11 12:45     ` Andi Kleen
2008-02-11  9:34 ` [PATCH] [5/8] Fix logic error in 64bit memory hotadd Andi Kleen
2008-02-11 12:46   ` Ingo Molnar
2008-02-11 13:05     ` Andi Kleen
2008-02-11 13:33       ` Ingo Molnar
2008-02-11 13:44         ` Andi Kleen
2008-02-12 10:35       ` Yasunori Goto
2008-02-12 11:20         ` Andi Kleen
2008-02-11  9:34 ` [PATCH] [6/8] Account overlapped mappings in end_pfn_map Andi Kleen
2008-02-11 13:08   ` Ingo Molnar
2008-02-11 13:27     ` Andi Kleen
2008-02-11 13:55       ` Ingo Molnar
2008-02-11 14:16       ` Peter Zijlstra
2008-02-11 14:24         ` Andi Kleen
2008-02-11 14:41           ` Sam Ravnborg
2008-02-11 15:12   ` Arjan van de Ven
2008-02-11  9:34 ` [PATCH] [7/8] Implement true end_pfn_mapped for 32bit Andi Kleen
2008-02-12 19:39   ` Thomas Gleixner
2008-02-12 19:49     ` Andi Kleen
2008-02-12 20:25       ` Thomas Gleixner
2008-02-11  9:34 ` [PATCH] [8/8] RFC: Fix some EFI problems Andi Kleen
2008-02-12 20:04   ` Thomas Gleixner
2008-02-12 20:23     ` Andi Kleen
2008-02-12 20:48       ` Thomas Gleixner
2008-02-13 11:05         ` Andi Kleen

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).