linux-kernel.vger.kernel.org archive mirror
* [PATCH CPA] [1/28] Shrink __PAGE_KERNEL/__PAGE_KERNEL_EXEC on non PAE kernels
@ 2008-01-03 15:24 Andi Kleen
  2008-01-03 15:24 ` [PATCH CPA] [2/28] CPA: Do a simple self test at boot Andi Kleen
                   ` (27 more replies)
  0 siblings, 28 replies; 51+ messages in thread
From: Andi Kleen @ 2008-01-03 15:24 UTC (permalink / raw)
  To: linux-kernel


There is no need to make them 64-bit on non-PAE kernels.
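
As an illustrative sketch only (not part of the patch): typeof_field()
just picks up the declared type of a struct member, so the variables
simply follow whatever width pgprot_t has on a given configuration:

	#define typeof_field(t, m) typeof(((t *)0)->m)

	/* non-PAE i386: pgprot_t.pgprot is unsigned long      -> 32-bit variable
	 * PAE:          pgprot_t.pgprot is unsigned long long -> stays 64-bit   */
	typeof_field(pgprot_t, pgprot) __PAGE_KERNEL;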

Signed-off-by: Andi Kleen <ak@suse.de>

---
 arch/x86/mm/init_32.c        |    4 ++--
 include/asm-x86/pgtable_32.h |    3 ++-
 2 files changed, 4 insertions(+), 3 deletions(-)

Index: linux/arch/x86/mm/init_32.c
===================================================================
--- linux.orig/arch/x86/mm/init_32.c
+++ linux/arch/x86/mm/init_32.c
@@ -326,9 +326,9 @@ static void __init set_highmem_pages_ini
 #define set_highmem_pages_init(bad_ppro) do { } while (0)
 #endif /* CONFIG_HIGHMEM */
 
-unsigned long long __PAGE_KERNEL = _PAGE_KERNEL;
+typeof_field(pgprot_t, pgprot) __PAGE_KERNEL = _PAGE_KERNEL;
 EXPORT_SYMBOL(__PAGE_KERNEL);
-unsigned long long __PAGE_KERNEL_EXEC = _PAGE_KERNEL_EXEC;
+typeof_field(pgprot_t, pgprot) __PAGE_KERNEL_EXEC = _PAGE_KERNEL_EXEC;
 
 #ifdef CONFIG_NUMA
 extern void __init remap_numa_kva(void);
Index: linux/include/asm-x86/pgtable_32.h
===================================================================
--- linux.orig/include/asm-x86/pgtable_32.h
+++ linux/include/asm-x86/pgtable_32.h
@@ -153,7 +153,8 @@ void paging_init(void);
 #define _PAGE_KERNEL_EXEC \
 	(_PAGE_PRESENT | _PAGE_RW | _PAGE_DIRTY | _PAGE_ACCESSED)
 
-extern unsigned long long __PAGE_KERNEL, __PAGE_KERNEL_EXEC;
+#define typeof_field(t, m) typeof(((t *)0)->m)
+extern typeof_field(pgprot_t, pgprot) __PAGE_KERNEL, __PAGE_KERNEL_EXEC;
 #define __PAGE_KERNEL_RO		(__PAGE_KERNEL & ~_PAGE_RW)
 #define __PAGE_KERNEL_RX		(__PAGE_KERNEL_EXEC & ~_PAGE_RW)
 #define __PAGE_KERNEL_NOCACHE		(__PAGE_KERNEL | _PAGE_PCD | _PAGE_PWT)


* [PATCH CPA] [2/28] CPA: Do a simple self test at boot
  2008-01-03 15:24 [PATCH CPA] [1/28] Shrink __PAGE_KERNEL/__PAGE_KERNEL_EXEC on non PAE kernels Andi Kleen
@ 2008-01-03 15:24 ` Andi Kleen
  2008-01-03 15:24 ` [PATCH CPA] [3/28] Add pte accessors for the global bit Andi Kleen
                   ` (26 subsequent siblings)
  27 siblings, 0 replies; 51+ messages in thread
From: Andi Kleen @ 2008-01-03 15:24 UTC (permalink / raw)
  To: linux-kernel


When CONFIG_DEBUG_RODATA is enabled, undo the read-only mapping and then redo it.
This gives some simple test coverage for change_page_attr().

Optional patch, but I find it useful.

Signed-off-by: Andi Kleen <ak@suse.de>

---
 arch/x86/Kconfig.debug |    5 +++++
 arch/x86/mm/init_32.c  |   22 ++++++++++++++++++++++
 arch/x86/mm/init_64.c  |   10 ++++++++++
 3 files changed, 37 insertions(+)

Index: linux/arch/x86/mm/init_64.c
===================================================================
--- linux.orig/arch/x86/mm/init_64.c
+++ linux/arch/x86/mm/init_64.c
@@ -596,6 +596,16 @@ void mark_rodata_ro(void)
 	 * of who is the culprit.
 	 */
 	global_flush_tlb();
+
+#ifdef CONFIG_CPA_DEBUG
+	printk("Testing CPA: undo %lx-%lx\n", start, end);
+	change_page_attr_addr(start, (end - start) >> PAGE_SHIFT, PAGE_KERNEL);
+	global_flush_tlb();
+
+	printk("Testing CPA: again\n");
+	change_page_attr_addr(start, (end - start) >> PAGE_SHIFT, PAGE_KERNEL_RO);
+	global_flush_tlb();
+#endif
 }
 #endif
 
Index: linux/arch/x86/mm/init_32.c
===================================================================
--- linux.orig/arch/x86/mm/init_32.c
+++ linux/arch/x86/mm/init_32.c
@@ -792,6 +792,18 @@ void mark_rodata_ro(void)
 		change_page_attr(virt_to_page(start),
 		                 size >> PAGE_SHIFT, PAGE_KERNEL_RX);
 		printk("Write protecting the kernel text: %luk\n", size >> 10);
+
+#ifdef CONFIG_CPA_DEBUG
+		global_flush_tlb();
+
+		printk("Testing CPA: undo %lx-%lx\n", start, start+size);
+		change_page_attr(virt_to_page(start), size>>PAGE_SHIFT, PAGE_KERNEL_EXEC);
+		global_flush_tlb();
+
+		printk("Testing CPA: write protecting again\n");
+		change_page_attr(virt_to_page(start),size>>PAGE_SHIFT,PAGE_KERNEL_RX);
+		global_flush_tlb();
+#endif
 	}
 #endif
 	start += size;
@@ -808,6 +820,16 @@ void mark_rodata_ro(void)
 	 * of who is the culprit.
 	 */
 	global_flush_tlb();
+
+#ifdef CONFIG_CPA_DEBUG
+	printk("Testing CPA: undo %lx-%lx\n", start, start + size);
+	change_page_attr(virt_to_page(start), size >> PAGE_SHIFT, PAGE_KERNEL);
+	global_flush_tlb();
+
+	printk("Testing CPA: write protecting again\n");
+	change_page_attr(virt_to_page(start), size >> PAGE_SHIFT, PAGE_KERNEL_RO);
+	global_flush_tlb();
+#endif
 }
 #endif
 
Index: linux/arch/x86/Kconfig.debug
===================================================================
--- linux.orig/arch/x86/Kconfig.debug
+++ linux/arch/x86/Kconfig.debug
@@ -185,4 +185,9 @@ config DEFAULT_IO_DELAY_TYPE
 	default IO_DELAY_TYPE_NONE
 endif
 
+config CPA_DEBUG
+	bool "CPA self test code"
+	help
+	  Do change_page_attr self tests at boot.
+
 endmenu


* [PATCH CPA] [3/28] Add pte accessors for the global bit
  2008-01-03 15:24 [PATCH CPA] [1/28] Shrink __PAGE_KERNEL/__PAGE_KERNEL_EXEC on non PAE kernels Andi Kleen
  2008-01-03 15:24 ` [PATCH CPA] [2/28] CPA: Do a simple self test at boot Andi Kleen
@ 2008-01-03 15:24 ` Andi Kleen
  2008-01-03 15:24 ` [PATCH CPA] [4/28] Add pte_clrhuge on i386 Andi Kleen
                   ` (25 subsequent siblings)
  27 siblings, 0 replies; 51+ messages in thread
From: Andi Kleen @ 2008-01-03 15:24 UTC (permalink / raw)
  To: linux-kernel


Needed for some test code.
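
As a rough usage sketch (ptep here is just an assumed pointer to a
kernel pte, not something this patch adds):

	pte_t pte = *ptep;

	if (pte_global(pte))
		pte = pte_clrglobal(pte);	/* clear _PAGE_GLOBAL */
	else
		pte = pte_mkglobal(pte);	/* set _PAGE_GLOBAL */
	set_pte(ptep, pte);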

Signed-off-by: Andi Kleen <ak@suse.de>

---
 include/asm-x86/pgtable_32.h |    3 +++
 include/asm-x86/pgtable_64.h |    3 +++
 2 files changed, 6 insertions(+)

Index: linux/include/asm-x86/pgtable_64.h
===================================================================
--- linux.orig/include/asm-x86/pgtable_64.h
+++ linux/include/asm-x86/pgtable_64.h
@@ -275,6 +275,7 @@ static inline int pte_young(pte_t pte)		
 static inline int pte_write(pte_t pte)		{ return pte_val(pte) & _PAGE_RW; }
 static inline int pte_file(pte_t pte)		{ return pte_val(pte) & _PAGE_FILE; }
 static inline int pte_huge(pte_t pte)		{ return pte_val(pte) & _PAGE_PSE; }
+static inline int pte_global(pte_t pte)		{ return pte_val(pte) & _PAGE_GLOBAL; }
 
 static inline pte_t pte_mkclean(pte_t pte)	{ set_pte(&pte, __pte(pte_val(pte) & ~_PAGE_DIRTY)); return pte; }
 static inline pte_t pte_mkold(pte_t pte)	{ set_pte(&pte, __pte(pte_val(pte) & ~_PAGE_ACCESSED)); return pte; }
@@ -285,6 +286,8 @@ static inline pte_t pte_mkyoung(pte_t pt
 static inline pte_t pte_mkwrite(pte_t pte)	{ set_pte(&pte, __pte(pte_val(pte) | _PAGE_RW)); return pte; }
 static inline pte_t pte_mkhuge(pte_t pte)	{ set_pte(&pte, __pte(pte_val(pte) | _PAGE_PSE)); return pte; }
 static inline pte_t pte_clrhuge(pte_t pte)	{ set_pte(&pte, __pte(pte_val(pte) & ~_PAGE_PSE)); return pte; }
+static inline pte_t pte_mkglobal(pte_t pte)	{ set_pte(&pte, __pte(pte_val(pte) | _PAGE_GLOBAL)); return pte; }
+static inline pte_t pte_clrglobal(pte_t pte)	{ set_pte(&pte, __pte(pte_val(pte) & ~_PAGE_GLOBAL)); return pte; }
 
 struct vm_area_struct;
 
Index: linux/include/asm-x86/pgtable_32.h
===================================================================
--- linux.orig/include/asm-x86/pgtable_32.h
+++ linux/include/asm-x86/pgtable_32.h
@@ -220,6 +220,7 @@ static inline int pte_dirty(pte_t pte)		
 static inline int pte_young(pte_t pte)		{ return (pte).pte_low & _PAGE_ACCESSED; }
 static inline int pte_write(pte_t pte)		{ return (pte).pte_low & _PAGE_RW; }
 static inline int pte_huge(pte_t pte)		{ return (pte).pte_low & _PAGE_PSE; }
+static inline int pte_global(pte_t pte)		{ return (pte).pte_low & _PAGE_GLOBAL; }
 
 /*
  * The following only works if pte_present() is not true.
@@ -233,6 +234,8 @@ static inline pte_t pte_mkdirty(pte_t pt
 static inline pte_t pte_mkyoung(pte_t pte)	{ (pte).pte_low |= _PAGE_ACCESSED; return pte; }
 static inline pte_t pte_mkwrite(pte_t pte)	{ (pte).pte_low |= _PAGE_RW; return pte; }
 static inline pte_t pte_mkhuge(pte_t pte)	{ (pte).pte_low |= _PAGE_PSE; return pte; }
+static inline pte_t pte_mkglobal(pte_t pte)	{ (pte).pte_low |= _PAGE_GLOBAL; return pte; }
+static inline pte_t pte_clrglobal(pte_t pte)	{ (pte).pte_low &= ~_PAGE_GLOBAL; return pte; }
 
 #ifdef CONFIG_X86_PAE
 # include <asm/pgtable-3level.h>


* [PATCH CPA] [4/28] Add pte_clrhuge on i386
  2008-01-03 15:24 [PATCH CPA] [1/28] Shrink __PAGE_KERNEL/__PAGE_KERNEL_EXEC on non PAE kernels Andi Kleen
  2008-01-03 15:24 ` [PATCH CPA] [2/28] CPA: Do a simple self test at boot Andi Kleen
  2008-01-03 15:24 ` [PATCH CPA] [3/28] Add pte accessors for the global bit Andi Kleen
@ 2008-01-03 15:24 ` Andi Kleen
  2008-01-05  6:48   ` Jeremy Fitzhardinge
  2008-01-03 15:24 ` [PATCH CPA] [5/28] Add pte_pgprot " Andi Kleen
                   ` (24 subsequent siblings)
  27 siblings, 1 reply; 51+ messages in thread
From: Andi Kleen @ 2008-01-03 15:24 UTC (permalink / raw)
  To: linux-kernel


64-bit already had it.

Signed-off-by: Andi Kleen <ak@suse.de>

---
 include/asm-x86/pgtable_32.h |    1 +
 1 file changed, 1 insertion(+)

Index: linux/include/asm-x86/pgtable_32.h
===================================================================
--- linux.orig/include/asm-x86/pgtable_32.h
+++ linux/include/asm-x86/pgtable_32.h
@@ -234,6 +234,7 @@ static inline pte_t pte_mkdirty(pte_t pt
 static inline pte_t pte_mkyoung(pte_t pte)	{ (pte).pte_low |= _PAGE_ACCESSED; return pte; }
 static inline pte_t pte_mkwrite(pte_t pte)	{ (pte).pte_low |= _PAGE_RW; return pte; }
 static inline pte_t pte_mkhuge(pte_t pte)	{ (pte).pte_low |= _PAGE_PSE; return pte; }
+static inline pte_t pte_clrhuge(pte_t pte)	{ (pte).pte_low &= ~_PAGE_PSE; return pte; }
 static inline pte_t pte_mkglobal(pte_t pte)	{ (pte).pte_low |= _PAGE_GLOBAL; return pte; }
 static inline pte_t pte_clrglobal(pte_t pte)	{ (pte).pte_low &= ~_PAGE_GLOBAL; return pte; }
 


* [PATCH CPA] [5/28] Add pte_pgprot on i386
  2008-01-03 15:24 [PATCH CPA] [1/28] Shrink __PAGE_KERNEL/__PAGE_KERNEL_EXEC on non PAE kernels Andi Kleen
                   ` (2 preceding siblings ...)
  2008-01-03 15:24 ` [PATCH CPA] [4/28] Add pte_clrhuge on i386 Andi Kleen
@ 2008-01-03 15:24 ` Andi Kleen
  2008-01-03 15:24 ` [PATCH CPA] [6/28] CPA: Undo white space changes Andi Kleen
                   ` (23 subsequent siblings)
  27 siblings, 0 replies; 51+ messages in thread
From: Andi Kleen @ 2008-01-03 15:24 UTC (permalink / raw)
  To: linux-kernel


64-bit already had it.

Needed for later patches.
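
For example, the CPA code uses it roughly like this to take over the
protection bits of an existing (possibly large) pte while dropping the
PSE bit (kpte is assumed here):

	pgprot_t ref_prot = pte_pgprot(pte_clrhuge(*kpte));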

Signed-off-by: Andi Kleen <ak@suse.de>

---
 include/asm-x86/pgtable-2level.h |    2 ++
 include/asm-x86/pgtable-3level.h |    2 ++
 2 files changed, 4 insertions(+)

Index: linux/include/asm-x86/pgtable-2level.h
===================================================================
--- linux.orig/include/asm-x86/pgtable-2level.h
+++ linux/include/asm-x86/pgtable-2level.h
@@ -76,6 +76,8 @@ static inline int pte_exec_kernel(pte_t 
 #define pgoff_to_pte(off) \
 	((pte_t) { (((off) & 0x1f) << 1) + (((off) >> 5) << 8) + _PAGE_FILE })
 
+#define pte_pgprot(x) __pgprot((x).pte_low & 0xfff)
+
 /* Encode and de-code a swap entry */
 #define __swp_type(x)			(((x).val >> 1) & 0x1f)
 #define __swp_offset(x)			((x).val >> 8)
Index: linux/include/asm-x86/pgtable-3level.h
===================================================================
--- linux.orig/include/asm-x86/pgtable-3level.h
+++ linux/include/asm-x86/pgtable-3level.h
@@ -172,6 +172,8 @@ static inline pmd_t pfn_pmd(unsigned lon
 		      pgprot_val(pgprot)) & __supported_pte_mask);
 }
 
+#define pte_pgprot(x) __pgprot(pte_val(x) & (0xfff | _PAGE_NX))
+
 /*
  * Bits 0, 6 and 7 are taken in the low part of the pte,
  * put the 32 bits of offset into the high part.


* [PATCH CPA] [6/28] CPA: Undo white space changes
  2008-01-03 15:24 [PATCH CPA] [1/28] Shrink __PAGE_KERNEL/__PAGE_KERNEL_EXEC on non PAE kernels Andi Kleen
                   ` (3 preceding siblings ...)
  2008-01-03 15:24 ` [PATCH CPA] [5/28] Add pte_pgprot " Andi Kleen
@ 2008-01-03 15:24 ` Andi Kleen
  2008-01-03 15:24 ` [PATCH CPA] [7/28] Extract page table dumping code from i386 fault handler into dump_pagetable() Andi Kleen
                   ` (22 subsequent siblings)
  27 siblings, 0 replies; 51+ messages in thread
From: Andi Kleen @ 2008-01-03 15:24 UTC (permalink / raw)
  To: linux-kernel


Undo random white space changes. This reverts ddb53b5735793a19dc17bcd98b050f672f28f1ea.

I simply don't have the nerves to port a 20+ patch series to the
reformatted version. The patch series changes most of these lines
anyway and drops the trailing white space there.

Since the reverted commit was a no-op, losing it for now isn't a problem.

Signed-off-by: Andi Kleen <ak@suse.de>

---
 arch/x86/mm/pageattr_32.c |  151 ++++++++++++++++++++--------------------------
 arch/x86/mm/pageattr_64.c |  143 ++++++++++++++++++-------------------------
 2 files changed, 130 insertions(+), 164 deletions(-)

Index: linux/arch/x86/mm/pageattr_32.c
===================================================================
--- linux.orig/arch/x86/mm/pageattr_32.c
+++ linux/arch/x86/mm/pageattr_32.c
@@ -1,29 +1,28 @@
-/*
- * Copyright 2002 Andi Kleen, SuSE Labs.
+/* 
+ * Copyright 2002 Andi Kleen, SuSE Labs. 
  * Thanks to Ben LaHaise for precious feedback.
- */
+ */ 
 
+#include <linux/mm.h>
+#include <linux/sched.h>
 #include <linux/highmem.h>
 #include <linux/module.h>
-#include <linux/sched.h>
 #include <linux/slab.h>
-#include <linux/mm.h>
-
+#include <asm/uaccess.h>
 #include <asm/processor.h>
 #include <asm/tlbflush.h>
-#include <asm/sections.h>
-#include <asm/uaccess.h>
 #include <asm/pgalloc.h>
+#include <asm/sections.h>
 
 static DEFINE_SPINLOCK(cpa_lock);
 static struct list_head df_list = LIST_HEAD_INIT(df_list);
 
-pte_t *lookup_address(unsigned long address)
-{
+
+pte_t *lookup_address(unsigned long address) 
+{ 
 	pgd_t *pgd = pgd_offset_k(address);
 	pud_t *pud;
 	pmd_t *pmd;
-
 	if (pgd_none(*pgd))
 		return NULL;
 	pud = pud_offset(pgd, address);
@@ -34,22 +33,21 @@ pte_t *lookup_address(unsigned long addr
 		return NULL;
 	if (pmd_large(*pmd))
 		return (pte_t *)pmd;
+        return pte_offset_kernel(pmd, address);
+} 
 
-	return pte_offset_kernel(pmd, address);
-}
-
-static struct page *
-split_large_page(unsigned long address, pgprot_t prot, pgprot_t ref_prot)
-{
+static struct page *split_large_page(unsigned long address, pgprot_t prot,
+					pgprot_t ref_prot)
+{ 
+	int i; 
 	unsigned long addr;
 	struct page *base;
 	pte_t *pbase;
-	int i;
 
 	spin_unlock_irq(&cpa_lock);
 	base = alloc_pages(GFP_KERNEL, 0);
 	spin_lock_irq(&cpa_lock);
-	if (!base)
+	if (!base) 
 		return NULL;
 
 	/*
@@ -60,24 +58,22 @@ split_large_page(unsigned long address, 
 	page_private(base) = 0;
 
 	address = __pa(address);
-	addr = address & LARGE_PAGE_MASK;
+	addr = address & LARGE_PAGE_MASK; 
 	pbase = (pte_t *)page_address(base);
 	paravirt_alloc_pt(&init_mm, page_to_pfn(base));
-
 	for (i = 0; i < PTRS_PER_PTE; i++, addr += PAGE_SIZE) {
-		set_pte(&pbase[i], pfn_pte(addr >> PAGE_SHIFT,
-					   addr == address ? prot : ref_prot));
+               set_pte(&pbase[i], pfn_pte(addr >> PAGE_SHIFT,
+                                          addr == address ? prot : ref_prot));
 	}
 	return base;
-}
+} 
 
 static void cache_flush_page(struct page *p)
-{
-	void *addr = page_address(p);
+{ 
+	void *adr = page_address(p);
 	int i;
-
 	for (i = 0; i < PAGE_SIZE; i += boot_cpu_data.x86_clflush_size)
-		clflush(addr + i);
+		clflush(adr+i);
 }
 
 static void flush_kernel_map(void *arg)
@@ -87,27 +83,23 @@ static void flush_kernel_map(void *arg)
 
 	/* High level code is not ready for clflush yet */
 	if (0 && cpu_has_clflush) {
-		list_for_each_entry(p, lh, lru)
+		list_for_each_entry (p, lh, lru)
 			cache_flush_page(p);
-	} else {
-		if (boot_cpu_data.x86_model >= 4)
-			wbinvd();
-	}
+	} else if (boot_cpu_data.x86_model >= 4)
+		wbinvd();
 
-	/*
-	 * Flush all to work around Errata in early athlons regarding
-	 * large page flushing.
+	/* Flush all to work around Errata in early athlons regarding 
+	 * large page flushing. 
 	 */
-	__flush_tlb_all();
+	__flush_tlb_all(); 	
 }
 
-static void set_pmd_pte(pte_t *kpte, unsigned long address, pte_t pte)
-{
-	unsigned long flags;
+static void set_pmd_pte(pte_t *kpte, unsigned long address, pte_t pte) 
+{ 
 	struct page *page;
+	unsigned long flags;
 
-	/* change init_mm */
-	set_pte_atomic(kpte, pte);
+	set_pte_atomic(kpte, pte); 	/* change init_mm */
 	if (SHARED_KERNEL_PMD)
 		return;
 
@@ -116,7 +108,6 @@ static void set_pmd_pte(pte_t *kpte, uns
 		pgd_t *pgd;
 		pud_t *pud;
 		pmd_t *pmd;
-
 		pgd = (pgd_t *)page_address(page) + pgd_index(address);
 		pud = pud_offset(pgd, address);
 		pmd = pmd_offset(pud, address);
@@ -125,9 +116,9 @@ static void set_pmd_pte(pte_t *kpte, uns
 	spin_unlock_irqrestore(&pgd_lock, flags);
 }
 
-/*
- * No more special protections in this 2/4MB area - revert to a large
- * page again.
+/* 
+ * No more special protections in this 2/4MB area - revert to a
+ * large page again. 
  */
 static inline void revert_page(struct page *kpte_page, unsigned long address)
 {
@@ -151,11 +142,12 @@ static inline void save_page(struct page
 		list_add(&kpte_page->lru, &df_list);
 }
 
-static int __change_page_attr(struct page *page, pgprot_t prot)
-{
-	struct page *kpte_page;
+static int
+__change_page_attr(struct page *page, pgprot_t prot)
+{ 
+	pte_t *kpte; 
 	unsigned long address;
-	pte_t *kpte;
+	struct page *kpte_page;
 
 	BUG_ON(PageHighMem(page));
 	address = (unsigned long)page_address(page);
@@ -163,17 +155,16 @@ static int __change_page_attr(struct pag
 	kpte = lookup_address(address);
 	if (!kpte)
 		return -EINVAL;
-
 	kpte_page = virt_to_page(kpte);
 	BUG_ON(PageLRU(kpte_page));
 	BUG_ON(PageCompound(kpte_page));
 
-	if (pgprot_val(prot) != pgprot_val(PAGE_KERNEL)) {
+	if (pgprot_val(prot) != pgprot_val(PAGE_KERNEL)) { 
 		if (!pte_huge(*kpte)) {
-			set_pte_atomic(kpte, mk_pte(page, prot));
+			set_pte_atomic(kpte, mk_pte(page, prot)); 
 		} else {
-			struct page *split;
 			pgprot_t ref_prot;
+			struct page *split;
 
 			ref_prot =
 			((address & LARGE_PAGE_MASK) < (unsigned long)&_etext)
@@ -181,19 +172,16 @@ static int __change_page_attr(struct pag
 			split = split_large_page(address, prot, ref_prot);
 			if (!split)
 				return -ENOMEM;
-
-			set_pmd_pte(kpte, address, mk_pte(split, ref_prot));
+			set_pmd_pte(kpte,address,mk_pte(split, ref_prot));
 			kpte_page = split;
 		}
 		page_private(kpte_page)++;
-	} else {
-		if (!pte_huge(*kpte)) {
-			set_pte_atomic(kpte, mk_pte(page, PAGE_KERNEL));
-			BUG_ON(page_private(kpte_page) == 0);
-			page_private(kpte_page)--;
-		} else
-			BUG();
-	}
+	} else if (!pte_huge(*kpte)) {
+		set_pte_atomic(kpte, mk_pte(page, PAGE_KERNEL));
+		BUG_ON(page_private(kpte_page) == 0);
+		page_private(kpte_page)--;
+	} else
+		BUG();
 
 	/*
 	 * If the pte was reserved, it means it was created at boot
@@ -209,7 +197,7 @@ static int __change_page_attr(struct pag
 		}
 	}
 	return 0;
-}
+} 
 
 static inline void flush_map(struct list_head *l)
 {
@@ -223,33 +211,32 @@ static inline void flush_map(struct list
  * than write-back somewhere - some CPUs do not like it when mappings with
  * different caching policies exist. This changes the page attributes of the
  * in kernel linear mapping too.
- *
+ * 
  * The caller needs to ensure that there are no conflicting mappings elsewhere.
  * This function only deals with the kernel linear map.
- *
+ * 
  * Caller must call global_flush_tlb() after this.
  */
 int change_page_attr(struct page *page, int numpages, pgprot_t prot)
 {
+	int err = 0; 
+	int i; 
 	unsigned long flags;
-	int err = 0, i;
 
 	spin_lock_irqsave(&cpa_lock, flags);
-	for (i = 0; i < numpages; i++, page++) {
+	for (i = 0; i < numpages; i++, page++) { 
 		err = __change_page_attr(page, prot);
-		if (err)
-			break;
-	}
+		if (err) 
+			break; 
+	} 	
 	spin_unlock_irqrestore(&cpa_lock, flags);
-
 	return err;
 }
-EXPORT_SYMBOL(change_page_attr);
 
 void global_flush_tlb(void)
 {
-	struct page *pg, *next;
 	struct list_head l;
+	struct page *pg, *next;
 
 	BUG_ON(irqs_disabled());
 
@@ -266,28 +253,26 @@ void global_flush_tlb(void)
 		__free_page(pg);
 	}
 }
-EXPORT_SYMBOL(global_flush_tlb);
 
 #ifdef CONFIG_DEBUG_PAGEALLOC
 void kernel_map_pages(struct page *page, int numpages, int enable)
 {
 	if (PageHighMem(page))
 		return;
-	if (!enable) {
+	if (!enable)
 		debug_check_no_locks_freed(page_address(page),
 					   numpages * PAGE_SIZE);
-	}
 
-	/*
-	 * the return value is ignored - the calls cannot fail,
+	/* the return value is ignored - the calls cannot fail,
 	 * large pages are disabled at boot time.
 	 */
 	change_page_attr(page, numpages, enable ? PAGE_KERNEL : __pgprot(0));
-
-	/*
-	 * we should perform an IPI and flush all tlbs,
+	/* we should perform an IPI and flush all tlbs,
 	 * but that can deadlock->flush only current cpu.
 	 */
 	__flush_tlb_all();
 }
 #endif
+
+EXPORT_SYMBOL(change_page_attr);
+EXPORT_SYMBOL(global_flush_tlb);
Index: linux/arch/x86/mm/pageattr_64.c
===================================================================
--- linux.orig/arch/x86/mm/pageattr_64.c
+++ linux/arch/x86/mm/pageattr_64.c
@@ -1,54 +1,48 @@
-/*
- * Copyright 2002 Andi Kleen, SuSE Labs.
+/* 
+ * Copyright 2002 Andi Kleen, SuSE Labs. 
  * Thanks to Ben LaHaise for precious feedback.
- */
+ */ 
 
+#include <linux/mm.h>
+#include <linux/sched.h>
 #include <linux/highmem.h>
 #include <linux/module.h>
-#include <linux/sched.h>
 #include <linux/slab.h>
-#include <linux/mm.h>
-
+#include <asm/uaccess.h>
 #include <asm/processor.h>
 #include <asm/tlbflush.h>
-#include <asm/uaccess.h>
 #include <asm/io.h>
 
 pte_t *lookup_address(unsigned long address)
-{
+{ 
 	pgd_t *pgd = pgd_offset_k(address);
 	pud_t *pud;
 	pmd_t *pmd;
 	pte_t *pte;
-
 	if (pgd_none(*pgd))
 		return NULL;
 	pud = pud_offset(pgd, address);
 	if (!pud_present(*pud))
-		return NULL;
+		return NULL; 
 	pmd = pmd_offset(pud, address);
 	if (!pmd_present(*pmd))
-		return NULL;
+		return NULL; 
 	if (pmd_large(*pmd))
 		return (pte_t *)pmd;
-
 	pte = pte_offset_kernel(pmd, address);
 	if (pte && !pte_present(*pte))
-		pte = NULL;
-
+		pte = NULL; 
 	return pte;
-}
+} 
 
-static struct page *
-split_large_page(unsigned long address, pgprot_t prot, pgprot_t ref_prot)
-{
+static struct page *split_large_page(unsigned long address, pgprot_t prot,
+				     pgprot_t ref_prot)
+{ 
+	int i; 
 	unsigned long addr;
-	struct page *base;
+	struct page *base = alloc_pages(GFP_KERNEL, 0);
 	pte_t *pbase;
-	int i;
-
-	base = alloc_pages(GFP_KERNEL, 0);
-	if (!base)
+	if (!base) 
 		return NULL;
 	/*
 	 * page_private is used to track the number of entries in
@@ -58,21 +52,20 @@ split_large_page(unsigned long address, 
 	page_private(base) = 0;
 
 	address = __pa(address);
-	addr = address & LARGE_PAGE_MASK;
+	addr = address & LARGE_PAGE_MASK; 
 	pbase = (pte_t *)page_address(base);
 	for (i = 0; i < PTRS_PER_PTE; i++, addr += PAGE_SIZE) {
-		pbase[i] = pfn_pte(addr >> PAGE_SHIFT,
+		pbase[i] = pfn_pte(addr >> PAGE_SHIFT, 
 				   addr == address ? prot : ref_prot);
 	}
 	return base;
-}
+} 
 
-void clflush_cache_range(void *addr, int size)
+void clflush_cache_range(void *adr, int size)
 {
 	int i;
-
 	for (i = 0; i < size; i += boot_cpu_data.x86_clflush_size)
-		clflush(addr+i);
+		clflush(adr+i);
 }
 
 static void flush_kernel_map(void *arg)
@@ -83,20 +76,17 @@ static void flush_kernel_map(void *arg)
 	/* When clflush is available always use it because it is
 	   much cheaper than WBINVD. */
 	/* clflush is still broken. Disable for now. */
-	if (1 || !cpu_has_clflush) {
+	if (1 || !cpu_has_clflush)
 		asm volatile("wbinvd" ::: "memory");
-	} else {
-		list_for_each_entry(pg, l, lru) {
-			void *addr = page_address(pg);
-
-			clflush_cache_range(addr, PAGE_SIZE);
-		}
+	else list_for_each_entry(pg, l, lru) {
+		void *adr = page_address(pg);
+		clflush_cache_range(adr, PAGE_SIZE);
 	}
 	__flush_tlb_all();
 }
 
 static inline void flush_map(struct list_head *l)
-{
+{	
 	on_each_cpu(flush_kernel_map, l, 1, 1);
 }
 
@@ -108,56 +98,52 @@ static inline void save_page(struct page
 		list_add(&fpage->lru, &deferred_pages);
 }
 
-/*
+/* 
  * No more special protections in this 2/4MB area - revert to a
- * large page again.
+ * large page again. 
  */
 static void revert_page(unsigned long address, pgprot_t ref_prot)
 {
-	unsigned long pfn;
 	pgd_t *pgd;
 	pud_t *pud;
 	pmd_t *pmd;
 	pte_t large_pte;
+	unsigned long pfn;
 
 	pgd = pgd_offset_k(address);
 	BUG_ON(pgd_none(*pgd));
-	pud = pud_offset(pgd, address);
+	pud = pud_offset(pgd,address);
 	BUG_ON(pud_none(*pud));
 	pmd = pmd_offset(pud, address);
 	BUG_ON(pmd_val(*pmd) & _PAGE_PSE);
 	pfn = (__pa(address) & LARGE_PAGE_MASK) >> PAGE_SHIFT;
 	large_pte = pfn_pte(pfn, ref_prot);
 	large_pte = pte_mkhuge(large_pte);
-
 	set_pte((pte_t *)pmd, large_pte);
-}
+}      
 
 static int
 __change_page_attr(unsigned long address, unsigned long pfn, pgprot_t prot,
-		   pgprot_t ref_prot)
-{
+				   pgprot_t ref_prot)
+{ 
+	pte_t *kpte; 
 	struct page *kpte_page;
 	pgprot_t ref_prot2;
-	pte_t *kpte;
 
 	kpte = lookup_address(address);
-	if (!kpte)
-		return 0;
-
+	if (!kpte) return 0;
 	kpte_page = virt_to_page(((unsigned long)kpte) & PAGE_MASK);
 	BUG_ON(PageLRU(kpte_page));
 	BUG_ON(PageCompound(kpte_page));
-	if (pgprot_val(prot) != pgprot_val(ref_prot)) {
+	if (pgprot_val(prot) != pgprot_val(ref_prot)) { 
 		if (!pte_huge(*kpte)) {
 			set_pte(kpte, pfn_pte(pfn, prot));
 		} else {
-			/*
+ 			/*
 			 * split_large_page will take the reference for this
 			 * change_page_attr on the split page.
-			 */
+ 			 */
 			struct page *split;
-
 			ref_prot2 = pte_pgprot(pte_clrhuge(*kpte));
 			split = split_large_page(address, prot, ref_prot2);
 			if (!split)
@@ -167,23 +153,21 @@ __change_page_attr(unsigned long address
 			kpte_page = split;
 		}
 		page_private(kpte_page)++;
-	} else {
-		if (!pte_huge(*kpte)) {
-			set_pte(kpte, pfn_pte(pfn, ref_prot));
-			BUG_ON(page_private(kpte_page) == 0);
-			page_private(kpte_page)--;
-		} else
-			BUG();
-	}
+	} else if (!pte_huge(*kpte)) {
+		set_pte(kpte, pfn_pte(pfn, ref_prot));
+		BUG_ON(page_private(kpte_page) == 0);
+		page_private(kpte_page)--;
+	} else
+		BUG();
 
 	/* on x86-64 the direct mapping set at boot is not using 4k pages */
-	BUG_ON(PageReserved(kpte_page));
+ 	BUG_ON(PageReserved(kpte_page));
 
 	save_page(kpte_page);
 	if (page_private(kpte_page) == 0)
 		revert_page(address, ref_prot);
 	return 0;
-}
+} 
 
 /*
  * Change the page attributes of an page in the linear mapping.
@@ -192,19 +176,19 @@ __change_page_attr(unsigned long address
  * than write-back somewhere - some CPUs do not like it when mappings with
  * different caching policies exist. This changes the page attributes of the
  * in kernel linear mapping too.
- *
+ * 
  * The caller needs to ensure that there are no conflicting mappings elsewhere.
  * This function only deals with the kernel linear map.
- *
+ * 
  * Caller must call global_flush_tlb() after this.
  */
 int change_page_attr_addr(unsigned long address, int numpages, pgprot_t prot)
 {
-	int err = 0, kernel_map = 0, i;
-
-	if (address >= __START_KERNEL_map &&
-			address < __START_KERNEL_map + KERNEL_TEXT_SIZE) {
+	int err = 0, kernel_map = 0;
+	int i; 
 
+	if (address >= __START_KERNEL_map
+	    && address < __START_KERNEL_map + KERNEL_TEXT_SIZE) {
 		address = (unsigned long)__va(__pa(address));
 		kernel_map = 1;
 	}
@@ -214,8 +198,7 @@ int change_page_attr_addr(unsigned long 
 		unsigned long pfn = __pa(address) >> PAGE_SHIFT;
 
 		if (!kernel_map || pte_present(pfn_pte(0, prot))) {
-			err = __change_page_attr(address, pfn, prot,
-						PAGE_KERNEL);
+			err = __change_page_attr(address, pfn, prot, PAGE_KERNEL);
 			if (err)
 				break;
 		}
@@ -224,16 +207,14 @@ int change_page_attr_addr(unsigned long 
 		if (__pa(address) < KERNEL_TEXT_SIZE) {
 			unsigned long addr2;
 			pgprot_t prot2;
-
 			addr2 = __START_KERNEL_map + __pa(address);
 			/* Make sure the kernel mappings stay executable */
 			prot2 = pte_pgprot(pte_mkexec(pfn_pte(0, prot)));
 			err = __change_page_attr(addr2, pfn, prot2,
 						 PAGE_KERNEL_EXEC);
-		}
-	}
-	up_write(&init_mm.mmap_sem);
-
+		} 
+	} 	
+	up_write(&init_mm.mmap_sem); 
 	return err;
 }
 
@@ -241,13 +222,11 @@ int change_page_attr_addr(unsigned long 
 int change_page_attr(struct page *page, int numpages, pgprot_t prot)
 {
 	unsigned long addr = (unsigned long)page_address(page);
-
 	return change_page_attr_addr(addr, numpages, prot);
 }
-EXPORT_SYMBOL(change_page_attr);
 
 void global_flush_tlb(void)
-{
+{ 
 	struct page *pg, *next;
 	struct list_head l;
 
@@ -269,6 +248,8 @@ void global_flush_tlb(void)
 			continue;
 		ClearPagePrivate(pg);
 		__free_page(pg);
-	}
-}
+	} 
+} 
+
+EXPORT_SYMBOL(change_page_attr);
 EXPORT_SYMBOL(global_flush_tlb);


* [PATCH CPA] [7/28] Extract page table dumping code from i386 fault handler into dump_pagetable()
  2008-01-03 15:24 [PATCH CPA] [1/28] Shrink __PAGE_KERNEL/__PAGE_KERNEL_EXEC on non PAE kernels Andi Kleen
                   ` (4 preceding siblings ...)
  2008-01-03 15:24 ` [PATCH CPA] [6/28] CPA: Undo white space changes Andi Kleen
@ 2008-01-03 15:24 ` Andi Kleen
  2008-01-03 15:24 ` [PATCH CPA] [8/28] CPA: Return the page table level in lookup_address() Andi Kleen
                   ` (21 subsequent siblings)
  27 siblings, 0 replies; 51+ messages in thread
From: Andi Kleen @ 2008-01-03 15:24 UTC (permalink / raw)
  To: linux-kernel


Similar to x86-64. This is useful in other situations where we want
the page tables dumped too.

Besides, anything that makes the i386 do_page_fault() shorter is good.
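
A minimal usage sketch (vaddr is just an assumed kernel virtual address):

	dump_pagetable((unsigned long)vaddr);	/* prints the pde/pte entries mapping vaddr */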

Signed-off-by: Andi Kleen <ak@suse.de>

---
 arch/x86/mm/fault_32.c |   72 ++++++++++++++++++++++++++-----------------------
 1 file changed, 39 insertions(+), 33 deletions(-)

Index: linux/arch/x86/mm/fault_32.c
===================================================================
--- linux.orig/arch/x86/mm/fault_32.c
+++ linux/arch/x86/mm/fault_32.c
@@ -28,6 +28,44 @@
 #include <asm/desc.h>
 #include <asm/segment.h>
 
+void dump_pagetable(unsigned long address)
+{
+	typeof(pte_val(__pte(0))) page;
+
+	page = read_cr3();
+	page = ((__typeof__(page) *) __va(page))[address >> PGDIR_SHIFT];
+#ifdef CONFIG_X86_PAE
+	printk("*pdpt = %016Lx ", page);
+	if ((page >> PAGE_SHIFT) < max_low_pfn
+	    && page & _PAGE_PRESENT) {
+		page &= PAGE_MASK;
+		page = ((__typeof__(page) *) __va(page))[(address >> PMD_SHIFT)
+		                                         & (PTRS_PER_PMD - 1)];
+		printk(KERN_CONT "*pde = %016Lx ", page);
+		page &= ~_PAGE_NX;
+	}
+#else
+	printk("*pde = %08lx ", page);
+#endif
+
+	/*
+	 * We must not directly access the pte in the highpte
+	 * case if the page table is located in highmem.
+	 * And let's rather not kmap-atomic the pte, just in case
+	 * it's allocated already.
+	 */
+	if ((page >> PAGE_SHIFT) < max_low_pfn
+	    && (page & _PAGE_PRESENT)
+	    && !(page & _PAGE_PSE)) {
+		page &= PAGE_MASK;
+		page = ((__typeof__(page) *) __va(page))[(address >> PAGE_SHIFT)
+		                                         & (PTRS_PER_PTE - 1)];
+		printk("*pte = %0*Lx ", sizeof(page)*2, (u64)page);
+	}
+
+	printk("\n");
+}
+
 /*
  * Page fault error code bits
  *	bit 0 == 0 means no page found, 1 means protection fault
@@ -602,7 +640,6 @@ no_context:
 	bust_spinlocks(1);
 
 	if (oops_may_print()) {
-		__typeof__(pte_val(__pte(0))) page;
 
 #ifdef CONFIG_X86_PAE
 		if (error_code & 16) {
@@ -623,38 +660,7 @@ no_context:
 		printk(" at virtual address %08lx\n", address);
 		printk(KERN_ALERT "printing ip: %08lx ", regs->ip);
 
-		page = read_cr3();
-		page = ((__typeof__(page) *) __va(page))[address >> PGDIR_SHIFT];
-#ifdef CONFIG_X86_PAE
-		printk("*pdpt = %016Lx ", page);
-		if ((page >> PAGE_SHIFT) < max_low_pfn
-		    && page & _PAGE_PRESENT) {
-			page &= PAGE_MASK;
-			page = ((__typeof__(page) *) __va(page))[(address >> PMD_SHIFT)
-			                                         & (PTRS_PER_PMD - 1)];
-			printk(KERN_CONT "*pde = %016Lx ", page);
-			page &= ~_PAGE_NX;
-		}
-#else
-		printk("*pde = %08lx ", page);
-#endif
-
-		/*
-		 * We must not directly access the pte in the highpte
-		 * case if the page table is located in highmem.
-		 * And let's rather not kmap-atomic the pte, just in case
-		 * it's allocated already.
-		 */
-		if ((page >> PAGE_SHIFT) < max_low_pfn
-		    && (page & _PAGE_PRESENT)
-		    && !(page & _PAGE_PSE)) {
-			page &= PAGE_MASK;
-			page = ((__typeof__(page) *) __va(page))[(address >> PAGE_SHIFT)
-			                                         & (PTRS_PER_PTE - 1)];
-			printk("*pte = %0*Lx ", sizeof(page)*2, (u64)page);
-		}
-
-		printk("\n");
+		dump_pagetable(address);
 	}
 
 	tsk->thread.cr2 = address;


* [PATCH CPA] [8/28] CPA: Return the page table level in lookup_address()
  2008-01-03 15:24 [PATCH CPA] [1/28] Shrink __PAGE_KERNEL/__PAGE_KERNEL_EXEC on non PAE kernels Andi Kleen
                   ` (5 preceding siblings ...)
  2008-01-03 15:24 ` [PATCH CPA] [7/28] Extract page table dumping code from i386 fault handler into dump_pagetable() Andi Kleen
@ 2008-01-03 15:24 ` Andi Kleen
  2008-01-03 15:24 ` [PATCH CPA] [9/28] CPA: Add simple self test at boot Andi Kleen
                   ` (20 subsequent siblings)
  27 siblings, 0 replies; 51+ messages in thread
From: Andi Kleen @ 2008-01-03 15:24 UTC (permalink / raw)
  To: linux-kernel


Needed for the next change.

Also update all the callers.
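
A rough sketch of the new calling convention (level numbering as in the
32-bit non-PAE hunk below; address is assumed):

	int level;
	pte_t *pte = lookup_address(address, &level);

	if (!pte)
		;	/* nothing mapped there */
	else if (level == 2)
		;	/* pte really points at a large-page pmd entry */
	else
		;	/* level == 3: a regular 4K pte */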

Signed-off-by: Andi Kleen <ak@suse.de>

---
 arch/x86/mm/fault_32.c       |    3 ++-
 arch/x86/mm/init_32.c        |    3 ++-
 arch/x86/mm/pageattr_32.c    |    7 +++++--
 arch/x86/mm/pageattr_64.c    |    7 +++++--
 arch/x86/xen/mmu.c           |    9 ++++++---
 include/asm-x86/pgtable_32.h |    2 +-
 include/asm-x86/pgtable_64.h |    2 +-
 7 files changed, 22 insertions(+), 11 deletions(-)

Index: linux/arch/x86/mm/pageattr_64.c
===================================================================
--- linux.orig/arch/x86/mm/pageattr_64.c
+++ linux/arch/x86/mm/pageattr_64.c
@@ -13,7 +13,7 @@
 #include <asm/tlbflush.h>
 #include <asm/io.h>
 
-pte_t *lookup_address(unsigned long address)
+pte_t *lookup_address(unsigned long address, int *level)
 { 
 	pgd_t *pgd = pgd_offset_k(address);
 	pud_t *pud;
@@ -27,8 +27,10 @@ pte_t *lookup_address(unsigned long addr
 	pmd = pmd_offset(pud, address);
 	if (!pmd_present(*pmd))
 		return NULL; 
+	*level = 3;
 	if (pmd_large(*pmd))
 		return (pte_t *)pmd;
+	*level = 4;
 	pte = pte_offset_kernel(pmd, address);
 	if (pte && !pte_present(*pte))
 		pte = NULL; 
@@ -129,8 +131,9 @@ __change_page_attr(unsigned long address
 	pte_t *kpte; 
 	struct page *kpte_page;
 	pgprot_t ref_prot2;
+	int level;
 
-	kpte = lookup_address(address);
+	kpte = lookup_address(address, &level);
 	if (!kpte) return 0;
 	kpte_page = virt_to_page(((unsigned long)kpte) & PAGE_MASK);
 	BUG_ON(PageLRU(kpte_page));
Index: linux/include/asm-x86/pgtable_64.h
===================================================================
--- linux.orig/include/asm-x86/pgtable_64.h
+++ linux/include/asm-x86/pgtable_64.h
@@ -409,7 +409,7 @@ extern struct list_head pgd_list;
 
 extern int kern_addr_valid(unsigned long addr); 
 
-pte_t *lookup_address(unsigned long addr);
+pte_t *lookup_address(unsigned long addr, int *level);
 
 #define io_remap_pfn_range(vma, vaddr, pfn, size, prot)		\
 		remap_pfn_range(vma, vaddr, pfn, size, prot)
Index: linux/arch/x86/mm/pageattr_32.c
===================================================================
--- linux.orig/arch/x86/mm/pageattr_32.c
+++ linux/arch/x86/mm/pageattr_32.c
@@ -18,7 +18,7 @@ static DEFINE_SPINLOCK(cpa_lock);
 static struct list_head df_list = LIST_HEAD_INIT(df_list);
 
 
-pte_t *lookup_address(unsigned long address) 
+pte_t *lookup_address(unsigned long address, int *level)
 { 
 	pgd_t *pgd = pgd_offset_k(address);
 	pud_t *pud;
@@ -31,8 +31,10 @@ pte_t *lookup_address(unsigned long addr
 	pmd = pmd_offset(pud, address);
 	if (pmd_none(*pmd))
 		return NULL;
+	*level = 2;
 	if (pmd_large(*pmd))
 		return (pte_t *)pmd;
+	*level = 3;
         return pte_offset_kernel(pmd, address);
 } 
 
@@ -148,11 +150,12 @@ __change_page_attr(struct page *page, pg
 	pte_t *kpte; 
 	unsigned long address;
 	struct page *kpte_page;
+	int level;
 
 	BUG_ON(PageHighMem(page));
 	address = (unsigned long)page_address(page);
 
-	kpte = lookup_address(address);
+	kpte = lookup_address(address, &level);
 	if (!kpte)
 		return -EINVAL;
 	kpte_page = virt_to_page(kpte);
Index: linux/include/asm-x86/pgtable_32.h
===================================================================
--- linux.orig/include/asm-x86/pgtable_32.h
+++ linux/include/asm-x86/pgtable_32.h
@@ -443,7 +443,7 @@ static inline pte_t pte_modify(pte_t pte
  * NOTE: the return type is pte_t but if the pmd is PSE then we return it
  * as a pte too.
  */
-extern pte_t *lookup_address(unsigned long address);
+extern pte_t *lookup_address(unsigned long address, int *level);
 
 /*
  * Make a given kernel text page executable/non-executable.
Index: linux/arch/x86/mm/fault_32.c
===================================================================
--- linux.orig/arch/x86/mm/fault_32.c
+++ linux/arch/x86/mm/fault_32.c
@@ -643,7 +643,8 @@ no_context:
 
 #ifdef CONFIG_X86_PAE
 		if (error_code & 16) {
-			pte_t *pte = lookup_address(address);
+			int level;
+			pte_t *pte = lookup_address(address, &level);
 
 			if (pte && pte_present(*pte) && !pte_exec_kernel(*pte))
 				printk(KERN_CRIT "kernel tried to execute "
Index: linux/arch/x86/mm/init_32.c
===================================================================
--- linux.orig/arch/x86/mm/init_32.c
+++ linux/arch/x86/mm/init_32.c
@@ -528,11 +528,12 @@ int __init set_kernel_exec(unsigned long
 {
 	pte_t *pte;
 	int ret = 1;
+	int level;
 
 	if (!nx_enabled)
 		goto out;
 
-	pte = lookup_address(vaddr);
+	pte = lookup_address(vaddr, &level);
 	BUG_ON(!pte);
 
 	if (!pte_exec_kernel(*pte))
Index: linux/arch/x86/xen/mmu.c
===================================================================
--- linux.orig/arch/x86/xen/mmu.c
+++ linux/arch/x86/xen/mmu.c
@@ -58,7 +58,8 @@
 
 xmaddr_t arbitrary_virt_to_machine(unsigned long address)
 {
-	pte_t *pte = lookup_address(address);
+	int level;
+	pte_t *pte = lookup_address(address, &level);
 	unsigned offset = address & PAGE_MASK;
 
 	BUG_ON(pte == NULL);
@@ -70,8 +71,9 @@ void make_lowmem_page_readonly(void *vad
 {
 	pte_t *pte, ptev;
 	unsigned long address = (unsigned long)vaddr;
+	int level;
 
-	pte = lookup_address(address);
+	pte = lookup_address(address, &level);
 	BUG_ON(pte == NULL);
 
 	ptev = pte_wrprotect(*pte);
@@ -84,8 +86,9 @@ void make_lowmem_page_readwrite(void *va
 {
 	pte_t *pte, ptev;
 	unsigned long address = (unsigned long)vaddr;
+	int level;
 
-	pte = lookup_address(address);
+	pte = lookup_address(address, &level);
 	BUG_ON(pte == NULL);
 
 	ptev = pte_mkwrite(*pte);


* [PATCH CPA] [9/28] CPA: Add simple self test at boot
  2008-01-03 15:24 [PATCH CPA] [1/28] Shrink __PAGE_KERNEL/__PAGE_KERNEL_EXEC on non PAE kernels Andi Kleen
                   ` (6 preceding siblings ...)
  2008-01-03 15:24 ` [PATCH CPA] [8/28] CPA: Return the page table level in lookup_address() Andi Kleen
@ 2008-01-03 15:24 ` Andi Kleen
  2008-01-03 15:24 ` [PATCH CPA] [10/28] CPA: Change kernel_map_pages to not use c_p_a() Andi Kleen
                   ` (19 subsequent siblings)
  27 siblings, 0 replies; 51+ messages in thread
From: Andi Kleen @ 2008-01-03 15:24 UTC (permalink / raw)
  To: linux-kernel


Since change_page_attr() is tricky code, it is good to have some regression
test code. This patch maps and unmaps some random pages in the direct mapping
at boot and then dumps the state.

One disadvantage is that the results currently have to be checked
manually; it would be better if it had a global passed/failed message.

Add it with a CONFIG option.

Optional patch, but I find it useful.

Signed-off-by: Andi Kleen <ak@suse.de>

---
 arch/x86/mm/Makefile_32     |    1 
 arch/x86/mm/Makefile_64     |    1 
 arch/x86/mm/pageattr-test.c |  187 ++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 189 insertions(+)

Index: linux/arch/x86/mm/Makefile_64
===================================================================
--- linux.orig/arch/x86/mm/Makefile_64
+++ linux/arch/x86/mm/Makefile_64
@@ -7,3 +7,4 @@ obj-$(CONFIG_HUGETLB_PAGE) += hugetlbpag
 obj-$(CONFIG_NUMA) += numa_64.o
 obj-$(CONFIG_K8_NUMA) += k8topology_64.o
 obj-$(CONFIG_ACPI_NUMA) += srat_64.o
+obj-$(CONFIG_CPA_DEBUG) += pageattr-test.o
Index: linux/arch/x86/mm/pageattr-test.c
===================================================================
--- /dev/null
+++ linux/arch/x86/mm/pageattr-test.c
@@ -0,0 +1,187 @@
+/*
+ * self test for change_page_attr.
+ *
+ * Clears the global bit on random pages in the direct mapping, then reverts and
+ * compares page tables forwards and afterwards.
+ */
+
+#include <linux/mm.h>
+#include <linux/random.h>
+#include <linux/kernel.h>
+#include <linux/init.h>
+#include <linux/bootmem.h>
+#include <asm/cacheflush.h>
+#include <asm/pgtable.h>
+#include <asm/kdebug.h>
+
+enum {
+	NTEST = 25,
+#ifdef CONFIG_X86_64
+	LOWEST_LEVEL = 4,
+	LPS = (1 << PMD_SHIFT),
+#elif defined(CONFIG_X86_PAE)
+	LOWEST_LEVEL = 3,
+	LPS = (1 << PMD_SHIFT),
+#else
+	LOWEST_LEVEL = 3, /* lookup_address lies here */
+	LPS = (1 << 22),
+#endif
+	GPS = (1<<30)
+};
+
+#ifdef CONFIG_X86_64
+#include <asm/proto.h>
+#define max_mapped end_pfn_map
+#else
+#define max_mapped max_low_pfn
+#endif
+
+struct split_state {
+	long lpg, gpg, spg, nx;
+};
+
+static __init int print_split(struct split_state *s)
+{
+	int printed = 0;
+	long i, expected, missed = 0;
+	s->lpg = s->gpg = s->spg = s->nx = 0;
+	for (i = 0; i < max_mapped; ) {
+		int level;
+		pte_t *pte;
+		unsigned long adr = (unsigned long)__va(i << PAGE_SHIFT);
+
+		pte = lookup_address(adr, &level);
+		if (!pte) {
+			if (!printed) {
+				dump_pagetable(adr);
+				printk("CPA %lx no pte level %d\n", adr, level);
+				printed = 1;
+			}
+			missed++;
+			i++;
+			continue;
+		}
+
+		if (level == 2 && sizeof(long) == 8) {
+			s->gpg++;
+			i += GPS/PAGE_SIZE;
+		} else if (level != LOWEST_LEVEL) {
+			s->lpg++;
+			i += LPS/PAGE_SIZE;
+		} else {
+			s->spg++;
+			i++;
+		}
+		if (pte_val(*pte) & _PAGE_NX)
+			s->nx++;
+	}
+	printk("CPA direct mapping small %lu large %lu giga %lu nx %lu missed %lu\n",
+		s->spg, s->lpg, s->gpg, s->nx, missed);
+	expected = (s->gpg*GPS + s->lpg*LPS)/PAGE_SIZE + s->spg + missed;
+	if (expected != max_mapped) {
+		printk("CPA max_mapped %lu but expected %lu\n",
+			max_mapped, expected);
+		return 1;
+	}
+	return 0;
+}
+
+static __init int state_same(struct split_state *a, struct split_state *b)
+{
+	return a->lpg == b->lpg && a->gpg == b->gpg && a->spg == b->spg &&
+			a->nx == b->nx;
+}
+
+static unsigned long addr[NTEST] __initdata;
+static unsigned len[NTEST] __initdata;
+
+/* Change the global bit on random pages in the direct mapping */
+static __init int exercise_pageattr(void)
+{
+	int i;
+	pte_t *pte;
+	int level;
+	int err;
+	struct split_state sa, sb, sc;
+	int failed = 0;
+
+	printk("CPA exercising pageattr\n");
+	failed += print_split(&sa);
+	srandom32(100);
+	for (i = 0; i < NTEST; i++) {
+		unsigned long pfn = random32() % max_mapped;
+		addr[i] = (unsigned long)__va(pfn << PAGE_SHIFT);
+		len[i] = min_t(unsigned long, random32() % 100, max_mapped - pfn);
+		pte = lookup_address(addr[i], &level);
+		if (!pte) {
+			printk("lookup of %lx failed\n", addr[i]);
+			failed++;
+			continue;
+		}
+		if (pgprot_val(pte_pgprot(*pte)) == 0) {
+			addr[i] = 0;
+			continue;
+		}
+
+		err = change_page_attr(virt_to_page(addr[i]), len[i],
+			    pte_pgprot(pte_clrhuge(pte_clrglobal(*pte))));
+		if (err < 0) {
+			printk("cpa %d failed %d\n", i, err);
+			failed++;
+		}
+
+		pte = lookup_address(addr[i], &level);
+		if (!pte || pte_global(*pte) || pte_huge(*pte)) {
+			printk("%lx: bad pte %Lx\n", addr[i],
+				pte ? (u64)pte_val(*pte) : 0ULL);
+			failed++;
+		}
+		if (level != LOWEST_LEVEL) {
+			printk("%lx: unexpected level %d\n", addr[i], level);
+			failed++;
+		}
+
+	}
+	global_flush_tlb();
+
+	failed += print_split(&sb);
+
+	printk("CPA reverting everything\n");
+	for (i = 0; i < NTEST; i++) {
+		if (!addr[i])
+			continue;
+		pte = lookup_address(addr[i], &level);
+		if (!pte) {
+			printk("lookup of %lx failed\n", addr[i]);
+			failed++;
+			continue;
+		}
+		err = change_page_attr(virt_to_page(addr[i]), len[i],
+					  pte_pgprot(pte_mkglobal(*pte)));
+		if (err < 0) {
+			printk("cpa2 failed: %d\n", err);
+			failed++;
+		}
+		pte = lookup_address(addr[i], &level);
+		if (!pte || !pte_global(*pte)) {
+			printk("%lx: bad pte %Lx\n", addr[i],
+			       pte ? (u64)pte_val(*pte) : 0ULL);
+			failed++;
+		}
+
+	}
+	global_flush_tlb();
+
+	failed += print_split(&sc);
+	if (!state_same(&sa, &sc))
+		failed++;
+
+	if (failed)
+		printk("CPA selftests NOT PASSED. Please report.\n");
+	else
+		printk("CPA selftests PASSED\n");
+
+	return 0;
+}
+
+module_init(exercise_pageattr);
Index: linux/arch/x86/mm/Makefile_32
===================================================================
--- linux.orig/arch/x86/mm/Makefile_32
+++ linux/arch/x86/mm/Makefile_32
@@ -8,3 +8,4 @@ obj-$(CONFIG_NUMA) += discontig_32.o
 obj-$(CONFIG_HUGETLB_PAGE) += hugetlbpage.o
 obj-$(CONFIG_HIGHMEM) += highmem_32.o
 obj-$(CONFIG_BOOT_IOREMAP) += boot_ioremap_32.o
+obj-$(CONFIG_CPA_DEBUG) += pageattr-test.o


* [PATCH CPA] [10/28] CPA: Change kernel_map_pages to not use c_p_a()
  2008-01-03 15:24 [PATCH CPA] [1/28] Shrink __PAGE_KERNEL/__PAGE_KERNEL_EXEC on non PAE kernels Andi Kleen
                   ` (7 preceding siblings ...)
  2008-01-03 15:24 ` [PATCH CPA] [9/28] CPA: Add simple self test at boot Andi Kleen
@ 2008-01-03 15:24 ` Andi Kleen
  2008-01-03 15:24 ` [PATCH CPA] [11/28] CPA: Change 32bit back to init_mm semaphore locking Andi Kleen
                   ` (18 subsequent siblings)
  27 siblings, 0 replies; 51+ messages in thread
From: Andi Kleen @ 2008-01-03 15:24 UTC (permalink / raw)
  To: linux-kernel


CONFIG_DEBUG_PAGEALLOC uses change_page_attr() to map and unmap pages in order
to catch stray kernel accesses. But the standard c_p_a() does a lot of
unnecessary work for this simple case, where the mappings are already pre-split.

Change kernel_map_pages() to just access the page table directly, which
is simpler and faster.

I also fixed it to use INVLPG if available.

This is required for later changes to c_p_a() that make it use kmalloc(). Without
this we would risk infinite recursion. Also, in general, things are easier when
sleeping is allowed.
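
Condensed sketch of the new per-page path (simplified from the hunk below):

	int level;
	pte_t *pte = lookup_address(addr, &level);	/* bootup forced 4K pages */

	if (enable) {
		set_pte_atomic(pte, mk_pte(page, PAGE_KERNEL));
		__flush_tlb_one(addr);			/* INVLPG just this page */
	} else
		kpte_clear_flush(pte, addr);		/* clear the pte and flush it */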

Signed-off-by: Andi Kleen <ak@suse.de>

---
 arch/x86/mm/pageattr_32.c |   34 ++++++++++++++++++++++++----------
 1 file changed, 24 insertions(+), 10 deletions(-)

Index: linux/arch/x86/mm/pageattr_32.c
===================================================================
--- linux.orig/arch/x86/mm/pageattr_32.c
+++ linux/arch/x86/mm/pageattr_32.c
@@ -258,22 +258,36 @@ void global_flush_tlb(void)
 }
 
 #ifdef CONFIG_DEBUG_PAGEALLOC
+/* Map or unmap pages in the kernel direct mapping for kernel debugging. */
 void kernel_map_pages(struct page *page, int numpages, int enable)
 {
+	unsigned long addr;
+	int i;
+
 	if (PageHighMem(page))
 		return;
+ 	addr = (unsigned long)page_address(page);
 	if (!enable)
-		debug_check_no_locks_freed(page_address(page),
-					   numpages * PAGE_SIZE);
+		debug_check_no_locks_freed((void *)addr, numpages * PAGE_SIZE);
+
+	/* Bootup has forced 4K pages so this is very simple */
+
+	for (i = 0; i < numpages; i++, addr += PAGE_SIZE, page++) {
+		int level;
+		pte_t *pte = lookup_address(addr, &level);
 
-	/* the return value is ignored - the calls cannot fail,
-	 * large pages are disabled at boot time.
-	 */
-	change_page_attr(page, numpages, enable ? PAGE_KERNEL : __pgprot(0));
-	/* we should perform an IPI and flush all tlbs,
-	 * but that can deadlock->flush only current cpu.
-	 */
-	__flush_tlb_all();
+		BUG_ON(level != 3);
+		if (enable) {
+			set_pte_atomic(pte, mk_pte(page, PAGE_KERNEL));
+			/*
+			 * We should perform an IPI and flush all tlbs,
+			 * but that can deadlock->flush only current cpu.
+		 	 */
+			__flush_tlb_one(addr);
+		} else {
+			kpte_clear_flush(pte, addr);
+		}
+	}
 }
 #endif
 


* [PATCH CPA] [11/28] CPA: Change 32bit back to init_mm semaphore locking
  2008-01-03 15:24 [PATCH CPA] [1/28] Shrink __PAGE_KERNEL/__PAGE_KERNEL_EXEC on non PAE kernels Andi Kleen
                   ` (8 preceding siblings ...)
  2008-01-03 15:24 ` [PATCH CPA] [10/28] CPA: Change kernel_map_pages to not use c_p_a() Andi Kleen
@ 2008-01-03 15:24 ` Andi Kleen
  2008-01-03 15:24 ` [PATCH CPA] [12/28] CPA: CLFLUSH support in change_page_attr() Andi Kleen
                   ` (17 subsequent siblings)
  27 siblings, 0 replies; 51+ messages in thread
From: Andi Kleen @ 2008-01-03 15:24 UTC (permalink / raw)
  To: linux-kernel


Now that debug pagealloc uses a separate function, it is better to change the
standard change_page_attr() back to init_mm semaphore locking, like 64-bit.
Various things are simpler when sleeping is allowed.
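
The resulting locking pattern (as in the hunks below) is simply:

	down_write(&init_mm.mmap_sem);
	/* ... walk and modify kernel page tables, may sleep ... */
	up_write(&init_mm.mmap_sem);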

Signed-off-by: Andi Kleen <ak@suse.de>

---
 arch/x86/mm/pageattr_32.c |   14 +++++---------
 1 file changed, 5 insertions(+), 9 deletions(-)

Index: linux/arch/x86/mm/pageattr_32.c
===================================================================
--- linux.orig/arch/x86/mm/pageattr_32.c
+++ linux/arch/x86/mm/pageattr_32.c
@@ -14,10 +14,9 @@
 #include <asm/pgalloc.h>
 #include <asm/sections.h>
 
-static DEFINE_SPINLOCK(cpa_lock);
+/* Protected by init_mm.mmap_sem */
 static struct list_head df_list = LIST_HEAD_INIT(df_list);
 
-
 pte_t *lookup_address(unsigned long address, int *level)
 { 
 	pgd_t *pgd = pgd_offset_k(address);
@@ -46,9 +45,7 @@ static struct page *split_large_page(uns
 	struct page *base;
 	pte_t *pbase;
 
-	spin_unlock_irq(&cpa_lock);
 	base = alloc_pages(GFP_KERNEL, 0);
-	spin_lock_irq(&cpa_lock);
 	if (!base) 
 		return NULL;
 
@@ -224,15 +221,14 @@ int change_page_attr(struct page *page, 
 {
 	int err = 0; 
 	int i; 
-	unsigned long flags;
 
-	spin_lock_irqsave(&cpa_lock, flags);
+	down_write(&init_mm.mmap_sem);
 	for (i = 0; i < numpages; i++, page++) { 
 		err = __change_page_attr(page, prot);
 		if (err) 
 			break; 
 	} 	
-	spin_unlock_irqrestore(&cpa_lock, flags);
+	up_write(&init_mm.mmap_sem);
 	return err;
 }
 
@@ -243,9 +239,9 @@ void global_flush_tlb(void)
 
 	BUG_ON(irqs_disabled());
 
-	spin_lock_irq(&cpa_lock);
+	down_write(&init_mm.mmap_sem);
 	list_replace_init(&df_list, &l);
-	spin_unlock_irq(&cpa_lock);
+	up_write(&init_mm.mmap_sem);
 	flush_map(&l);
 	list_for_each_entry_safe(pg, next, &l, lru) {
 		list_del(&pg->lru);


* [PATCH CPA] [12/28] CPA: CLFLUSH support in change_page_attr()
  2008-01-03 15:24 [PATCH CPA] [1/28] Shrink __PAGE_KERNEL/__PAGE_KERNEL_EXEC on non PAE kernels Andi Kleen
                   ` (9 preceding siblings ...)
  2008-01-03 15:24 ` [PATCH CPA] [11/28] CPA: Change 32bit back to init_mm semaphore locking Andi Kleen
@ 2008-01-03 15:24 ` Andi Kleen
  2008-01-03 15:24 ` [PATCH CPA] [13/28] CPA: Use macros to modify the PG_arch_1 page flags in change_page_attr Andi Kleen
                   ` (16 subsequent siblings)
  27 siblings, 0 replies; 51+ messages in thread
From: Andi Kleen @ 2008-01-03 15:24 UTC (permalink / raw)
  To: linux-kernel


Queue individual data pages for flushing with CLFLUSH in change_page_attr(),
instead of doing global WBINVDs. WBINVD is a very painful operation
for the CPU and quite slow too. Worse, it is not interruptible and
can cause long latencies on hypervisors on older Intel VT systems.

CLFLUSH, on the other hand, only flushes the cache lines that actually need to be
flushed, and since it works in smaller chunks it is more preemptible.

To do this, c_p_a() needs to save the addresses to be flushed for global_flush_tlb()
later. This is done using a separate data structure, not struct page,
because page->lru is often in use, or not there at all for memory holes.

Also, the flushes are now done in FIFO order, not LIFO.
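
Condensed sketch of the deferred flush flow after this patch (simplified
from the hunks below; the 32-bit model check and error handling are left out):

	/* __change_page_attr(): queue the address, or force a full flush */
	set_tlb_flush(address);

	/* global_flush_tlb() -> flush_kernel_map(), run on each CPU: */
	if (a->full_flush)
		wbinvd();			/* no CLFLUSH, or kmalloc failed */
	else
		list_for_each_entry(f, &a->l, l)
			clflush_cache_range((void *)f->addr, PAGE_SIZE);
	__flush_tlb_all();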

Signed-off-by: Andi Kleen <ak@suse.de>

---
 arch/x86/mm/pageattr_32.c |   78 ++++++++++++++++++++++++++++++++++------------
 arch/x86/mm/pageattr_64.c |   77 ++++++++++++++++++++++++++++++++++-----------
 2 files changed, 118 insertions(+), 37 deletions(-)

Index: linux/arch/x86/mm/pageattr_64.c
===================================================================
--- linux.orig/arch/x86/mm/pageattr_64.c
+++ linux/arch/x86/mm/pageattr_64.c
@@ -13,6 +13,11 @@
 #include <asm/tlbflush.h>
 #include <asm/io.h>
 
+struct flush {
+	struct list_head l;
+	unsigned long addr;
+};
+
 pte_t *lookup_address(unsigned long address, int *level)
 { 
 	pgd_t *pgd = pgd_offset_k(address);
@@ -63,6 +68,11 @@ static struct page *split_large_page(uns
 	return base;
 } 
 
+struct flush_arg {
+	int full_flush;
+	struct list_head l;
+};
+
 void clflush_cache_range(void *adr, int size)
 {
 	int i;
@@ -72,27 +82,27 @@ void clflush_cache_range(void *adr, int 
 
 static void flush_kernel_map(void *arg)
 {
-	struct list_head *l = (struct list_head *)arg;
-	struct page *pg;
+	struct flush_arg *a = (struct flush_arg *)arg;
+	struct flush *f;
+
+	if (!cpu_has_clflush)
+		a->full_flush = 1;
 
 	/* When clflush is available always use it because it is
 	   much cheaper than WBINVD. */
-	/* clflush is still broken. Disable for now. */
-	if (1 || !cpu_has_clflush)
+	if (a->full_flush)
 		asm volatile("wbinvd" ::: "memory");
-	else list_for_each_entry(pg, l, lru) {
-		void *adr = page_address(pg);
-		clflush_cache_range(adr, PAGE_SIZE);
+	list_for_each_entry(f, &a->l, l) {
+		if (!a->full_flush)
+			clflush_cache_range((void *)f->addr, PAGE_SIZE);
 	}
 	__flush_tlb_all();
 }
 
-static inline void flush_map(struct list_head *l)
-{	
-	on_each_cpu(flush_kernel_map, l, 1, 1);
-}
-
-static LIST_HEAD(deferred_pages); /* protected by init_mm.mmap_sem */
+/* both protected by init_mm.mmap_sem */
+static int full_flush;
+static LIST_HEAD(deferred_pages);
+static LIST_HEAD(flush_pages);
 
 static inline void save_page(struct page *fpage)
 {
@@ -124,6 +134,25 @@ static void revert_page(unsigned long ad
 	set_pte((pte_t *)pmd, large_pte);
 }      
 
+/*
+ * Mark the address for flushing later in global_tlb_flush().
+ *
+ * Other parts of the kernel are already in a feeding frenzy over the various
+ * struct page fields. Instead of trying to compete allocate a separate
+ * data structure to keep track of the flush. This has the added bonus that it will
+ * work for MMIO holes without mem_map too.
+ */
+static void set_tlb_flush(unsigned long address)
+{
+	struct flush *f = kmalloc(sizeof(struct flush), GFP_KERNEL);
+	if (!f) {
+		full_flush = 1;
+		return;
+	}
+	f->addr = address;
+	list_add_tail(&f->l, &flush_pages);
+}
+
 static int
 __change_page_attr(unsigned long address, unsigned long pfn, pgprot_t prot,
 				   pgprot_t ref_prot)
@@ -136,8 +165,11 @@ __change_page_attr(unsigned long address
 	kpte = lookup_address(address, &level);
 	if (!kpte) return 0;
 	kpte_page = virt_to_page(((unsigned long)kpte) & PAGE_MASK);
-	BUG_ON(PageLRU(kpte_page));
 	BUG_ON(PageCompound(kpte_page));
+	BUG_ON(PageLRU(kpte_page));
+
+	set_tlb_flush(address);
+
 	if (pgprot_val(prot) != pgprot_val(ref_prot)) { 
 		if (!pte_huge(*kpte)) {
 			set_pte(kpte, pfn_pte(pfn, prot));
@@ -231,7 +263,9 @@ int change_page_attr(struct page *page, 
 void global_flush_tlb(void)
 { 
 	struct page *pg, *next;
-	struct list_head l;
+	struct flush *f, *fnext;
+	struct flush_arg arg;
+	struct list_head free_pages;
 
 	/*
 	 * Write-protect the semaphore, to exclude two contexts
@@ -239,12 +273,19 @@ void global_flush_tlb(void)
 	 * exclude new additions to the deferred_pages list:
 	 */
 	down_write(&init_mm.mmap_sem);
-	list_replace_init(&deferred_pages, &l);
+	arg.full_flush = full_flush;
+	full_flush = 0;
+	list_replace_init(&flush_pages, &arg.l);
+	list_replace_init(&deferred_pages, &free_pages);
 	up_write(&init_mm.mmap_sem);
 
-	flush_map(&l);
+	on_each_cpu(flush_kernel_map, &arg, 1, 1);
+
+	list_for_each_entry_safe(f, fnext, &arg.l, l) {
+		kfree(f);
+	}
 
-	list_for_each_entry_safe(pg, next, &l, lru) {
+	list_for_each_entry_safe(pg, next, &free_pages, lru) {
 		list_del(&pg->lru);
 		clear_bit(PG_arch_1, &pg->flags);
 		if (page_private(pg) != 0)
Index: linux/arch/x86/mm/pageattr_32.c
===================================================================
--- linux.orig/arch/x86/mm/pageattr_32.c
+++ linux/arch/x86/mm/pageattr_32.c
@@ -15,8 +15,15 @@
 #include <asm/sections.h>
 
 /* Protected by init_mm.mmap_sem */
+/* Variables protected by cpa_lock */
+static int full_flush;
 static struct list_head df_list = LIST_HEAD_INIT(df_list);
+static LIST_HEAD(flush_pages);
 
+struct flush {
+	struct list_head l;
+	unsigned long addr;
+};
 pte_t *lookup_address(unsigned long address, int *level)
 { 
 	pgd_t *pgd = pgd_offset_k(address);
@@ -67,25 +74,31 @@ static struct page *split_large_page(uns
 	return base;
 } 
 
-static void cache_flush_page(struct page *p)
+struct flush_arg {
+	int full_flush;
+	struct list_head l;
+};
+
+void clflush_cache_range(void *adr, int size)
 { 
-	void *adr = page_address(p);
 	int i;
-	for (i = 0; i < PAGE_SIZE; i += boot_cpu_data.x86_clflush_size)
+	for (i = 0; i < size; i += boot_cpu_data.x86_clflush_size)
 		clflush(adr+i);
 }
 
 static void flush_kernel_map(void *arg)
 {
-	struct list_head *lh = (struct list_head *)arg;
-	struct page *p;
+	struct flush_arg *a = (struct flush_arg *)arg;
+	struct flush *f;
 
-	/* High level code is not ready for clflush yet */
-	if (0 && cpu_has_clflush) {
-		list_for_each_entry (p, lh, lru)
-			cache_flush_page(p);
-	} else if (boot_cpu_data.x86_model >= 4)
+	if (!cpu_has_clflush)
+		a->full_flush = 1;
+	if (a->full_flush && boot_cpu_data.x86_model >= 4)
 		wbinvd();
+	list_for_each_entry(f, &a->l, l) {
+		if (!a->full_flush)
+			clflush_cache_range((void *)f->addr, PAGE_SIZE);
+	}
 
 	/* Flush all to work around Errata in early athlons regarding 
 	 * large page flushing. 
@@ -141,6 +154,25 @@ static inline void save_page(struct page
 		list_add(&kpte_page->lru, &df_list);
 }
 
+/*
+ * Mark the address for flushing later in global_tlb_flush().
+ *
+ * Other parts of the kernel are already in a feeding frenzy over the various
+ * struct page fields. Instead of trying to compete allocate a separate
+ * data structure to keep track of the flush. This has the added bonus that it will
+ * work for MMIO holes without mem_map too.
+ */
+static void set_tlb_flush(unsigned long address)
+{
+	struct flush *f = kmalloc(sizeof(struct flush), GFP_KERNEL);
+	if (!f) {
+		full_flush = 1;
+		return;
+	}
+	f->addr = address;
+	list_add_tail(&f->l, &flush_pages);
+}
+
 static int
 __change_page_attr(struct page *page, pgprot_t prot)
 { 
@@ -159,6 +191,8 @@ __change_page_attr(struct page *page, pg
 	BUG_ON(PageLRU(kpte_page));
 	BUG_ON(PageCompound(kpte_page));
 
+	set_tlb_flush(address);
+
 	if (pgprot_val(prot) != pgprot_val(PAGE_KERNEL)) { 
 		if (!pte_huge(*kpte)) {
 			set_pte_atomic(kpte, mk_pte(page, prot)); 
@@ -199,11 +233,6 @@ __change_page_attr(struct page *page, pg
 	return 0;
 } 
 
-static inline void flush_map(struct list_head *l)
-{
-	on_each_cpu(flush_kernel_map, l, 1, 1);
-}
-
 /*
  * Change the page attributes of an page in the linear mapping.
  *
@@ -234,16 +263,27 @@ int change_page_attr(struct page *page, 
 
 void global_flush_tlb(void)
 {
-	struct list_head l;
+	struct flush_arg arg;
 	struct page *pg, *next;
+	struct flush *f, *fnext;
+	struct list_head free_pages;
 
 	BUG_ON(irqs_disabled());
 
 	down_write(&init_mm.mmap_sem);
-	list_replace_init(&df_list, &l);
+	arg.full_flush = full_flush;
+	full_flush = 0;
+	list_replace_init(&flush_pages, &arg.l);
+	list_replace_init(&df_list, &free_pages);
 	up_write(&init_mm.mmap_sem);
-	flush_map(&l);
-	list_for_each_entry_safe(pg, next, &l, lru) {
+
+	on_each_cpu(flush_kernel_map, &arg, 1, 1);
+
+	list_for_each_entry_safe(f, fnext, &arg.l, l) {
+		kfree(f);
+	}
+
+	list_for_each_entry_safe(pg, next, &free_pages, lru) {
 		list_del(&pg->lru);
 		clear_bit(PG_arch_1, &pg->flags);
 		if (PageReserved(pg) || !cpu_has_pse || page_private(pg) != 0)

^ permalink raw reply	[flat|nested] 51+ messages in thread

* [PATCH CPA] [13/28] CPA: Use macros to modify the PG_arch_1 page flags in change_page_attr
  2008-01-03 15:24 [PATCH CPA] [1/28] Shrink __PAGE_KERNEL/__PAGE_KERNEL_EXEC on non PAE kernels Andi Kleen
                   ` (10 preceding siblings ...)
  2008-01-03 15:24 ` [PATCH CPA] [12/28] CPA: CLFLUSH support in change_page_attr() Andi Kleen
@ 2008-01-03 15:24 ` Andi Kleen
  2008-01-03 15:24 ` [PATCH CPA] [14/28] CPA: Use page granuality TLB flushing " Andi Kleen
                   ` (15 subsequent siblings)
  27 siblings, 0 replies; 51+ messages in thread
From: Andi Kleen @ 2008-01-03 15:24 UTC (permalink / raw)
  To: linux-kernel


Instead of open coding the bit accesses, use the standard style
*PageDeferred* macros.

Signed-off-by: Andi Kleen <ak@suse.de>

---
 arch/x86/mm/pageattr_32.c |   14 ++++++++++----
 arch/x86/mm/pageattr_64.c |   11 +++++++++--
 2 files changed, 19 insertions(+), 6 deletions(-)

Index: linux/arch/x86/mm/pageattr_64.c
===================================================================
--- linux.orig/arch/x86/mm/pageattr_64.c
+++ linux/arch/x86/mm/pageattr_64.c
@@ -18,6 +18,13 @@ struct flush {
 	unsigned long addr;
 };
 
+#define PG_deferred PG_arch_1
+
+#define PageDeferred(p) test_bit(PG_deferred, &(p)->flags)
+#define SetPageDeferred(p) set_bit(PG_deferred, &(p)->flags)
+#define ClearPageDeferred(p) clear_bit(PG_deferred, &(p)->flags)
+#define TestSetPageDeferred(p) test_and_set_bit(PG_deferred, &(p)->flags)
+
 pte_t *lookup_address(unsigned long address, int *level)
 { 
 	pgd_t *pgd = pgd_offset_k(address);
@@ -106,7 +113,7 @@ static LIST_HEAD(flush_pages);
 
 static inline void save_page(struct page *fpage)
 {
-	if (!test_and_set_bit(PG_arch_1, &fpage->flags))
+	if (!TestSetPageDeferred(fpage))
 		list_add(&fpage->lru, &deferred_pages);
 }
 
@@ -287,7 +294,7 @@ void global_flush_tlb(void)
 
 	list_for_each_entry_safe(pg, next, &free_pages, lru) {
 		list_del(&pg->lru);
-		clear_bit(PG_arch_1, &pg->flags);
+		ClearPageDeferred(pg);
 		if (page_private(pg) != 0)
 			continue;
 		ClearPagePrivate(pg);
Index: linux/arch/x86/mm/pageattr_32.c
===================================================================
--- linux.orig/arch/x86/mm/pageattr_32.c
+++ linux/arch/x86/mm/pageattr_32.c
@@ -14,8 +14,14 @@
 #include <asm/pgalloc.h>
 #include <asm/sections.h>
 
-/* Protected by init_mm.mmap_sem */
-/* Variables protected by cpa_lock */
+#define PG_deferred PG_arch_1
+
+#define PageDeferred(p) test_bit(PG_deferred, &(p)->flags)
+#define SetPageDeferred(p) set_bit(PG_deferred, &(p)->flags)
+#define ClearPageDeferred(p) clear_bit(PG_deferred, &(p)->flags)
+#define TestSetPageDeferred(p) test_and_set_bit(PG_deferred, &(p)->flags)
+
+/* Variables protected by init_mm.mmap_sem */
 static int full_flush;
 static struct list_head df_list = LIST_HEAD_INIT(df_list);
 static LIST_HEAD(flush_pages);
@@ -150,7 +156,7 @@ static inline void revert_page(struct pa
 
 static inline void save_page(struct page *kpte_page)
 {
-	if (!test_and_set_bit(PG_arch_1, &kpte_page->flags))
+	if (!TestSetPageDeferred(kpte_page))
 		list_add(&kpte_page->lru, &df_list);
 }
 
@@ -285,7 +291,7 @@ void global_flush_tlb(void)
 
 	list_for_each_entry_safe(pg, next, &free_pages, lru) {
 		list_del(&pg->lru);
-		clear_bit(PG_arch_1, &pg->flags);
+		ClearPageDeferred(pg);
 		if (PageReserved(pg) || !cpu_has_pse || page_private(pg) != 0)
 			continue;
 		ClearPagePrivate(pg);

^ permalink raw reply	[flat|nested] 51+ messages in thread

* [PATCH CPA] [14/28] CPA: Use page granuality TLB flushing in change_page_attr
  2008-01-03 15:24 [PATCH CPA] [1/28] Shrink __PAGE_KERNEL/__PAGE_KERNEL_EXEC on non PAE kernels Andi Kleen
                   ` (11 preceding siblings ...)
  2008-01-03 15:24 ` [PATCH CPA] [13/28] CPA: Use macros to modify the PG_arch_1 page flags in change_page_attr Andi Kleen
@ 2008-01-03 15:24 ` Andi Kleen
  2008-01-03 15:24 ` [PATCH CPA] [15/28] CPA: Don't flush the caches when the CPU supports self-snoop Andi Kleen
                   ` (14 subsequent siblings)
  27 siblings, 0 replies; 51+ messages in thread
From: Andi Kleen @ 2008-01-03 15:24 UTC (permalink / raw)
  To: linux-kernel


With the infrastructure added for CLFLUSH it is possible
to TLB flush only the pages actually changed in change_page_attr().

Take care of the old Athlon K7 errata on the 32bit version.
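
For illustration only -- a minimal standalone model of the resulting flush
decision, not code from this patch; flush_one()/flush_all() stand in for
__flush_tlb_one()/__flush_tlb_all() and "early_athlon" mirrors the 32bit
errata fallback below:

#include <stdio.h>

struct flush { unsigned long addr; };

static void flush_one(unsigned long addr) { printf("INVLPG %#lx\n", addr); }
static void flush_all(void)               { printf("full TLB flush\n"); }

/* Per-page INVLPG unless a full flush was requested (or the early
 * Athlon large page erratum forces one on 32bit). */
static void flush_model(const struct flush *f, int n, int full, int early_athlon)
{
	int i;

	for (i = 0; i < n; i++)
		if (!full)
			flush_one(f[i].addr);
	if (full || early_athlon)
		flush_all();
}

int main(void)
{
	struct flush f[] = { { 0xc0100000UL }, { 0xc0101000UL } };

	flush_model(f, 2, 0, 0);	/* page granular flush */
	flush_model(f, 2, 1, 0);	/* full flush fallback */
	return 0;
}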

Signed-off-by: Andi Kleen <ak@suse.de>

---
 arch/x86/mm/pageattr_32.c |   15 ++++++++-------
 arch/x86/mm/pageattr_64.c |   10 +++++-----
 2 files changed, 13 insertions(+), 12 deletions(-)

Index: linux/arch/x86/mm/pageattr_32.c
===================================================================
--- linux.orig/arch/x86/mm/pageattr_32.c
+++ linux/arch/x86/mm/pageattr_32.c
@@ -97,19 +97,20 @@ static void flush_kernel_map(void *arg)
 	struct flush_arg *a = (struct flush_arg *)arg;
 	struct flush *f;
 
-	if (!cpu_has_clflush)
-		a->full_flush = 1;
-	if (a->full_flush && boot_cpu_data.x86_model >= 4)
+	if ((!cpu_has_clflush || a->full_flush) && boot_cpu_data.x86_model >= 4)
 		wbinvd();
 	list_for_each_entry(f, &a->l, l) {
 		if (!a->full_flush)
 			clflush_cache_range((void *)f->addr, PAGE_SIZE);
+		if (!a->full_flush)
+			__flush_tlb_one(f->addr);
 	}
 
-	/* Flush all to work around Errata in early athlons regarding 
-	 * large page flushing. 
-	 */
-	__flush_tlb_all(); 	
+	/* Handle errata with flushing large pages in early Athlons */
+	if (a->full_flush ||
+	    (boot_cpu_data.x86_vendor == X86_VENDOR_AMD &&
+	     boot_cpu_data.x86 == 7))
+		__flush_tlb_all();
 }
 
 static void set_pmd_pte(pte_t *kpte, unsigned long address, pte_t pte) 
Index: linux/arch/x86/mm/pageattr_64.c
===================================================================
--- linux.orig/arch/x86/mm/pageattr_64.c
+++ linux/arch/x86/mm/pageattr_64.c
@@ -92,18 +92,18 @@ static void flush_kernel_map(void *arg)
 	struct flush_arg *a = (struct flush_arg *)arg;
 	struct flush *f;
 
-	if (!cpu_has_clflush)
-		a->full_flush = 1;
-
 	/* When clflush is available always use it because it is
 	   much cheaper than WBINVD. */
-	if (a->full_flush)
+	if (a->full_flush || !cpu_has_clflush)
 		asm volatile("wbinvd" ::: "memory");
 	list_for_each_entry(f, &a->l, l) {
 		if (!a->full_flush)
 			clflush_cache_range((void *)f->addr, PAGE_SIZE);
+		if (!a->full_flush)
+			__flush_tlb_one(f->addr);
 	}
-	__flush_tlb_all();
+	if (a->full_flush)
+		__flush_tlb_all();
 }
 
 /* both protected by init_mm.mmap_sem */

^ permalink raw reply	[flat|nested] 51+ messages in thread

* [PATCH CPA] [15/28] CPA: Don't flush the caches when the CPU supports self-snoop
  2008-01-03 15:24 [PATCH CPA] [1/28] Shrink __PAGE_KERNEL/__PAGE_KERNEL_EXEC on non PAE kernels Andi Kleen
                   ` (12 preceding siblings ...)
  2008-01-03 15:24 ` [PATCH CPA] [14/28] CPA: Use page granuality TLB flushing " Andi Kleen
@ 2008-01-03 15:24 ` Andi Kleen
  2008-01-03 15:24 ` [PATCH CPA] [16/28] CPA: Use wbinvd() macro instead of inline assembly in 64bit c_p_a() Andi Kleen
                   ` (13 subsequent siblings)
  27 siblings, 0 replies; 51+ messages in thread
From: Andi Kleen @ 2008-01-03 15:24 UTC (permalink / raw)
  To: linux-kernel


When the self-snoop CPUID bit is set change_page_attr() only needs to flush
the TLBs, not the caches.

The description of self-snoop in the Intel manuals is a bit vague,
but I got confirmation that this is what SS really means.

This should improve c_p_a() performance significantly on newer
Intel CPUs.
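
A minimal sketch of that decision (stand-in helpers, not kernel code;
has_ss plays the role of cpu_has_ss):

#include <stdio.h>

static void flush_tlb_stub(void)   { printf("flush TLB\n"); }
static void flush_cache_stub(void) { printf("flush cache\n"); }

/* With self-snoop the caches stay coherent across attribute changes,
 * so only the TLB flush remains necessary. */
static void cpa_flush(int has_ss)
{
	flush_tlb_stub();
	if (!has_ss)
		flush_cache_stub();
}

int main(void)
{
	cpa_flush(0);	/* older CPU: TLB + cache flush */
	cpa_flush(1);	/* SS CPU:    TLB flush only    */
	return 0;
}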

Signed-off-by: Andi Kleen <ak@suse.de>

---
 arch/x86/mm/pageattr_32.c    |    5 +++--
 arch/x86/mm/pageattr_64.c    |    4 ++--
 include/asm-x86/cpufeature.h |    1 +
 3 files changed, 6 insertions(+), 4 deletions(-)

Index: linux/arch/x86/mm/pageattr_32.c
===================================================================
--- linux.orig/arch/x86/mm/pageattr_32.c
+++ linux/arch/x86/mm/pageattr_32.c
@@ -97,10 +97,11 @@ static void flush_kernel_map(void *arg)
 	struct flush_arg *a = (struct flush_arg *)arg;
 	struct flush *f;
 
-	if ((!cpu_has_clflush || a->full_flush) && boot_cpu_data.x86_model >= 4)
+	if ((!cpu_has_clflush || a->full_flush) && boot_cpu_data.x86_model >= 4 &&
+		!cpu_has_ss)
 		wbinvd();
 	list_for_each_entry(f, &a->l, l) {
-		if (!a->full_flush)
+		if (!a->full_flush && !cpu_has_ss)
 			clflush_cache_range((void *)f->addr, PAGE_SIZE);
 		if (!a->full_flush)
 			__flush_tlb_one(f->addr);
Index: linux/include/asm-x86/cpufeature.h
===================================================================
--- linux.orig/include/asm-x86/cpufeature.h
+++ linux/include/asm-x86/cpufeature.h
@@ -167,6 +167,7 @@
 #define cpu_has_pebs		boot_cpu_has(X86_FEATURE_PEBS)
 #define cpu_has_clflush		boot_cpu_has(X86_FEATURE_CLFLSH)
 #define cpu_has_bts		boot_cpu_has(X86_FEATURE_BTS)
+#define cpu_has_ss 		boot_cpu_has(X86_FEATURE_SELFSNOOP)
 
 #if defined(CONFIG_X86_INVLPG) || defined(CONFIG_X86_64)
 # define cpu_has_invlpg		1
Index: linux/arch/x86/mm/pageattr_64.c
===================================================================
--- linux.orig/arch/x86/mm/pageattr_64.c
+++ linux/arch/x86/mm/pageattr_64.c
@@ -94,10 +94,10 @@ static void flush_kernel_map(void *arg)
 
 	/* When clflush is available always use it because it is
 	   much cheaper than WBINVD. */
-	if (a->full_flush || !cpu_has_clflush)
+	if ((a->full_flush || !cpu_has_clflush) && !cpu_has_ss)
 		asm volatile("wbinvd" ::: "memory");
 	list_for_each_entry(f, &a->l, l) {
-		if (!a->full_flush)
+		if (!a->full_flush && !cpu_has_ss)
 			clflush_cache_range((void *)f->addr, PAGE_SIZE);
 		if (!a->full_flush)
 			__flush_tlb_one(f->addr);

^ permalink raw reply	[flat|nested] 51+ messages in thread

* [PATCH CPA] [16/28] CPA: Use wbinvd() macro instead of inline assembly in 64bit c_p_a()
  2008-01-03 15:24 [PATCH CPA] [1/28] Shrink __PAGE_KERNEL/__PAGE_KERNEL_EXEC on non PAE kernels Andi Kleen
                   ` (13 preceding siblings ...)
  2008-01-03 15:24 ` [PATCH CPA] [15/28] CPA: Don't flush the caches when the CPU supports self-snoop Andi Kleen
@ 2008-01-03 15:24 ` Andi Kleen
  2008-01-03 15:24 ` [PATCH CPA] [17/28] CPA: Reorder TLB / cache flushes to follow Intel recommendation Andi Kleen
                   ` (12 subsequent siblings)
  27 siblings, 0 replies; 51+ messages in thread
From: Andi Kleen @ 2008-01-03 15:24 UTC (permalink / raw)
  To: linux-kernel


Signed-off-by: Andi Kleen <ak@suse.de>

---
 arch/x86/mm/pageattr_64.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux/arch/x86/mm/pageattr_64.c
===================================================================
--- linux.orig/arch/x86/mm/pageattr_64.c
+++ linux/arch/x86/mm/pageattr_64.c
@@ -95,7 +95,7 @@ static void flush_kernel_map(void *arg)
 	/* When clflush is available always use it because it is
 	   much cheaper than WBINVD. */
 	if ((a->full_flush || !cpu_has_clflush) && !cpu_has_ss)
-		asm volatile("wbinvd" ::: "memory");
+		wbinvd();
 	list_for_each_entry(f, &a->l, l) {
 		if (!a->full_flush && !cpu_has_ss)
 			clflush_cache_range((void *)f->addr, PAGE_SIZE);

^ permalink raw reply	[flat|nested] 51+ messages in thread

* [PATCH CPA] [17/28] CPA: Reorder TLB / cache flushes to follow Intel recommendation
  2008-01-03 15:24 [PATCH CPA] [1/28] Shrink __PAGE_KERNEL/__PAGE_KERNEL_EXEC on non PAE kernels Andi Kleen
                   ` (14 preceding siblings ...)
  2008-01-03 15:24 ` [PATCH CPA] [16/28] CPA: Use wbinvd() macro instead of inline assembly in 64bit c_p_a() Andi Kleen
@ 2008-01-03 15:24 ` Andi Kleen
  2008-01-03 15:24 ` [PATCH CPA] [18/28] CPA: Make change_page_attr() more robust against use of PAT bits Andi Kleen
                   ` (11 subsequent siblings)
  27 siblings, 0 replies; 51+ messages in thread
From: Andi Kleen @ 2008-01-03 15:24 UTC (permalink / raw)
  To: linux-kernel


Intel recommends flushing the TLBs first and then the caches
on caching attribute changes. c_p_a() previously did it the
other way round. Reorder that.

The procedure is still not fully compliant with the Intel documentation
because Intel recommends an all-CPU synchronization step between
the TLB flushes and the cache flushes.

However, on all new Intel CPUs this is moot anyway because they
support Self-Snoop and can skip the cache flush step entirely.
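
A rough standalone sketch of the new ordering (printf stubs, not kernel
code); the all-CPU synchronization step is only noted, matching the
RED-PEN comments in the hunks below:

#include <stdio.h>

static void tlb_flush(void)   { printf("1. flush TLBs\n"); }
static void cache_flush(void) { printf("3. clflush/wbinvd\n"); }

int main(void)
{
	tlb_flush();					/* invalidate translations first */
	printf("2. (all-CPU sync step omitted)\n");
	cache_flush();					/* then flush the caches         */
	return 0;
}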

Signed-off-by: Andi Kleen <ak@suse.de>

---
 arch/x86/mm/pageattr_32.c |   13 ++++++++++---
 arch/x86/mm/pageattr_64.c |   14 ++++++++++----
 2 files changed, 20 insertions(+), 7 deletions(-)

Index: linux/arch/x86/mm/pageattr_32.c
===================================================================
--- linux.orig/arch/x86/mm/pageattr_32.c
+++ linux/arch/x86/mm/pageattr_32.c
@@ -97,9 +97,6 @@ static void flush_kernel_map(void *arg)
 	struct flush_arg *a = (struct flush_arg *)arg;
 	struct flush *f;
 
-	if ((!cpu_has_clflush || a->full_flush) && boot_cpu_data.x86_model >= 4 &&
-		!cpu_has_ss)
-		wbinvd();
 	list_for_each_entry(f, &a->l, l) {
 		if (!a->full_flush && !cpu_has_ss)
 			clflush_cache_range((void *)f->addr, PAGE_SIZE);
@@ -112,6 +109,16 @@ static void flush_kernel_map(void *arg)
 	    (boot_cpu_data.x86_vendor == X86_VENDOR_AMD &&
 	     boot_cpu_data.x86 == 7))
 		__flush_tlb_all();
+
+	/*
+	 * RED-PEN: Intel documentation ask for a CPU synchronization step
+	 * here and in the loop. But it is moot on Self-Snoop CPUs anyways.
+	 */
+
+	if ((!cpu_has_clflush || a->full_flush) &&
+	    !cpu_has_ss && boot_cpu_data.x86_model >= 4)
+		wbinvd();
+
 }
 
 static void set_pmd_pte(pte_t *kpte, unsigned long address, pte_t pte) 
Index: linux/arch/x86/mm/pageattr_64.c
===================================================================
--- linux.orig/arch/x86/mm/pageattr_64.c
+++ linux/arch/x86/mm/pageattr_64.c
@@ -94,16 +94,22 @@ static void flush_kernel_map(void *arg)
 
 	/* When clflush is available always use it because it is
 	   much cheaper than WBINVD. */
-	if ((a->full_flush || !cpu_has_clflush) && !cpu_has_ss)
-		wbinvd();
 	list_for_each_entry(f, &a->l, l) {
-		if (!a->full_flush && !cpu_has_ss)
-			clflush_cache_range((void *)f->addr, PAGE_SIZE);
 		if (!a->full_flush)
 			__flush_tlb_one(f->addr);
+		if (!a->full_flush && !cpu_has_ss)
+			clflush_cache_range((void *)f->addr, PAGE_SIZE);
 	}
 	if (a->full_flush)
 		__flush_tlb_all();
+
+	/*
+	 * RED-PEN: Intel documentation ask for a CPU synchronization step
+	 * here and in the loop. But it is moot on Self-Snoop CPUs anyways.
+	 */
+
+	if ((!cpu_has_clflush || a->full_flush) && !cpu_has_ss)
+		wbinvd();
 }
 
 /* both protected by init_mm.mmap_sem */

^ permalink raw reply	[flat|nested] 51+ messages in thread

* [PATCH CPA] [18/28] CPA: Make change_page_attr() more robust against use of PAT bits
  2008-01-03 15:24 [PATCH CPA] [1/28] Shrink __PAGE_KERNEL/__PAGE_KERNEL_EXEC on non PAE kernels Andi Kleen
                   ` (15 preceding siblings ...)
  2008-01-03 15:24 ` [PATCH CPA] [17/28] CPA: Reorder TLB / cache flushes to follow Intel recommendation Andi Kleen
@ 2008-01-03 15:24 ` Andi Kleen
  2008-01-03 15:24 ` [PATCH CPA] [19/28] CPA: Limit cache flushing to pages that really change caching Andi Kleen
                   ` (10 subsequent siblings)
  27 siblings, 0 replies; 51+ messages in thread
From: Andi Kleen @ 2008-01-03 15:24 UTC (permalink / raw)
  To: linux-kernel


Use the page table level instead of the PSE bit to check whether the PTE
is for a 4K page or not. This makes the code more robust against use of
the PAT bit, because on 4K pages the PAT bit occupies the same position
as the PSE bit.
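
A standalone illustration of the ambiguity (bit numbers as defined later
in this series; not code from the patch): on a 4K PTE the PAT bit sits
where the PSE bit sits in a large page entry, so a pte_huge() style test
cannot distinguish the two once PAT is used.

#include <stdio.h>

#define _PAGE_BIT_PSE	7	/* large page, in PMD/PUD entries */
#define _PAGE_BIT_PAT	7	/* PAT selector, in 4K PTEs only  */

int main(void)
{
	unsigned long pte_4k_with_pat = 1UL << _PAGE_BIT_PAT;
	unsigned long pmd_large       = 1UL << _PAGE_BIT_PSE;

	/* Both print 1: the PSE test misreads the 4K PTE as huge. */
	printf("%d %d\n",
	       !!(pte_4k_with_pat & (1UL << _PAGE_BIT_PSE)),
	       !!(pmd_large       & (1UL << _PAGE_BIT_PSE)));
	return 0;
}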

Signed-off-by: Andi Kleen <ak@suse.de>

---
 arch/x86/mm/pageattr_32.c |    4 ++--
 arch/x86/mm/pageattr_64.c |    4 ++--
 2 files changed, 4 insertions(+), 4 deletions(-)

Index: linux/arch/x86/mm/pageattr_32.c
===================================================================
--- linux.orig/arch/x86/mm/pageattr_32.c
+++ linux/arch/x86/mm/pageattr_32.c
@@ -209,7 +209,7 @@ __change_page_attr(struct page *page, pg
 	set_tlb_flush(address);
 
 	if (pgprot_val(prot) != pgprot_val(PAGE_KERNEL)) { 
-		if (!pte_huge(*kpte)) {
+		if (level == 3) {
 			set_pte_atomic(kpte, mk_pte(page, prot)); 
 		} else {
 			pgprot_t ref_prot;
@@ -225,7 +225,7 @@ __change_page_attr(struct page *page, pg
 			kpte_page = split;
 		}
 		page_private(kpte_page)++;
-	} else if (!pte_huge(*kpte)) {
+	} else if (level == 3) {
 		set_pte_atomic(kpte, mk_pte(page, PAGE_KERNEL));
 		BUG_ON(page_private(kpte_page) == 0);
 		page_private(kpte_page)--;
Index: linux/arch/x86/mm/pageattr_64.c
===================================================================
--- linux.orig/arch/x86/mm/pageattr_64.c
+++ linux/arch/x86/mm/pageattr_64.c
@@ -184,7 +184,7 @@ __change_page_attr(unsigned long address
 	set_tlb_flush(address);
 
 	if (pgprot_val(prot) != pgprot_val(ref_prot)) { 
-		if (!pte_huge(*kpte)) {
+		if (level == 4) {
 			set_pte(kpte, pfn_pte(pfn, prot));
 		} else {
  			/*
@@ -201,7 +201,7 @@ __change_page_attr(unsigned long address
 			kpte_page = split;
 		}
 		page_private(kpte_page)++;
-	} else if (!pte_huge(*kpte)) {
+	} else if (level == 4) {
 		set_pte(kpte, pfn_pte(pfn, ref_prot));
 		BUG_ON(page_private(kpte_page) == 0);
 		page_private(kpte_page)--;

^ permalink raw reply	[flat|nested] 51+ messages in thread

* [PATCH CPA] [19/28] CPA: Limit cache flushing to pages that really change caching
  2008-01-03 15:24 [PATCH CPA] [1/28] Shrink __PAGE_KERNEL/__PAGE_KERNEL_EXEC on non PAE kernels Andi Kleen
                   ` (16 preceding siblings ...)
  2008-01-03 15:24 ` [PATCH CPA] [18/28] CPA: Make change_page_attr() more robust against use of PAT bits Andi Kleen
@ 2008-01-03 15:24 ` Andi Kleen
  2008-01-03 15:24 ` [PATCH CPA] [20/28] CPA: Fix inaccurate comments in 64bit change_page_attr() Andi Kleen
                   ` (9 subsequent siblings)
  27 siblings, 0 replies; 51+ messages in thread
From: Andi Kleen @ 2008-01-03 15:24 UTC (permalink / raw)
  To: linux-kernel


Previously change_page_attr() always flushed caches, even for
pages that only change a non-caching-related attribute (like RO for
read/write protection).

This changes the flush code to only flush the cache when the
caching attributes actually change.

I made some effort to already handle reprogrammed PAT bits, although this
is not strictly needed right now by the core kernel (but that will
probably change soon).

This will only make a difference on AMD CPUs or older Intel CPUs,
because all newer Intel CPUs support "self-snoop" and do not require
this cache flushing anyways.

Another advantage of this patch is that it prevents recursive
slab calls with slab debugging.
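
A standalone model of the cache_attr_changed() idea, under the assumption
that only PWT/PCD/PAT matter; PAT bit positions follow the pgtable_*.h
hunks below, with a simple "large" flag instead of the level index:

#include <stdio.h>

#define _PAGE_PWT	(1UL << 3)
#define _PAGE_PCD	(1UL << 4)
#define _PAGE_PAT	(1UL << 7)	/* 4K PTEs       */
#define _PAGE_PAT_LARGE	(1UL << 12)	/* 2M/4M entries */

/* Normalize the caching bits so large and 4K pages can be compared. */
static unsigned long caching_bits(unsigned long pte, int large)
{
	unsigned long a = pte & (_PAGE_PCD | _PAGE_PWT);

	if (pte & (large ? _PAGE_PAT_LARGE : _PAGE_PAT))
		a |= _PAGE_PAT;
	return a;
}

int main(void)
{
	unsigned long wb_rw = 0x003;				/* present|rw, write-back */
	unsigned long wb_ro = 0x001;				/* present only, still WB */
	unsigned long uc    = 0x003 | _PAGE_PCD | _PAGE_PWT;	/* uncached               */

	printf("RO change needs cache flush:     %d\n",
	       caching_bits(wb_rw, 0) != caching_bits(wb_ro, 0));	/* 0 */
	printf("UC->WB change needs cache flush: %d\n",
	       caching_bits(uc, 0) != caching_bits(wb_rw, 0));		/* 1 */
	return 0;
}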

Signed-off-by: Andi Kleen <ak@suse.de>

---
 arch/x86/mm/pageattr_32.c    |   43 +++++++++++++++++++++++++++++++++----------
 arch/x86/mm/pageattr_64.c    |   42 ++++++++++++++++++++++++++++++++++--------
 include/asm-x86/pgtable_32.h |    8 ++++++++
 include/asm-x86/pgtable_64.h |    8 ++++++++
 4 files changed, 83 insertions(+), 18 deletions(-)

Index: linux/arch/x86/mm/pageattr_64.c
===================================================================
--- linux.orig/arch/x86/mm/pageattr_64.c
+++ linux/arch/x86/mm/pageattr_64.c
@@ -13,7 +13,10 @@
 #include <asm/tlbflush.h>
 #include <asm/io.h>
 
+enum flush_mode { FLUSH_NONE, FLUSH_CACHE, FLUSH_TLB };
+
 struct flush {
+	enum flush_mode mode;
 	struct list_head l;
 	unsigned long addr;
 };
@@ -76,7 +79,7 @@ static struct page *split_large_page(uns
 } 
 
 struct flush_arg {
-	int full_flush;
+	enum flush_mode full_flush;
 	struct list_head l;
 };
 
@@ -91,14 +94,19 @@ static void flush_kernel_map(void *arg)
 {
 	struct flush_arg *a = (struct flush_arg *)arg;
 	struct flush *f;
+	int cache_flush = a->full_flush == FLUSH_CACHE;
 
 	/* When clflush is available always use it because it is
 	   much cheaper than WBINVD. */
 	list_for_each_entry(f, &a->l, l) {
 		if (!a->full_flush)
 			__flush_tlb_one(f->addr);
-		if (!a->full_flush && !cpu_has_ss)
-			clflush_cache_range((void *)f->addr, PAGE_SIZE);
+		if (f->mode == FLUSH_CACHE && !cpu_has_ss) {
+			if (cpu_has_clflush)
+				clflush_cache_range((void *)f->addr, PAGE_SIZE);
+			else
+				cache_flush++;
+		}
 	}
 	if (a->full_flush)
 		__flush_tlb_all();
@@ -108,12 +116,12 @@ static void flush_kernel_map(void *arg)
 	 * here and in the loop. But it is moot on Self-Snoop CPUs anyways.
 	 */
 
-	if ((!cpu_has_clflush || a->full_flush) && !cpu_has_ss)
+	if (cache_flush > 0 && !cpu_has_ss)
 		wbinvd();
 }
 
 /* both protected by init_mm.mmap_sem */
-static int full_flush;
+static enum flush_mode full_flush;
 static LIST_HEAD(deferred_pages);
 static LIST_HEAD(flush_pages);
 
@@ -155,17 +163,35 @@ static void revert_page(unsigned long ad
  * data structure to keep track of the flush. This has the added bonus that it will
  * work for MMIO holes without mem_map too.
  */
-static void set_tlb_flush(unsigned long address)
+static void set_tlb_flush(unsigned long address, int cache)
 {
+	enum flush_mode mode = cache ? FLUSH_CACHE : FLUSH_TLB;
 	struct flush *f = kmalloc(sizeof(struct flush), GFP_KERNEL);
 	if (!f) {
-		full_flush = 1;
+		if (full_flush < mode)
+			full_flush = mode;
 		return;
 	}
+
 	f->addr = address;
+	f->mode = mode;
 	list_add_tail(&f->l, &flush_pages);
 }
 
+static unsigned short pat_bit[5] = {
+	[4] = _PAGE_PAT,
+	[3] = _PAGE_PAT_LARGE,
+};
+
+static int cache_attr_changed(pte_t pte, pgprot_t prot, int level)
+{
+	unsigned a = pte_val(pte) & (_PAGE_PCD|_PAGE_PWT);
+	/* Correct for PAT bit being in different position on large pages */
+	if (pte_val(pte) & pat_bit[level])
+		a |= _PAGE_PAT;
+	return a != (pgprot_val(prot) & _PAGE_CACHE);
+}
+
 static int
 __change_page_attr(unsigned long address, unsigned long pfn, pgprot_t prot,
 				   pgprot_t ref_prot)
@@ -181,7 +207,7 @@ __change_page_attr(unsigned long address
 	BUG_ON(PageCompound(kpte_page));
 	BUG_ON(PageLRU(kpte_page));
 
-	set_tlb_flush(address);
+	set_tlb_flush(address, cache_attr_changed(*kpte, prot, level));
 
 	if (pgprot_val(prot) != pgprot_val(ref_prot)) { 
 		if (level == 4) {
Index: linux/include/asm-x86/pgtable_64.h
===================================================================
--- linux.orig/include/asm-x86/pgtable_64.h
+++ linux/include/asm-x86/pgtable_64.h
@@ -150,10 +150,15 @@ static inline pte_t ptep_get_and_clear_f
 #define _PAGE_BIT_ACCESSED	5
 #define _PAGE_BIT_DIRTY		6
 #define _PAGE_BIT_PSE		7	/* 4 MB (or 2MB) page */
+#define _PAGE_BIT_PAT		7	/* Only on 4kb pages */
+#define _PAGE_BIT_PAT_LARGE    12	/* Only on large pages */
 #define _PAGE_BIT_FILE		6
 #define _PAGE_BIT_GLOBAL	8	/* Global TLB entry PPro+ */
 #define _PAGE_BIT_NX           63       /* No execute: only valid after cpuid check */
 
+/* Needs special handling for large pages */
+#define _PAGE_CACHE	 (_PAGE_PCD|_PAGE_PWT|_PAGE_PAT)
+
 #define _PAGE_PRESENT	(_AC(1, UL)<<_PAGE_BIT_PRESENT)
 #define _PAGE_RW	(_AC(1, UL)<<_PAGE_BIT_RW)
 #define _PAGE_USER	(_AC(1, UL)<<_PAGE_BIT_USER)
@@ -171,6 +176,9 @@ static inline pte_t ptep_get_and_clear_f
 #define _PAGE_PROTNONE	0x080	/* If not present */
 #define _PAGE_NX        (_AC(1, UL)<<_PAGE_BIT_NX)
 
+#define _PAGE_PAT	(_AC(1,UL) << _PAGE_BIT_PAT)
+#define _PAGE_PAT_LARGE	(_AC(1,UL) << _PAGE_BIT_PAT_LARGE)
+
 #define _PAGE_TABLE	(_PAGE_PRESENT | _PAGE_RW | _PAGE_USER | _PAGE_ACCESSED | _PAGE_DIRTY)
 #define _KERNPG_TABLE	(_PAGE_PRESENT | _PAGE_RW | _PAGE_ACCESSED | _PAGE_DIRTY)
 
Index: linux/arch/x86/mm/pageattr_32.c
===================================================================
--- linux.orig/arch/x86/mm/pageattr_32.c
+++ linux/arch/x86/mm/pageattr_32.c
@@ -21,12 +21,15 @@
 #define ClearPageDeferred(p) clear_bit(PG_deferred, &(p)->flags)
 #define TestSetPageDeferred(p) test_and_set_bit(PG_deferred, &(p)->flags)
 
+enum flush_mode { FLUSH_NONE, FLUSH_CACHE, FLUSH_TLB };
+
 /* Variables protected by init_mm.mmap_sem */
-static int full_flush;
+static enum flush_mode full_flush;
 static struct list_head df_list = LIST_HEAD_INIT(df_list);
 static LIST_HEAD(flush_pages);
 
 struct flush {
+	enum flush_mode mode;
 	struct list_head l;
 	unsigned long addr;
 };
@@ -81,7 +84,7 @@ static struct page *split_large_page(uns
 } 
 
 struct flush_arg {
-	int full_flush;
+	enum flush_mode full_flush;
 	struct list_head l;
 };
 
@@ -96,12 +99,17 @@ static void flush_kernel_map(void *arg)
 {
 	struct flush_arg *a = (struct flush_arg *)arg;
 	struct flush *f;
+	int cache_flush = a->full_flush == FLUSH_CACHE;
 
 	list_for_each_entry(f, &a->l, l) {
-		if (!a->full_flush && !cpu_has_ss)
-			clflush_cache_range((void *)f->addr, PAGE_SIZE);
 		if (!a->full_flush)
 			__flush_tlb_one(f->addr);
+		if (f->mode == FLUSH_CACHE && !cpu_has_ss) {
+			if (cpu_has_clflush)
+				clflush_cache_range((void *)f->addr, PAGE_SIZE);
+			else
+				cache_flush++;
+		}
 	}
 
 	/* Handle errata with flushing large pages in early Athlons */
@@ -115,10 +123,8 @@ static void flush_kernel_map(void *arg)
 	 * here and in the loop. But it is moot on Self-Snoop CPUs anyways.
 	 */
 
-	if ((!cpu_has_clflush || a->full_flush) &&
-	    !cpu_has_ss && boot_cpu_data.x86_model >= 4)
+	if (cache_flush > 0 && !cpu_has_ss && boot_cpu_data.x86_model >= 4)
 		wbinvd();
-
 }
 
 static void set_pmd_pte(pte_t *kpte, unsigned long address, pte_t pte) 
@@ -169,6 +175,20 @@ static inline void save_page(struct page
 		list_add(&kpte_page->lru, &df_list);
 }
 
+static unsigned short pat_bit[4] = {
+	[3] = _PAGE_PAT,
+	[2] = _PAGE_PAT_LARGE,
+};
+
+static int cache_attr_changed(pte_t pte, pgprot_t prot, int level)
+{
+	unsigned long a = pte_val(pte) & (_PAGE_PCD|_PAGE_PWT);
+	/* Correct for PAT bit being in different position on large pages */
+	if (pte_val(pte) & pat_bit[level])
+		a |= _PAGE_PAT;
+	return a != (pgprot_val(prot) & _PAGE_CACHE);
+}
+
 /*
  * Mark the address for flushing later in global_tlb_flush().
  *
@@ -177,14 +197,17 @@ static inline void save_page(struct page
  * data structure to keep track of the flush. This has the added bonus that it will
  * work for MMIO holes without mem_map too.
  */
-static void set_tlb_flush(unsigned long address)
+static void set_tlb_flush(unsigned long address, int cache)
 {
+	enum flush_mode mode = cache ? FLUSH_CACHE : FLUSH_TLB;
 	struct flush *f = kmalloc(sizeof(struct flush), GFP_KERNEL);
 	if (!f) {
-		full_flush = 1;
+		if (full_flush < mode)
+			full_flush = mode;
 		return;
 	}
 	f->addr = address;
+	f->mode = mode;
 	list_add_tail(&f->l, &flush_pages);
 }
 
@@ -206,7 +229,7 @@ __change_page_attr(struct page *page, pg
 	BUG_ON(PageLRU(kpte_page));
 	BUG_ON(PageCompound(kpte_page));
 
-	set_tlb_flush(address);
+	set_tlb_flush(address, cache_attr_changed(*kpte, prot, level));
 
 	if (pgprot_val(prot) != pgprot_val(PAGE_KERNEL)) { 
 		if (level == 3) {
Index: linux/include/asm-x86/pgtable_32.h
===================================================================
--- linux.orig/include/asm-x86/pgtable_32.h
+++ linux/include/asm-x86/pgtable_32.h
@@ -97,6 +97,8 @@ void paging_init(void);
 #define _PAGE_BIT_ACCESSED	5
 #define _PAGE_BIT_DIRTY		6
 #define _PAGE_BIT_PSE		7	/* 4 MB (or 2MB) page, Pentium+, if present.. */
+#define _PAGE_BIT_PAT		7	/* only on 4K pages */
+#define _PAGE_BIT_PAT_LARGE	12	/* only on large pages */
 #define _PAGE_BIT_GLOBAL	8	/* Global TLB entry PPro+ */
 #define _PAGE_BIT_UNUSED1	9	/* available for programmer */
 #define _PAGE_BIT_UNUSED2	10
@@ -126,6 +128,12 @@ void paging_init(void);
 #define _PAGE_NX	0
 #endif
 
+#define _PAGE_PAT	(1U << _PAGE_BIT_PAT)
+#define _PAGE_PAT_LARGE	(1U << _PAGE_BIT_PAT_LARGE)
+
+/* Needs special handling for large pages */
+#define _PAGE_CACHE	(_PAGE_PWT|_PAGE_PCD|_PAGE_PAT)
+
 #define _PAGE_TABLE	(_PAGE_PRESENT | _PAGE_RW | _PAGE_USER | _PAGE_ACCESSED | _PAGE_DIRTY)
 #define _KERNPG_TABLE	(_PAGE_PRESENT | _PAGE_RW | _PAGE_ACCESSED | _PAGE_DIRTY)
 #define _PAGE_CHG_MASK	(PTE_MASK | _PAGE_ACCESSED | _PAGE_DIRTY)

^ permalink raw reply	[flat|nested] 51+ messages in thread

* [PATCH CPA] [20/28] CPA: Fix inaccurate comments in 64bit change_page_attr()
  2008-01-03 15:24 [PATCH CPA] [1/28] Shrink __PAGE_KERNEL/__PAGE_KERNEL_EXEC on non PAE kernels Andi Kleen
                   ` (17 preceding siblings ...)
  2008-01-03 15:24 ` [PATCH CPA] [19/28] CPA: Limit cache flushing to pages that really change caching Andi Kleen
@ 2008-01-03 15:24 ` Andi Kleen
  2008-01-03 15:24 ` [PATCH CPA] [21/28] CPA: Dump pagetable when inconsistency is detected Andi Kleen
                   ` (8 subsequent siblings)
  27 siblings, 0 replies; 51+ messages in thread
From: Andi Kleen @ 2008-01-03 15:24 UTC (permalink / raw)
  To: linux-kernel


No code changes.
Signed-off-by: Andi Kleen <ak@suse.de>

---
 arch/x86/mm/pageattr_64.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

Index: linux/arch/x86/mm/pageattr_64.c
===================================================================
--- linux.orig/arch/x86/mm/pageattr_64.c
+++ linux/arch/x86/mm/pageattr_64.c
@@ -120,7 +120,7 @@ static void flush_kernel_map(void *arg)
 		wbinvd();
 }
 
-/* both protected by init_mm.mmap_sem */
+/* All protected by init_mm.mmap_sem */
 static enum flush_mode full_flush;
 static LIST_HEAD(deferred_pages);
 static LIST_HEAD(flush_pages);
@@ -132,7 +132,7 @@ static inline void save_page(struct page
 }
 
 /* 
- * No more special protections in this 2/4MB area - revert to a
+ * No more special protections in this 2MB area - revert to a
  * large page again. 
  */
 static void revert_page(unsigned long address, pgprot_t ref_prot)

^ permalink raw reply	[flat|nested] 51+ messages in thread

* [PATCH CPA] [21/28] CPA: Dump pagetable when inconsistency is detected
  2008-01-03 15:24 [PATCH CPA] [1/28] Shrink __PAGE_KERNEL/__PAGE_KERNEL_EXEC on non PAE kernels Andi Kleen
                   ` (18 preceding siblings ...)
  2008-01-03 15:24 ` [PATCH CPA] [20/28] CPA: Fix inaccurate comments in 64bit change_page_attr() Andi Kleen
@ 2008-01-03 15:24 ` Andi Kleen
  2008-01-03 15:24 ` [PATCH CPA] [22/28] CPA: Only queue actually unused page table pages for freeing Andi Kleen
                   ` (7 subsequent siblings)
  27 siblings, 0 replies; 51+ messages in thread
From: Andi Kleen @ 2008-01-03 15:24 UTC (permalink / raw)
  To: linux-kernel


When c_p_a() detects an inconsistency in the kernel page tables
it BUGs. When this happens, dump the page table first to avoid one
bug-reporting round trip.

Signed-off-by: Andi Kleen <ak@suse.de>

---
 arch/x86/mm/pageattr_32.c |   11 ++++++++++-
 arch/x86/mm/pageattr_64.c |   11 ++++++++++-
 2 files changed, 20 insertions(+), 2 deletions(-)

Index: linux/arch/x86/mm/pageattr_64.c
===================================================================
--- linux.orig/arch/x86/mm/pageattr_64.c
+++ linux/arch/x86/mm/pageattr_64.c
@@ -12,6 +12,7 @@
 #include <asm/processor.h>
 #include <asm/tlbflush.h>
 #include <asm/io.h>
+#include <asm/kdebug.h>
 
 enum flush_mode { FLUSH_NONE, FLUSH_CACHE, FLUSH_TLB };
 
@@ -231,8 +232,16 @@ __change_page_attr(unsigned long address
 		set_pte(kpte, pfn_pte(pfn, ref_prot));
 		BUG_ON(page_private(kpte_page) == 0);
 		page_private(kpte_page)--;
-	} else
+	} else {
+		/*
+		 * When you're here you either did set the same page to PAGE_KERNEL
+	  	 * two times in a row or the page table reference counting is broken again.
+		 * To catch the later bug for now (sorry)
+		 */
+		printk(KERN_ERR "address %lx\n", address);
+		dump_pagetable(address);
 		BUG();
+	}
 
 	/* on x86-64 the direct mapping set at boot is not using 4k pages */
  	BUG_ON(PageReserved(kpte_page));
Index: linux/arch/x86/mm/pageattr_32.c
===================================================================
--- linux.orig/arch/x86/mm/pageattr_32.c
+++ linux/arch/x86/mm/pageattr_32.c
@@ -13,6 +13,7 @@
 #include <asm/tlbflush.h>
 #include <asm/pgalloc.h>
 #include <asm/sections.h>
+#include <asm/kdebug.h>
 
 #define PG_deferred PG_arch_1
 
@@ -252,8 +253,16 @@ __change_page_attr(struct page *page, pg
 		set_pte_atomic(kpte, mk_pte(page, PAGE_KERNEL));
 		BUG_ON(page_private(kpte_page) == 0);
 		page_private(kpte_page)--;
-	} else
+	} else {
+		/*
+		 * When you're here you either did set the same page to PAGE_KERNEL
+	  	 * two times in a row or the page table reference counting is broken again.
+		 * To catch the later bug for now (sorry)
+		 */
+		printk(KERN_ERR "address %lx\n", address);
+		dump_pagetable(address);
 		BUG();
+	}
 
 	/*
 	 * If the pte was reserved, it means it was created at boot

^ permalink raw reply	[flat|nested] 51+ messages in thread

* [PATCH CPA] [22/28] CPA: Only queue actually unused page table pages for freeing
  2008-01-03 15:24 [PATCH CPA] [1/28] Shrink __PAGE_KERNEL/__PAGE_KERNEL_EXEC on non PAE kernels Andi Kleen
                   ` (19 preceding siblings ...)
  2008-01-03 15:24 ` [PATCH CPA] [21/28] CPA: Dump pagetable when inconsistency is detected Andi Kleen
@ 2008-01-03 15:24 ` Andi Kleen
  2008-01-03 15:24 ` [PATCH CPA] [23/28] CPA: Remove unnecessary masking of address Andi Kleen
                   ` (6 subsequent siblings)
  27 siblings, 0 replies; 51+ messages in thread
From: Andi Kleen @ 2008-01-03 15:24 UTC (permalink / raw)
  To: linux-kernel


With the separate flush data structure added earlier, save_page() now only
needs to be called on pte pages that have already been reverted.

Also move all freeing checks into the caller.
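
A toy model of the new rule (not from the patch; "private" mirrors
page_private() counting the non-standard PTEs in the page):

#include <stdio.h>

struct kpte_page {
	int private;	/* non-standard PTEs left in this page table page */
	int queued;	/* queued on the deferred free list?              */
};

static void save_page(struct kpte_page *p) { p->queued = 1; }

/* Only a pte page whose counter dropped to zero is reverted to a
 * large page and queued for freeing. */
static void maybe_revert(struct kpte_page *p)
{
	if (p->private == 0) {
		save_page(p);
		printf("reverted and queued for freeing\n");
	}
}

int main(void)
{
	struct kpte_page p = { .private = 1, .queued = 0 };

	maybe_revert(&p);	/* still in use: nothing happens */
	p.private = 0;
	maybe_revert(&p);	/* last user gone: revert + free */
	return 0;
}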

Signed-off-by: Andi Kleen <ak@suse.de>

---
 arch/x86/mm/pageattr_32.c |    4 +---
 arch/x86/mm/pageattr_64.c |    7 +++----
 2 files changed, 4 insertions(+), 7 deletions(-)

Index: linux/arch/x86/mm/pageattr_32.c
===================================================================
--- linux.orig/arch/x86/mm/pageattr_32.c
+++ linux/arch/x86/mm/pageattr_32.c
@@ -270,9 +270,9 @@ __change_page_attr(struct page *page, pg
 	 * replace it with a largepage.
 	 */
 
-	save_page(kpte_page);
 	if (!PageReserved(kpte_page)) {
 		if (cpu_has_pse && (page_private(kpte_page) == 0)) {
+			save_page(kpte_page);
 			paravirt_release_pt(page_to_pfn(kpte_page));
 			revert_page(kpte_page, address);
 		}
@@ -333,8 +333,6 @@ void global_flush_tlb(void)
 	list_for_each_entry_safe(pg, next, &free_pages, lru) {
 		list_del(&pg->lru);
 		ClearPageDeferred(pg);
-		if (PageReserved(pg) || !cpu_has_pse || page_private(pg) != 0)
-			continue;
 		ClearPagePrivate(pg);
 		__free_page(pg);
 	}
Index: linux/arch/x86/mm/pageattr_64.c
===================================================================
--- linux.orig/arch/x86/mm/pageattr_64.c
+++ linux/arch/x86/mm/pageattr_64.c
@@ -246,9 +246,10 @@ __change_page_attr(unsigned long address
 	/* on x86-64 the direct mapping set at boot is not using 4k pages */
  	BUG_ON(PageReserved(kpte_page));
 
-	save_page(kpte_page);
-	if (page_private(kpte_page) == 0)
+	if (page_private(kpte_page) == 0) {
+		save_page(kpte_page);
 		revert_page(address, ref_prot);
+	}
 	return 0;
 } 
 
@@ -336,8 +337,6 @@ void global_flush_tlb(void)
 	list_for_each_entry_safe(pg, next, &free_pages, lru) {
 		list_del(&pg->lru);
 		ClearPageDeferred(pg);
-		if (page_private(pg) != 0)
-			continue;
 		ClearPagePrivate(pg);
 		__free_page(pg);
 	} 

^ permalink raw reply	[flat|nested] 51+ messages in thread

* [PATCH CPA] [23/28] CPA: Remove unnecessary masking of address
  2008-01-03 15:24 [PATCH CPA] [1/28] Shrink __PAGE_KERNEL/__PAGE_KERNEL_EXEC on non PAE kernels Andi Kleen
                   ` (20 preceding siblings ...)
  2008-01-03 15:24 ` [PATCH CPA] [22/28] CPA: Only queue actually unused page table pages for freeing Andi Kleen
@ 2008-01-03 15:24 ` Andi Kleen
  2008-01-03 15:24 ` [PATCH CPA] [24/28] CPA: Only unmap kernel init pages in text mapping when CONFIG_DEBUG_RODATA is set Andi Kleen
                   ` (5 subsequent siblings)
  27 siblings, 0 replies; 51+ messages in thread
From: Andi Kleen @ 2008-01-03 15:24 UTC (permalink / raw)
  To: linux-kernel


virt_to_page() does not care about the bits below page granularity,
so don't mask them off.
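
A quick standalone check of the arithmetic (PAGE_SHIFT of 12 and the
example address are assumptions): the pfn, and hence the struct page,
is the same whether or not the low bits are masked off first.

#include <assert.h>

#define PAGE_SHIFT	12
#define PAGE_MASK	(~((1UL << PAGE_SHIFT) - 1))

int main(void)
{
	unsigned long kpte = 0xc01234a8UL;	/* made-up pte address */

	assert((kpte >> PAGE_SHIFT) == ((kpte & PAGE_MASK) >> PAGE_SHIFT));
	return 0;
}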

Signed-off-by: Andi Kleen <ak@suse.de>

---
 arch/x86/mm/pageattr_64.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux/arch/x86/mm/pageattr_64.c
===================================================================
--- linux.orig/arch/x86/mm/pageattr_64.c
+++ linux/arch/x86/mm/pageattr_64.c
@@ -204,7 +204,7 @@ __change_page_attr(unsigned long address
 
 	kpte = lookup_address(address, &level);
 	if (!kpte) return 0;
-	kpte_page = virt_to_page(((unsigned long)kpte) & PAGE_MASK);
+	kpte_page = virt_to_page(kpte);
 	BUG_ON(PageCompound(kpte_page));
 	BUG_ON(PageLRU(kpte_page));
 

^ permalink raw reply	[flat|nested] 51+ messages in thread

* [PATCH CPA] [24/28] CPA: Only unmap kernel init pages in text mapping when CONFIG_DEBUG_RODATA is set
  2008-01-03 15:24 [PATCH CPA] [1/28] Shrink __PAGE_KERNEL/__PAGE_KERNEL_EXEC on non PAE kernels Andi Kleen
                   ` (21 preceding siblings ...)
  2008-01-03 15:24 ` [PATCH CPA] [23/28] CPA: Remove unnecessary masking of address Andi Kleen
@ 2008-01-03 15:24 ` Andi Kleen
  2008-01-03 15:24 ` [PATCH CPA] [25/28] CPA: Always do full TLB flush when splitting large pages Andi Kleen
                   ` (4 subsequent siblings)
  27 siblings, 0 replies; 51+ messages in thread
From: Andi Kleen @ 2008-01-03 15:24 UTC (permalink / raw)
  To: jbeulich, linux-kernel


Otherwise the kernel will likely always run with 4K pages instead of 2MB pages,
which is costly in terms of TLB usage.

Also optimize it a little bit by using only a single change_page_attr() call.
This is particularly useful if debugging is enabled inside it, because it spams
the logs much less.

Cc: jbeulich@novell.com

Signed-off-by: Andi Kleen <ak@suse.de>

---
 arch/x86/mm/init_64.c |    8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

Index: linux/arch/x86/mm/init_64.c
===================================================================
--- linux.orig/arch/x86/mm/init_64.c
+++ linux/arch/x86/mm/init_64.c
@@ -546,13 +546,15 @@ void free_init_pages(char *what, unsigne
 		init_page_count(virt_to_page(addr));
 		memset((void *)(addr & ~(PAGE_SIZE-1)),
 			POISON_FREE_INITMEM, PAGE_SIZE);
-		if (addr >= __START_KERNEL_map)
-			change_page_attr_addr(addr, 1, __pgprot(0));
 		free_page(addr);
 		totalram_pages++;
 	}
-	if (addr > __START_KERNEL_map)
+#ifdef CONFIG_DEBUG_RODATA
+	if (begin >= __START_KERNEL_map) {
+		change_page_attr_addr(begin, (end - begin)/PAGE_SIZE, __pgprot(0));
 		global_flush_tlb();
+	}
+#endif
 }
 
 void free_initmem(void)

^ permalink raw reply	[flat|nested] 51+ messages in thread

* [PATCH CPA] [25/28] CPA: Always do full TLB flush when splitting large pages
  2008-01-03 15:24 [PATCH CPA] [1/28] Shrink __PAGE_KERNEL/__PAGE_KERNEL_EXEC on non PAE kernels Andi Kleen
                   ` (22 preceding siblings ...)
  2008-01-03 15:24 ` [PATCH CPA] [24/28] CPA: Only unmap kernel init pages in text mapping when CONFIG_DEBUG_RODATA is set Andi Kleen
@ 2008-01-03 15:24 ` Andi Kleen
  2008-01-03 15:24 ` [PATCH CPA] [26/28] CPA: Fix reference counting when changing already changed pages Andi Kleen
                   ` (3 subsequent siblings)
  27 siblings, 0 replies; 51+ messages in thread
From: Andi Kleen @ 2008-01-03 15:24 UTC (permalink / raw)
  To: linux-kernel


Various CPUs have errata when using INVLPG to flush large pages.
This includes Intel Penryn (AV2) and AMD K7 (#16 in Athlon 4).
While these happen only in specific circumstances, it is still
a little risky and it's not clear the kernel can avoid them all.

Avoid this can of worms by always flushing the full TLB (but
not the full cache) when splitting a large page. This should
not be that expensive anyway, and the initial splitting should
hopefully be infrequent.

This also removes the previously hard-coded workaround for K7
Athlon on 32bit.
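
A minimal sketch of the escalation rule added to set_tlb_flush()
(simplified standalone model, reusing the flush_mode enum introduced
earlier in this series):

#include <assert.h>

enum flush_mode { FLUSH_NONE, FLUSH_CACHE, FLUSH_TLB };

static enum flush_mode full_flush = FLUSH_NONE;

/* Splitting a large page forces at least a full TLB flush instead of
 * relying on INVLPG of the (former) large mapping. */
static void note_split(int split_large)
{
	if (split_large && full_flush < FLUSH_TLB)
		full_flush = FLUSH_TLB;
}

int main(void)
{
	note_split(0);
	assert(full_flush == FLUSH_NONE);
	note_split(1);
	assert(full_flush == FLUSH_TLB);
	return 0;
}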

Signed-off-by: Andi Kleen <ak@suse.de>

---
 arch/x86/mm/pageattr_32.c |   17 +++++++++++------
 arch/x86/mm/pageattr_64.c |   11 +++++++++--
 2 files changed, 20 insertions(+), 8 deletions(-)

Index: linux/arch/x86/mm/pageattr_32.c
===================================================================
--- linux.orig/arch/x86/mm/pageattr_32.c
+++ linux/arch/x86/mm/pageattr_32.c
@@ -113,10 +113,7 @@ static void flush_kernel_map(void *arg)
 		}
 	}
 
-	/* Handle errata with flushing large pages in early Athlons */
-	if (a->full_flush ||
-	    (boot_cpu_data.x86_vendor == X86_VENDOR_AMD &&
-	     boot_cpu_data.x86 == 7))
+	if (a->full_flush)
 		__flush_tlb_all();
 
 	/*
@@ -198,7 +195,7 @@ static int cache_attr_changed(pte_t pte,
  * data structure to keep track of the flush. This has the added bonus that it will
  * work for MMIO holes without mem_map too.
  */
-static void set_tlb_flush(unsigned long address, int cache)
+static void set_tlb_flush(unsigned long address, int cache, int large)
 {
 	enum flush_mode mode = cache ? FLUSH_CACHE : FLUSH_TLB;
 	struct flush *f = kmalloc(sizeof(struct flush), GFP_KERNEL);
@@ -210,6 +207,13 @@ static void set_tlb_flush(unsigned long 
 	f->addr = address;
 	f->mode = mode;
 	list_add_tail(&f->l, &flush_pages);
+
+	/*
+	 * Work around large page INVLPG bugs in early K7 and in Penryn.
+ 	 * When we split a large page always do a full TLB flush.
+	 */
+	if (large && full_flush < FLUSH_TLB)
+		full_flush = FLUSH_TLB;
 }
 
 static int
@@ -230,7 +234,8 @@ __change_page_attr(struct page *page, pg
 	BUG_ON(PageLRU(kpte_page));
 	BUG_ON(PageCompound(kpte_page));
 
-	set_tlb_flush(address, cache_attr_changed(*kpte, prot, level));
+	set_tlb_flush(address, cache_attr_changed(*kpte, prot, level),
+			level < 3);
 
 	if (pgprot_val(prot) != pgprot_val(PAGE_KERNEL)) { 
 		if (level == 3) {
Index: linux/arch/x86/mm/pageattr_64.c
===================================================================
--- linux.orig/arch/x86/mm/pageattr_64.c
+++ linux/arch/x86/mm/pageattr_64.c
@@ -164,7 +164,7 @@ static void revert_page(unsigned long ad
  * data structure to keep track of the flush. This has the added bonus that it will
  * work for MMIO holes without mem_map too.
  */
-static void set_tlb_flush(unsigned long address, int cache)
+static void set_tlb_flush(unsigned long address, int cache, int large)
 {
 	enum flush_mode mode = cache ? FLUSH_CACHE : FLUSH_TLB;
 	struct flush *f = kmalloc(sizeof(struct flush), GFP_KERNEL);
@@ -177,6 +177,13 @@ static void set_tlb_flush(unsigned long 
 	f->addr = address;
 	f->mode = mode;
 	list_add_tail(&f->l, &flush_pages);
+
+	/*
+	 * Work around large page INVLPG bugs in early K7 and in Penryn.
+ 	 * When we split a large page always do a full TLB flush.
+	 */
+	if (large && full_flush < FLUSH_TLB)
+		full_flush = FLUSH_TLB;
 }
 
 static unsigned short pat_bit[5] = {
@@ -208,7 +215,7 @@ __change_page_attr(unsigned long address
 	BUG_ON(PageCompound(kpte_page));
 	BUG_ON(PageLRU(kpte_page));
 
-	set_tlb_flush(address, cache_attr_changed(*kpte, prot, level));
+	set_tlb_flush(address, cache_attr_changed(*kpte, prot, level), level < 4);
 
 	if (pgprot_val(prot) != pgprot_val(ref_prot)) { 
 		if (level == 4) {

^ permalink raw reply	[flat|nested] 51+ messages in thread

* [PATCH CPA] [26/28] CPA: Fix reference counting when changing already changed pages
  2008-01-03 15:24 [PATCH CPA] [1/28] Shrink __PAGE_KERNEL/__PAGE_KERNEL_EXEC on non PAE kernels Andi Kleen
                   ` (23 preceding siblings ...)
  2008-01-03 15:24 ` [PATCH CPA] [25/28] CPA: Always do full TLB flush when splitting large pages Andi Kleen
@ 2008-01-03 15:24 ` Andi Kleen
  2008-01-03 15:24 ` [PATCH CPA] [27/28] CPA: Change comments of external interfaces to kerneldoc format Andi Kleen
                   ` (2 subsequent siblings)
  27 siblings, 0 replies; 51+ messages in thread
From: Andi Kleen @ 2008-01-03 15:24 UTC (permalink / raw)
  To: jbeulich, linux-kernel


When changing a page that has already been modified to non-standard attributes,
don't change the reference count again. And when changing a page back,
only decrease the ref count if the old attributes were non-standard.
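
A toy standalone model of that rule ("standard" meaning the reference
protection, e.g. PAGE_KERNEL; the counter plays the role of
page_private(kpte_page)):

#include <assert.h>

static int nonstandard_ptes;

static void change(int old_is_std, int new_is_std)
{
	if (old_is_std && !new_is_std)
		nonstandard_ptes++;		/* first deviation counts     */
	else if (!old_is_std && new_is_std) {
		assert(nonstandard_ptes > 0);
		nonstandard_ptes--;		/* only undo what was counted */
	}
	/* std->std and nonstd->nonstd leave the count untouched */
}

int main(void)
{
	change(1, 0);	/* mark page RO       */
	change(0, 0);	/* mark it RO again   */
	change(0, 1);	/* revert to standard */
	assert(nonstandard_ptes == 0);
	return 0;
}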

Cc: jbeulich@novell.com
Signed-off-by: Andi Kleen <ak@suse.de>

---
 arch/x86/mm/pageattr_32.c |   36 +++++++++++++++++++-----------------
 arch/x86/mm/pageattr_64.c |   13 +++++++++----
 2 files changed, 28 insertions(+), 21 deletions(-)

Index: linux/arch/x86/mm/pageattr_64.c
===================================================================
--- linux.orig/arch/x86/mm/pageattr_64.c
+++ linux/arch/x86/mm/pageattr_64.c
@@ -206,12 +206,13 @@ __change_page_attr(unsigned long address
 { 
 	pte_t *kpte; 
 	struct page *kpte_page;
-	pgprot_t ref_prot2;
+	pgprot_t ref_prot2, oldprot;
 	int level;
 
 	kpte = lookup_address(address, &level);
 	if (!kpte) return 0;
 	kpte_page = virt_to_page(kpte);
+	oldprot = pte_pgprot(*kpte);
 	BUG_ON(PageCompound(kpte_page));
 	BUG_ON(PageLRU(kpte_page));
 
@@ -219,6 +220,8 @@ __change_page_attr(unsigned long address
 
 	if (pgprot_val(prot) != pgprot_val(ref_prot)) { 
 		if (level == 4) {
+			if (pgprot_val(oldprot) == pgprot_val(ref_prot))
+				page_private(kpte_page)++;
 			set_pte(kpte, pfn_pte(pfn, prot));
 		} else {
  			/*
@@ -233,12 +236,14 @@ __change_page_attr(unsigned long address
 			pgprot_val(ref_prot2) &= ~_PAGE_NX;
 			set_pte(kpte, mk_pte(split, ref_prot2));
 			kpte_page = split;
+			page_private(kpte_page)++;
 		}
-		page_private(kpte_page)++;
 	} else if (level == 4) {
+		if (pgprot_val(oldprot) != pgprot_val(ref_prot)) {
+			BUG_ON(page_private(kpte_page) <= 0);
+			page_private(kpte_page)--;
+		}
 		set_pte(kpte, pfn_pte(pfn, ref_prot));
-		BUG_ON(page_private(kpte_page) == 0);
-		page_private(kpte_page)--;
 	} else {
 		/*
 		 * When you're here you either did set the same page to PAGE_KERNEL
Index: linux/arch/x86/mm/pageattr_32.c
===================================================================
--- linux.orig/arch/x86/mm/pageattr_32.c
+++ linux/arch/x86/mm/pageattr_32.c
@@ -151,20 +151,16 @@ static void set_pmd_pte(pte_t *kpte, uns
  * No more special protections in this 2/4MB area - revert to a
  * large page again. 
  */
-static inline void revert_page(struct page *kpte_page, unsigned long address)
+static void
+revert_page(struct page *kpte_page, unsigned long address, pgprot_t ref_prot)
 {
-	pgprot_t ref_prot;
 	pte_t *linear;
 
-	ref_prot =
-	((address & LARGE_PAGE_MASK) < (unsigned long)&_etext)
-		? PAGE_KERNEL_LARGE_EXEC : PAGE_KERNEL_LARGE;
-
 	linear = (pte_t *)
 		pmd_offset(pud_offset(pgd_offset_k(address), address), address);
 	set_pmd_pte(linear,  address,
-		    pfn_pte((__pa(address) & LARGE_PAGE_MASK) >> PAGE_SHIFT,
-			    ref_prot));
+		    pte_mkhuge(pfn_pte((__pa(address) & LARGE_PAGE_MASK) >> PAGE_SHIFT,
+			    ref_prot)));
 }
 
 static inline void save_page(struct page *kpte_page)
@@ -223,6 +219,8 @@ __change_page_attr(struct page *page, pg
 	unsigned long address;
 	struct page *kpte_page;
 	int level;
+	pgprot_t oldprot;
+	pgprot_t ref_prot = PAGE_KERNEL;
 
 	BUG_ON(PageHighMem(page));
 	address = (unsigned long)page_address(page);
@@ -230,6 +228,8 @@ __change_page_attr(struct page *page, pg
 	kpte = lookup_address(address, &level);
 	if (!kpte)
 		return -EINVAL;
+
+	oldprot = pte_pgprot(*kpte);
 	kpte_page = virt_to_page(kpte);
 	BUG_ON(PageLRU(kpte_page));
 	BUG_ON(PageCompound(kpte_page));
@@ -237,27 +237,29 @@ __change_page_attr(struct page *page, pg
 	set_tlb_flush(address, cache_attr_changed(*kpte, prot, level),
 			level < 3);
 
+	if ((address & LARGE_PAGE_MASK) < (unsigned long)&_etext)
+		ref_prot = PAGE_KERNEL_EXEC;
+
 	if (pgprot_val(prot) != pgprot_val(PAGE_KERNEL)) { 
 		if (level == 3) {
+			if (pgprot_val(oldprot) == pgprot_val(ref_prot))
+				page_private(kpte_page)++;
 			set_pte_atomic(kpte, mk_pte(page, prot)); 
 		} else {
-			pgprot_t ref_prot;
 			struct page *split;
-
-			ref_prot =
-			((address & LARGE_PAGE_MASK) < (unsigned long)&_etext)
-				? PAGE_KERNEL_EXEC : PAGE_KERNEL;
 			split = split_large_page(address, prot, ref_prot);
 			if (!split)
 				return -ENOMEM;
 			set_pmd_pte(kpte,address,mk_pte(split, ref_prot));
 			kpte_page = split;
+			page_private(kpte_page)++;
 		}
-		page_private(kpte_page)++;
 	} else if (level == 3) {
+		if (pgprot_val(oldprot) != pgprot_val(ref_prot)) {
+			BUG_ON(page_private(kpte_page) <= 0);
+			page_private(kpte_page)--;
+		}
 		set_pte_atomic(kpte, mk_pte(page, PAGE_KERNEL));
-		BUG_ON(page_private(kpte_page) == 0);
-		page_private(kpte_page)--;
 	} else {
 		/*
 		 * When you're here you either did set the same page to PAGE_KERNEL
@@ -279,7 +281,7 @@ __change_page_attr(struct page *page, pg
 		if (cpu_has_pse && (page_private(kpte_page) == 0)) {
 			save_page(kpte_page);
 			paravirt_release_pt(page_to_pfn(kpte_page));
-			revert_page(kpte_page, address);
+			revert_page(kpte_page, address, ref_prot);
 		}
 	}
 	return 0;

^ permalink raw reply	[flat|nested] 51+ messages in thread

* [PATCH CPA] [27/28] CPA: Change comments of external interfaces to kerneldoc format
  2008-01-03 15:24 [PATCH CPA] [1/28] Shrink __PAGE_KERNEL/__PAGE_KERNEL_EXEC on non PAE kernels Andi Kleen
                   ` (24 preceding siblings ...)
  2008-01-03 15:24 ` [PATCH CPA] [26/28] CPA: Fix reference counting when changing already changed pages Andi Kleen
@ 2008-01-03 15:24 ` Andi Kleen
  2008-01-03 15:24 ` [PATCH CPA] [28/28] Make kernel_text test match boot mapping initialization Andi Kleen
  2008-01-10  9:31 ` CPA patchset Ingo Molnar
  27 siblings, 0 replies; 51+ messages in thread
From: Andi Kleen @ 2008-01-03 15:24 UTC (permalink / raw)
  To: linux-kernel


And clarify the descriptions a bit.

Only done for 64bit, but the interfaces are identical for 32bit, and kerneldoc
should merge them (?)

Signed-off-by: Andi Kleen <ak@suse.de>

---
 Documentation/DocBook/kernel-api.tmpl |    8 +++++
 arch/x86/mm/pageattr_64.c             |   46 +++++++++++++++++++++++++---------
 2 files changed, 42 insertions(+), 12 deletions(-)

Index: linux/arch/x86/mm/pageattr_64.c
===================================================================
--- linux.orig/arch/x86/mm/pageattr_64.c
+++ linux/arch/x86/mm/pageattr_64.c
@@ -265,19 +265,19 @@ __change_page_attr(unsigned long address
 	return 0;
 } 
 
-/*
- * Change the page attributes of an page in the linear mapping.
- *
- * This should be used when a page is mapped with a different caching policy
- * than write-back somewhere - some CPUs do not like it when mappings with
- * different caching policies exist. This changes the page attributes of the
- * in kernel linear mapping too.
+/**
+ * change_page_attr_addr - Change page table attributes in linear mapping
+ * @address: Virtual address in linear mapping.
+ * @numpages: Number of pages to change
+ * @prot:    New page table attribute (PAGE_*)
  * 
- * The caller needs to ensure that there are no conflicting mappings elsewhere.
- * This function only deals with the kernel linear map.
- * 
- * Caller must call global_flush_tlb() after this.
+ * Change page attributes of a page in the direct mapping. This is a variant
+ * of change_page_attr() that also works on memory holes that do not have
+ * mem_map entry (pfn_valid() is false).
+ *
+ * See change_page_attr() documentation for more details.
  */
+
 int change_page_attr_addr(unsigned long address, int numpages, pgprot_t prot)
 {
 	int err = 0, kernel_map = 0;
@@ -314,13 +314,35 @@ int change_page_attr_addr(unsigned long 
 	return err;
 }
 
-/* Don't call this for MMIO areas that may not have a mem_map entry */
+/**
+ * change_page_attr - Change page table attributes in the linear mapping.
+ * @page: First page to change
+ * @numpages: Number of pages to change
+ * @prot: New protection/caching type (PAGE_*)
+ *
+ * Returns 0 on success, otherwise a negated errno.
+ *
+ * This should be used when a page is mapped with a different caching policy
+ * than write-back somewhere - some CPUs do not like it when mappings with
+ * different caching policies exist. This changes the page attributes of the
+ * in kernel linear mapping too.
+ *
+ * Caller must call global_flush_tlb() later to make the changes active.
+ *
+ * The caller needs to ensure that there are no conflicting mappings elsewhere
+ * (e.g. in user space) * This function only deals with the kernel linear map.
+ *
+ * For MMIO areas without mem_map use change_page_attr_addr() instead.
+ */
 int change_page_attr(struct page *page, int numpages, pgprot_t prot)
 {
 	unsigned long addr = (unsigned long)page_address(page);
 	return change_page_attr_addr(addr, numpages, prot);
 }
 
+/**
+ * global_flush_tlb - Flush linear mappings changed by change_page_attr()
+ */
 void global_flush_tlb(void)
 { 
 	struct page *pg, *next;
Index: linux/Documentation/DocBook/kernel-api.tmpl
===================================================================
--- linux.orig/Documentation/DocBook/kernel-api.tmpl
+++ linux/Documentation/DocBook/kernel-api.tmpl
@@ -726,4 +726,12 @@ X!Idrivers/video/console/fonts.c
 !Ffs/pipe.c
   </chapter>
 
+  <chapter id="pageattr">
+	<title>Kernel direct mapping paging attributes</title>
+  <para>
+	Changing the paging attributes of kernel direct mapping pages
+	on x86.
+  </para>
+!Farch/x86/mm/pageattr_64.c
+  </chapter>
 </book>
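
As a rough usage sketch of the interfaces documented above (illustrative
only, not part of the patch: the helper name is made up, and it assumes
the change_page_attr()/global_flush_tlb() declarations from
<asm/cacheflush.h> and a built-in driver that already holds a struct page):

	#include <linux/mm.h>
	#include <asm/cacheflush.h>

	/* Sketch: temporarily map one kernel page uncached, then restore it. */
	static int example_make_uncached(struct page *page)
	{
		int err;

		err = change_page_attr(page, 1, PAGE_KERNEL_NOCACHE);
		if (err)
			return err;
		global_flush_tlb();	/* the attribute change only becomes active now */

		/* ... let the device access the page here ... */

		err = change_page_attr(page, 1, PAGE_KERNEL);
		global_flush_tlb();
		return err;
	}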

^ permalink raw reply	[flat|nested] 51+ messages in thread

* [PATCH CPA] [28/28] Make kernel_text test match boot mapping initialization
  2008-01-03 15:24 [PATCH CPA] [1/28] Shrink __PAGE_KERNEL/__PAGE_KERNEL_EXEC on non PAE kernels Andi Kleen
                   ` (25 preceding siblings ...)
  2008-01-03 15:24 ` [PATCH CPA] [27/28] CPA: Change comments of external interfaces to kerneldoc format Andi Kleen
@ 2008-01-03 15:24 ` Andi Kleen
  2008-01-10  9:31 ` CPA patchset Ingo Molnar
  27 siblings, 0 replies; 51+ messages in thread
From: Andi Kleen @ 2008-01-03 15:24 UTC (permalink / raw)
  To: linux-kernel


The boot-time direct mapping initialization used a different test than c_p_a()
to check whether a page is part of the kernel text mapping. Fix that. Also
round the boundary up to a large page size to be safe.

Signed-off-by: Andi Kleen <ak@suse.de>

---
 arch/x86/mm/pageattr_32.c |    8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

Index: linux/arch/x86/mm/pageattr_32.c
===================================================================
--- linux.orig/arch/x86/mm/pageattr_32.c
+++ linux/arch/x86/mm/pageattr_32.c
@@ -183,6 +183,12 @@ static int cache_attr_changed(pte_t pte,
 	return a != (pgprot_val(prot) & _PAGE_CACHE);
 }
 
+static int text_address(unsigned long addr)
+{
+	unsigned long end = ((unsigned long)__init_end & LARGE_PAGE_MASK);
+	return addr < end + LARGE_PAGE_SIZE;
+}
+
 /*
  * Mark the address for flushing later in global_tlb_flush().
  *
@@ -237,7 +243,7 @@ __change_page_attr(struct page *page, pg
 	set_tlb_flush(address, cache_attr_changed(*kpte, prot, level),
 			level < 3);
 
-	if ((address & LARGE_PAGE_MASK) < (unsigned long)&_etext)
+	if (text_address(address))
 		ref_prot = PAGE_KERNEL_EXEC;
 
 	if (pgprot_val(prot) != pgprot_val(PAGE_KERNEL)) { 
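
To illustrate the rounding with made-up numbers (assuming the non-PAE
4 MB large page size; the __init_end value below is invented):

	/*
	 *   __init_end                       = 0xc06b2000   (invented)
	 *   LARGE_PAGE_MASK  = ~(4 MB - 1)   = 0xffc00000
	 *   end   = 0xc06b2000 & 0xffc00000  = 0xc0400000
	 *   end + LARGE_PAGE_SIZE            = 0xc0800000
	 *
	 * text_address() then treats every address below 0xc0800000 - i.e.
	 * everything up to the end of the large page containing __init_end -
	 * as kernel text, matching what the boot mapping code marks executable.
	 */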

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH CPA] [4/28] Add pte_clrhuge on i386
  2008-01-03 15:24 ` [PATCH CPA] [4/28] Add pte_clrhuge on i386 Andi Kleen
@ 2008-01-05  6:48   ` Jeremy Fitzhardinge
  0 siblings, 0 replies; 51+ messages in thread
From: Jeremy Fitzhardinge @ 2008-01-05  6:48 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-kernel

Andi Kleen wrote:
> 64bit had it already
>   

I'm unifying pgtable.h at the moment.  Will post soon.

    J

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: CPA patchset
  2008-01-03 15:24 [PATCH CPA] [1/28] Shrink __PAGE_KERNEL/__PAGE_KERNEL_EXEC on non PAE kernels Andi Kleen
                   ` (26 preceding siblings ...)
  2008-01-03 15:24 ` [PATCH CPA] [28/28] Make kernel_text test match boot mapping initialization Andi Kleen
@ 2008-01-10  9:31 ` Ingo Molnar
  2008-01-10  9:44   ` Dave Airlie
  2008-01-10  9:53   ` Andi Kleen
  27 siblings, 2 replies; 51+ messages in thread
From: Ingo Molnar @ 2008-01-10  9:31 UTC (permalink / raw)
  To: Andi Kleen
  Cc: linux-kernel, Thomas Gleixner, H. Peter Anvin, Venki Pallipadi,
	suresh.b.siddha, Arjan van de Ven, Dave Jones


Andi,

finally managed to get the time to review your CPA patchset, and i 
fundamentally agree with most of the detail changes done in it. But here 
are a few structural high-level observations:

- firstly, there's no rationale given. So we'll change ioremap()/etc.
  to use a cflush-range instruction instead of a WBINVD. But why?
  WBINVD isnt particularly fast (takes a few msecs), but why is that a
  problem? Drivers dont do high-frequency ioremap-ing. It's typically
  only done at driver/device startup and that's it. Whether module load
  time takes 1254 msecs instead of 1250 msecs is no big deal.

- secondly, obviously doing a 'flush some' instead of a 'flush all'
  operation is risky. There are not many ways we can get the 'flush all'
  WBINVD instruction wrong (as long as we do it on all cpus, etc.). But
  with this specific range flush we've got all the risks of accidentally
  not flushing _enough_ - especially if some boundary of a mapped area is
  specified imprecisely. Accidentally leaving aliases around in the cache is
  asking for trouble as it's very hard to debug, and the operations here
  (ioremap) are typically done only once per system bootup. So we'll add
  _fundamental_ fragility to an already historically fragile and
  bug-ridden piece of code. That does not sound too smart to me, and you
  do not analyze these concerns in your submission.

- the bugfixes and cleanups to pgattr.c i like very much - but shouldnt
  they come first in the series, instead of being mixed into it?

so i'd suggest you reshape this on top of the PAT patchset done by
Venki and Suresh, with the cflush stuff kept at the very end of the
series, and even that should be boot- and .config-configurable, with
default off - at least initially. The cflush performance advantages seem
dubious at best, and they bring in very real regression dangers.

	Ingo

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: CPA patchset
  2008-01-10  9:31 ` CPA patchset Ingo Molnar
@ 2008-01-10  9:44   ` Dave Airlie
  2008-01-10  9:55     ` Andi Kleen
  2008-01-10 10:02     ` Ingo Molnar
  2008-01-10  9:53   ` Andi Kleen
  1 sibling, 2 replies; 51+ messages in thread
From: Dave Airlie @ 2008-01-10  9:44 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andi Kleen, linux-kernel, Thomas Gleixner, H. Peter Anvin,
	Venki Pallipadi, suresh.b.siddha, Arjan van de Ven, Dave Jones

>
> finally managed to get the time to review your CPA patchset, and i
> fundamentally agree with most of the detail changes done in it. But here
> are a few structural high-level observations:
>
> - firstly, there's no rationale given. So we'll change ioremap()/etc.
>   from doing a cflush-range instruction instead of a WBINVD. But why?
>   WBINVD isnt particular fast (takes a few msecs), but why is that a
>   problem? Drivers dont do high-frequency ioremap-ing. It's typically
>   only done at driver/device startup and that's it. Whether module load
>   time takes 1254 msecs instead of 1250 msecs is no big deal.

read: graphics drivers - though I think we may avoid the whole path
if we can, and end up doing some of this in the drivers, where they
know more about the situation and so can avoid the safeties..

but I still see this being used in AGP a fair bit at some point on
some drivers..

Dave.

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: CPA patchset
  2008-01-10  9:31 ` CPA patchset Ingo Molnar
  2008-01-10  9:44   ` Dave Airlie
@ 2008-01-10  9:53   ` Andi Kleen
  2008-01-10 10:04     ` Ingo Molnar
  2008-01-10 10:43     ` Ingo Molnar
  1 sibling, 2 replies; 51+ messages in thread
From: Andi Kleen @ 2008-01-10  9:53 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andi Kleen, linux-kernel, Thomas Gleixner, H. Peter Anvin,
	Venki Pallipadi, suresh.b.siddha, Arjan van de Ven, Dave Jones

On Thu, Jan 10, 2008 at 10:31:26AM +0100, Ingo Molnar wrote:
> 
> Andi,
> 
> finally managed to get the time to review your CPA patchset, and i 
> fundamentally agree with most of the detail changes done in it. But here 
> are a few structural high-level observations:

I have a few changes and will post an updated version shortly.
There's also unfortunately still one outstanding bug (triggered by
some recent changes) on 32bit that makes the self test suite not pass
currently.

> - firstly, there's no rationale given. So we'll change ioremap()/etc.
>   from doing a cflush-range instruction instead of a WBINVD. But why?

- WBINVD is a very nasty operation. I was talking to some CPU people
and they really recommended getting rid of it as far as possible.
Stopping the CPU for msecs is just wrong and there are apparently
even some theoretical livelock situations.
- It is not interruptible in earlier VT versions and messes up
real time in the hypervisor. Some people were doing KVM on rt
kernels and had latency spikes from that.

I'll add that to the changelog.

>   WBINVD isnt particular fast (takes a few msecs), but why is that a
>   problem? Drivers dont do high-frequency ioremap-ing. It's typically

Actually graphics drivers can do higher frequency allocation of WC
memory with PAT support.

>   only done at driver/device startup and that's it. Whether module load
>   time takes 1254 msecs instead of 1250 msecs is no big deal.
> 
> - secondly, obviously doing a 'flush some' instead of a 'flush all' 
>   operation is risky. There's not many ways we can get the 'flush all'

Yes, that is why it took so long. But it's eventually needed
to make PAT actually useful for once.

>   WBINVD instruction wrong (as long as we do it on all cpus, etc.). But 
>   with this specific range flush we've got all the risks of accidentally
>   not flushing _enough_. Especially if some boundary of a mapped area is 
>   imprecisely.

Not sure what you mean here? If someone doesn't pass in the correct
area then not everything will be uncached in the first place. 
Using something as cached that should be uncached is usually
noticed because it tends to be quite slow or corrupt data.

I don't think you have thought that concern through.

>   asking for trouble as it's very hard to debug and the operations here
>   (ioremap) are typically done only once per system bootup. So we'll add

change_page_attr() is used for more than just ioremap (e.g. a common
case is mapping memory into AGP apertures, which still happens in
many graphics drivers). I also expect it will be used even more
in the future when PAT usage becomes more widespread.

> - the bugfixes and cleanups to pgattr.c i like very much - but shouldnt
>   they come first in the series, instead of being mixed into it?

The patches all depend on each other; I don't think any of the changes can
be easily reordered.

-Andi

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: CPA patchset
  2008-01-10  9:44   ` Dave Airlie
@ 2008-01-10  9:55     ` Andi Kleen
  2008-01-10 10:20       ` Dave Airlie
  2008-01-10 10:02     ` Ingo Molnar
  1 sibling, 1 reply; 51+ messages in thread
From: Andi Kleen @ 2008-01-10  9:55 UTC (permalink / raw)
  To: Dave Airlie
  Cc: Ingo Molnar, Andi Kleen, linux-kernel, Thomas Gleixner,
	H. Peter Anvin, Venki Pallipadi, suresh.b.siddha,
	Arjan van de Ven, Dave Jones

On Thu, Jan 10, 2008 at 07:44:03PM +1000, Dave Airlie wrote:
> >
> > finally managed to get the time to review your CPA patchset, and i
> > fundamentally agree with most of the detail changes done in it. But here
> > are a few structural high-level observations:
> >
> > - firstly, there's no rationale given. So we'll change ioremap()/etc.
> >   from doing a cflush-range instruction instead of a WBINVD. But why?
> >   WBINVD isnt particular fast (takes a few msecs), but why is that a
> >   problem? Drivers dont do high-frequency ioremap-ing. It's typically
> >   only done at driver/device startup and that's it. Whether module load
> >   time takes 1254 msecs instead of 1250 msecs is no big deal.
> 
> read graphics drivers, even though I think we may avoid the whole path

You mean avoid change_page_attr() ? 

> if we can and end up doing some of this in the drivers
> when they know more about the situation so can avoid safeties..

Please explain, but it sounds very dubious.

> but I still see this being used in AGP a fair bit at some point on
> some drivers..

Well GARTs and those are widely used even without AGP busses.

-Andi

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: CPA patchset
  2008-01-10  9:44   ` Dave Airlie
  2008-01-10  9:55     ` Andi Kleen
@ 2008-01-10 10:02     ` Ingo Molnar
  1 sibling, 0 replies; 51+ messages in thread
From: Ingo Molnar @ 2008-01-10 10:02 UTC (permalink / raw)
  To: Dave Airlie
  Cc: Andi Kleen, linux-kernel, Thomas Gleixner, H. Peter Anvin,
	Venki Pallipadi, suresh.b.siddha, Arjan van de Ven, Dave Jones


* Dave Airlie <airlied@gmail.com> wrote:

> > - firstly, there's no rationale given. So we'll change ioremap()/etc.
> >   from doing a cflush-range instruction instead of a WBINVD. But why?
> >   WBINVD isnt particular fast (takes a few msecs), but why is that a
> >   problem? Drivers dont do high-frequency ioremap-ing. It's typically
> >   only done at driver/device startup and that's it. Whether module load
> >   time takes 1254 msecs instead of 1250 msecs is no big deal.
> 
> read graphics drivers, even though I think we may avoid the whole path 
> if we can and end up doing some of this in the drivers when they know 
> more about the situation so can avoid safeties..

by 'read graphics drivers' do you mean direct framebuffer access? In any 
case, a driver (or even userspace) can use cflush just fine, it's an 
unprivileged instruction. But this is about the ioremap() implementation 
using cflush instead of WBINVD, and that is a slowpath and is up to the 
kernel anyway - i'm not aware of many high-frequency ioremap() users, so 
robustness concerns control the policy here. So could you please explain 
your point in more detail?

	Ingo

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: CPA patchset
  2008-01-10  9:53   ` Andi Kleen
@ 2008-01-10 10:04     ` Ingo Molnar
  2008-01-10 10:07       ` Andi Kleen
  2008-01-10 10:43     ` Ingo Molnar
  1 sibling, 1 reply; 51+ messages in thread
From: Ingo Molnar @ 2008-01-10 10:04 UTC (permalink / raw)
  To: Andi Kleen
  Cc: linux-kernel, Thomas Gleixner, H. Peter Anvin, Venki Pallipadi,
	suresh.b.siddha, Arjan van de Ven, Dave Jones


* Andi Kleen <ak@suse.de> wrote:

> >   WBINVD isnt particular fast (takes a few msecs), but why is that a 
> >   problem? Drivers dont do high-frequency ioremap-ing. It's 
> >   typically only done at driver/device startup and that's it.
> 
> Actually graphics drivers can do higher frequency allocation of WC 
> memory (with PAT) support.

but that's not too smart: why dont they use WB plus cflush instead? 
ioremap() will always be slow and we dont care about stupid uses of it.

	Ingo

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: CPA patchset
  2008-01-10 10:04     ` Ingo Molnar
@ 2008-01-10 10:07       ` Andi Kleen
  2008-01-10 10:57         ` Ingo Molnar
  0 siblings, 1 reply; 51+ messages in thread
From: Andi Kleen @ 2008-01-10 10:07 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andi Kleen, linux-kernel, Thomas Gleixner, H. Peter Anvin,
	Venki Pallipadi, suresh.b.siddha, Arjan van de Ven, Dave Jones

On Thu, Jan 10, 2008 at 11:04:43AM +0100, Ingo Molnar wrote:
> 
> * Andi Kleen <ak@suse.de> wrote:
> 
> > >   WBINVD isnt particular fast (takes a few msecs), but why is that a 
> > >   problem? Drivers dont do high-frequency ioremap-ing. It's 
> > >   typically only done at driver/device startup and that's it.
> > 
> > Actually graphics drivers can do higher frequency allocation of WC 
> > memory (with PAT) support.
> 
> but that's not too smart: why dont they use WB plus cflush instead? 

Because they need to access it WC for performance.

> ioremap() will always be slow and we dont care about stupid uses of it.

Uninterruptible msec latencies are too slow even for slow path
initialization code imho.

-Andi

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: CPA patchset
  2008-01-10  9:55     ` Andi Kleen
@ 2008-01-10 10:20       ` Dave Airlie
  2008-01-10 10:50         ` Andi Kleen
  0 siblings, 1 reply; 51+ messages in thread
From: Dave Airlie @ 2008-01-10 10:20 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Ingo Molnar, linux-kernel, Thomas Gleixner, H. Peter Anvin,
	Venki Pallipadi, suresh.b.siddha, Arjan van de Ven, Dave Jones

On Jan 10, 2008 7:55 PM, Andi Kleen <ak@suse.de> wrote:
> On Thu, Jan 10, 2008 at 07:44:03PM +1000, Dave Airlie wrote:
> > >
> > > finally managed to get the time to review your CPA patchset, and i
> > > fundamentally agree with most of the detail changes done in it. But here
> > > are a few structural high-level observations:
> > >
> > > - firstly, there's no rationale given. So we'll change ioremap()/etc.
> > >   from doing a cflush-range instruction instead of a WBINVD. But why?
> > >   WBINVD isnt particular fast (takes a few msecs), but why is that a
> > >   problem? Drivers dont do high-frequency ioremap-ing. It's typically
> > >   only done at driver/device startup and that's it. Whether module load
> > >   time takes 1254 msecs instead of 1250 msecs is no big deal.
> >
> > read graphics drivers, even though I think we may avoid the whole path
>
> You mean avoid change_page_attr() ?

yes in some cases...

>
> > if we can and end up doing some of this in the drivers
> > when they know more about the situation so can avoid safeties..
>
> Please explain, but it sounds very dubious.

So on Intel 9xx hardware we know how to flush the whole pipe from chip
to chipset to RAM, so in the Intel driver we can leave pages mapped
cached and just flush all stages before having the GPU access them.

This is only possible as long as we know all the parts involved; for
example on AMD we have problems with that over-eager prefetching, so for
drivers on AMD chipsets we have to do something else, more than likely
using change_page_attr().

>
> Well GARTs and those are widely used even without AGP busses.

Yes mainly we have used this scheme with Intel GARTs (also the only
driver we have done with the new GPU system) by working closely with
Intel engineers..

Dave.

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: CPA patchset
  2008-01-10  9:53   ` Andi Kleen
  2008-01-10 10:04     ` Ingo Molnar
@ 2008-01-10 10:43     ` Ingo Molnar
  2008-01-10 11:07       ` Andi Kleen
  1 sibling, 1 reply; 51+ messages in thread
From: Ingo Molnar @ 2008-01-10 10:43 UTC (permalink / raw)
  To: Andi Kleen
  Cc: linux-kernel, Thomas Gleixner, H. Peter Anvin, Venki Pallipadi,
	suresh.b.siddha, Arjan van de Ven, Dave Jones


* Andi Kleen <ak@suse.de> wrote:

> There's also unfortunately still one outstanding bug (triggeredb y 
> some recent changes) on 32bit that makes the self test suite not pass 
> currently.

ok. I'd expect there to be a whole lot of bugs triggered in this area. 
We have to shape it in a way that puts the most useful bits to the head 
and the riskiest bits to the end, so that we can do reverts in a 
sensible and least destructive way - if the need arises.

> > - firstly, there's no rationale given. So we'll change ioremap()/etc.
> >   from doing a cflush-range instruction instead of a WBINVD. But why?
> 
> - WBINVD is a very nasty operation. I was talking to some CPU people 
> and they really recommended to get rid of it as far as possible. 
> Stopping the CPU for msecs is just wrong and there are apparently even 
> some theoretical live lock situations. - It is not interruptible in 
> earlier VT versions and messes up real time in the hypervisor. Some 
> people were doing KVM on rt kernels and had latency spikes from that.

ok, i thought you might have had some higher rationale too. Frankly,
WBINVD wont ever go away from the x86 instruction set, and while i am
very sympathetic to rt and latency issues, it's still a risky change in
a historically fragile area of code.

What is very real though are the hard limitations of MTRRs. So i'd
rather first like to see a clean PAT approach (which all other modern
OSs have already migrated to in the past 10 years), with all the
structural cleanups and bugfixes you did as well, which would allow us
to phase out MTRR use (of the DRM drivers, etc.), and _then_ layer an
(optional) cflush approach basically as the final step. Right now cflush
is mixed inextricably into the CPA patchset. WBINVD latency is really
the last of our problems here.

> >   WBINVD isnt particular fast (takes a few msecs), but why is that a 
> >   problem? Drivers dont do high-frequency ioremap-ing. It's 
> >   typically
> 
> Actually graphics drivers can do higher frequency allocation of WC 
> memory (with PAT) support.

but that is plain stupid. If it can be WC mapped, why not map it WB and
use cflush when updating memory? That gives cache-coherency on demand
_and_ true read-side caching - so it's faster as well. Modern GFX cards
can take that just fine. The best of both worlds. Unconditional WC is
way overrated.

> > - secondly, obviously doing a 'flush some' instead of a 'flush all'
> >   operation is risky. There's not many ways we can get the 'flush 
> >   all'
> 
> Yes, that is why it took so long. But it's eventually needed to make 
> PAT actually useful for once.

as i see it, the utility of PAT comes from going away from MTRRs, and 
the resulting flexibility it gives system designers to shape memory and 
IO BARs.

> >   WBINVD instruction wrong (as long as we do it on all cpus, etc.). 
> >   But with this specific range flush we've got all the risks of 
> >   accidentally not flushing _enough_. Especially if some boundary of 
> >   a mapped area is imprecisely.
> 
> Not sure what you mean here? If someone doesn't pass in the correct 
> area then not everything will be uncached in the first place. Using 
> something as cached that should be uncached is usually noticed because 
> it tends to be quite slow or corrupt data.

yes, 'quite slow or corrupt data' is what i meant:

> >   asking for trouble as it's very hard to debug and the operations 
                                 ^^^^^^^^^^^^^^^^^^
> >   here (ioremap) are typically done only once per system bootup. 
> >   [...]
> 
> change_page_attr() is used for more than just ioremap (e.g. a common 
> case is mapping memory into AGP apertures which still happens in many 
> graphics drivers). I also expect it will be used even more in the 
> future when PAT usage will become more wide spread.

i expect usage patterns not to change. ioremap()ing on the fly is not
too smart.

[ quoting this from the top: ]

> > finally managed to get the time to review your CPA patchset, and i 
> > fundamentally agree with most of the detail changes done in it. But 
> > here are a few structural high-level observations:
> 
> I have a few changes and will post a updated version shortly.

thanks. I've test-merged the PAT patchset from Venki/Suresh to have a 
look, and to me the most logical way to layer these changes would be:

 - PAT
 - non-cflush bits of CPA
 - cflush bits of CPA
 - GBPAGES

hm?

	Ingo

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: CPA patchset
  2008-01-10 10:20       ` Dave Airlie
@ 2008-01-10 10:50         ` Andi Kleen
  0 siblings, 0 replies; 51+ messages in thread
From: Andi Kleen @ 2008-01-10 10:50 UTC (permalink / raw)
  To: Dave Airlie
  Cc: Andi Kleen, Ingo Molnar, linux-kernel, Thomas Gleixner,
	H. Peter Anvin, Venki Pallipadi, suresh.b.siddha,
	Arjan van de Ven, Dave Jones

On Thu, Jan 10, 2008 at 08:20:26PM +1000, Dave Airlie wrote:
> This is only possible as long as we know all the parts involved, for
> example on AMD we have problems with that
> over-eager prefetching so for drivers on AMD chipsets we have to do
> something else more than likely using change_page_attr.

What over-eager prefetching? AFAIK AMD chips don't do any prefetching
that Intel chips don't do. In fact the P4 does more aggressive speculation
than any AMD chip.

One problem I see is that the CPU (both AMD and Intel) can follow random
addresses even during speculative execution, and then there is a window between
your CLFLUSH and the eventual use where the data could be pulled back
into the cache behind your back. Then eventually that cache line gets written
back and overwrites the data in memory.

The only way to prevent that is to unmap the pages or remap them uncached.

Might happen only rarely of course, but it can. We've had reports
over the years of bugs which were tracked down to such cases. Each
was a bitch to debug, usually keeping people busy for a long time.
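
A sketch of the window being described, for illustration only (the
timeline below is a reconstruction, not taken from any real driver):

	/*
	 * driver code                          CPU speculation
	 * -----------------------------        -------------------------------
	 * fill buffer via WB mapping
	 * CLFLUSH each line of the buffer
	 *                                      speculative/prefetched load pulls
	 *                                      a line of the buffer back into
	 *                                      the cache (mapping is still WB)
	 * device writes the buffer through
	 * an aperture, bypassing the cache
	 *                                      the stale line is later evicted
	 *                                      and written back, overwriting
	 *                                      the device's data
	 *
	 * Remapping the pages UC (or unmapping them) closes the window because
	 * the CPU may then no longer allocate cache lines for them at all.
	 */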

> 
> >
> > Well GARTs and those are widely used even without AGP busses.
> 
> Yes mainly we have used this scheme with Intel GARTs (also the only
> driver we have done with the new GPU system) by working closely with
> Intel engineers..

Have you discussed the case above?

Also BTW even if you can solve it for Intel GPUs there are others too
who really rely on WC/UC.

-Andi


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: CPA patchset
  2008-01-10 10:07       ` Andi Kleen
@ 2008-01-10 10:57         ` Ingo Molnar
  2008-01-10 11:12           ` Andi Kleen
  0 siblings, 1 reply; 51+ messages in thread
From: Ingo Molnar @ 2008-01-10 10:57 UTC (permalink / raw)
  To: Andi Kleen
  Cc: linux-kernel, Thomas Gleixner, H. Peter Anvin, Venki Pallipadi,
	suresh.b.siddha, Arjan van de Ven, Dave Jones


* Andi Kleen <ak@suse.de> wrote:

> > > >   WBINVD isnt particular fast (takes a few msecs), but why is 
> > > >   that a problem? Drivers dont do high-frequency ioremap-ing. 
> > > >   It's typically only done at driver/device startup and that's 
> > > >   it.
> > > 
> > > Actually graphics drivers can do higher frequency allocation of WC 
> > > memory (with PAT) support.
> > 
> > but that's not too smart: why dont they use WB plus cflush instead?
> 
> Because they need to access it WC for performance.

I think you have it fundamentally backwards: the best for performance is 
WB + cflush. What would WC offer for performance that cflush cannot do?

also, it's irrelevant to change_page_attr() call frequency. Just map in 
everything from the card and use it. In graphics, if you remap anything 
on the fly and it's not a slowpath you've lost the performance game even 
before you began it.
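
For illustration, a minimal sketch of the 'WB + cflush' pattern being
argued for here (a sketch only - it assumes 64-byte cache lines and uses
raw mfence/clflush rather than any particular kernel helper):

	static void flush_range_clflush(void *addr, unsigned long size)
	{
		unsigned long p = (unsigned long)addr & ~63UL;
		unsigned long end = (unsigned long)addr + size;

		asm volatile("mfence" ::: "memory");	/* order the prior WB stores */
		for (; p < end; p += 64)
			asm volatile("clflush %0" : "+m" (*(volatile char *)p));
		asm volatile("mfence" ::: "memory");	/* CLFLUSH is ordered by MFENCE */
	}

i.e. the buffer stays write-back cached and is pushed out to memory
explicitly before the device reads it.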

	Ingo

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: CPA patchset
  2008-01-10 10:43     ` Ingo Molnar
@ 2008-01-10 11:07       ` Andi Kleen
  2008-01-10 12:22         ` Ingo Molnar
  0 siblings, 1 reply; 51+ messages in thread
From: Andi Kleen @ 2008-01-10 11:07 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andi Kleen, linux-kernel, Thomas Gleixner, H. Peter Anvin,
	Venki Pallipadi, suresh.b.siddha, Arjan van de Ven, Dave Jones

On Thu, Jan 10, 2008 at 11:43:51AM +0100, Ingo Molnar wrote:
> > > - firstly, there's no rationale given. So we'll change ioremap()/etc.
> > >   from doing a cflush-range instruction instead of a WBINVD. But why?
> > 
> > - WBINVD is a very nasty operation. I was talking to some CPU people 
> > and they really recommended to get rid of it as far as possible. 
> > Stopping the CPU for msecs is just wrong and there are apparently even 
> > some theoretical live lock situations. - It is not interruptible in 
> > earlier VT versions and messes up real time in the hypervisor. Some 
> > people were doing KVM on rt kernels and had latency spikes from that.
> 
> ok, i thought you might have had some higher rationale too. Frankly, 
> WBINVD wont ever go away from the x86 instruction set, and while i am 

We can (and should) eliminate it from normal operation at least. The one during
reboot has to be kept unfortunately.

> very symphathetic to rt and latency issues, it's still a risky change in 
> a historically fragile area of code.

It is, but the resulting code is significantly more efficient and does
various other things in better ways (the explicit list allows various
other optimizations). It's how c_p_a should have been written from day one. 


> What is very real though are the hard limitations of MTRRs. So i'd 
> rather first like to see a clean PAT approach (which all other modern 

That's mostly orthogonal. Don't know why you bring it up now?

Anyways more efficient c_p_a() makes PAT usage easier.

> structural cleanups and bugfixes you did as well, which would allow us 
> to phase out MTRR use (of the DRM drivers, etc.), and _then_ layer an 
> (optional) cflush approach basically as the final step. Right now cflush 
> is mixed inextractably into the CPA patchset. WBINVD latency is really 

You could always disable it. Just set full_flush = 1.
Anyways it would be possible to disable it more fully, but at least
the individual flush infrastructure is used by other code too and useful
for it.


> the last of our problems here.

I disagree on that.

> 
> > >   WBINVD isnt particular fast (takes a few msecs), but why is that a 
> > >   problem? Drivers dont do high-frequency ioremap-ing. It's 
> > >   typically
> > 
> > Actually graphics drivers can do higher frequency allocation of WC 
> > memory (with PAT) support.
> 
> but that is plain stupid. If it can be WC mapped, why not map it WB and 
> use cflush when updating memory? 

Not reliable due to the speculative out of order nature of x86 CPUs. 

See my other mail to Dave explaining one scenario where it can go 
wrong.

> > >   all'
> > 
> > Yes, that is why it took so long. But it's eventually needed to make 
> > PAT actually useful for once.
> 
> as i see it, the utility of PAT comes from going away from MTRRs, and 
> the resulting flexibility it gives system designers to shape memory and 
> IO BARs.

MTRRs will not go away, they will just be used less.

One advantage of PAT is that you can set any page in memory efficiently
to WC or UC without requiring hacks like a hardware aperture. But to
make good use of that, this operation needs to be reasonably efficient.
> 
> > >   WBINVD instruction wrong (as long as we do it on all cpus, etc.). 
> > >   But with this specific range flush we've got all the risks of 
> > >   accidentally not flushing _enough_. Especially if some boundary of 
> > >   a mapped area is imprecisely.
> > 
> > Not sure what you mean here? If someone doesn't pass in the correct 
> > area then not everything will be uncached in the first place. Using 
> > something as cached that should be uncached is usually noticed because 
> > it tends to be quite slow or corrupt data.
> 
> yes, 'quite slow or corrupt data' is what i meant:

That will already happen without my patchkit.  No change to the positive
or the negative.

> 
> > >   asking for trouble as it's very hard to debug and the operations 
>                                  ^^^^^^^^^^^^^^^^^^
> > >   here (ioremap) are typically done only once per system bootup. 
> > >   [...]
> > 
> > change_page_attr() is used for more than just ioremap (e.g. a common 
> > case is mapping memory into AGP apertures which still happens in many 
> > graphics drivers). I also expect it will be used even more in the 
> > future when PAT usage will become more wide spread.
> 
> i expect usage to not change in pattern. ioremap()ping on the fly is not 
> too smart.

Once PAT is in people will use it more I expect. And contrary to your
claims it is already used for more than just ioremap. Please grep
for it.

Actually in DEBUG_PAGEALLOC kernels it is currently heavily used,
but I eliminated that.

> 
> [ quoting this from the top: ]
> 
> > > finally managed to get the time to review your CPA patchset, and i 
> > > fundamentally agree with most of the detail changes done in it. But 
> > > here are a few structural high-level observations:
> > 
> > I have a few changes and will post a updated version shortly.
> 
> thanks. I've test-merged the PAT patchset from Venki/Suresh to have a 
> look, and to me the most logical way to layer these changes would be:
> 
>  - PAT
>  - non-cflush bits of CPA
>  - cflush bits of CPA

I would prefer not to reorder it significantly within itself.

I took pains to make it all in small bisectable steps, but redoing
all that would be very painful and time-consuming for me. Also, currently
it is relatively logical in the way it builds on itself, and a dumb separation
into clflush vs. no clflush wouldn't help with that.

A merge after PAT is fine for me if PAT happens soon (iirc there were
only minor changes to cpa anyways), although it's a pity to delay
some of the fixes for so long. But I haven't seen a PAT update recently
and at least the original patch needed much more work. So I think
you should only delay it for PAT if that happens soon.
I had actually hoped that CPA would be ready for .25, but I have my doubts
about PAT making that.

One possibility, if you're worried about CLFLUSH so much, would be to
merge the whole series but keep CLFLUSH disabled.  I don't think it would
improve things much though, and the only way to get it shaken out is wider
testing anyways.


-Andi

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: CPA patchset
  2008-01-10 10:57         ` Ingo Molnar
@ 2008-01-10 11:12           ` Andi Kleen
  2008-01-11  7:19             ` Ingo Molnar
  0 siblings, 1 reply; 51+ messages in thread
From: Andi Kleen @ 2008-01-10 11:12 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andi Kleen, linux-kernel, Thomas Gleixner, H. Peter Anvin,
	Venki Pallipadi, suresh.b.siddha, Arjan van de Ven, Dave Jones

On Thu, Jan 10, 2008 at 11:57:26AM +0100, Ingo Molnar wrote:
> 
> > > > >   WBINVD isnt particular fast (takes a few msecs), but why is 
> > > > >   that a problem? Drivers dont do high-frequency ioremap-ing. 
> > > > >   It's typically only done at driver/device startup and that's 
> > > > >   it.
> > > > 
> > > > Actually graphics drivers can do higher frequency allocation of WC 
> > > > memory (with PAT) support.
> > > 
> > > but that's not too smart: why dont they use WB plus cflush instead?
> > 
> > Because they need to access it WC for performance.
> 
> I think you have it fundamentally backwards: the best for performance is 
> WB + cflush. What would WC offer for performance that cflush cannot do?

Cached requires the cache line to be read first before you can write it.

WC on the other hand does not allocate a cache line and just dumps
the data into a special write combining buffer.  It was invented originally 
because reads from AGP were incredibly slow.

And it's race-free with regard to the caching protocol (assuming you flush
the caches and TLBs correctly).

Another typical problem is that if something is uncached then you can't
have it in any other cache, because if that cache eventually flushes the
line it will corrupt the data. That can happen with remapping apertures,
for example, which remap data behind the CPU's back.

CLFLUSH is really only a hint but it cannot be used if UC is needed
for correctness.

> also, it's irrelevant to change_page_attr() call frequency. Just map in 
> everything from the card and use it. In graphics, if you remap anything 
> on the fly and it's not a slowpath you've lost the performance game even 
> before you began it.

The typical case would be lots of user space DRI clients supplying
their own buffers on the fly. There's not really a fixed pool 
in this case, but it all varies dynamically. In some scenarios
that could happen quite often.

-Andi


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: CPA patchset
  2008-01-10 11:07       ` Andi Kleen
@ 2008-01-10 12:22         ` Ingo Molnar
  2008-01-10 12:39           ` Andi Kleen
  0 siblings, 1 reply; 51+ messages in thread
From: Ingo Molnar @ 2008-01-10 12:22 UTC (permalink / raw)
  To: Andi Kleen
  Cc: linux-kernel, Thomas Gleixner, H. Peter Anvin, Venki Pallipadi,
	suresh.b.siddha, Arjan van de Ven, Dave Jones


* Andi Kleen <ak@suse.de> wrote:

> > What is very real though are the hard limitations of MTRRs. So i'd 
> > rather first like to see a clean PAT approach (which all other 
> > modern OSs have already migrated to in the past 10 years)
> 
> That's mostly orthogonal. Don't know why you bring it up now?

because the PAT (Page Attribute Table support) patchset and the CPA 
(change_page_attr()) patchset are not orthogonal at all - as their
names already signal, they change the implementation/effects of
the same interface(s). [just at different levels].

Both patchsets change how the kernel pagetable caching is handled. PAT 
changes the kernel pte details and unshackles us from MTRR reliance and 
thus solves real problems on real boxes:

 55 files changed, 1288 insertions(+), 237 deletions(-)

CPA moves change_page_attr() from wbinvd flushing to cflush flushing, for 
a speedup/latency-win, plus a whole bunch of intermingled fixes and 
improvements to page attribute modification:

 26 files changed, 882 insertions(+), 423 deletions(-)

so in terms of risk management, the "perfect patch order" is:

 - minimal set of correctness fixes to the highlevel cpa code.

 - ( then any provably NOP cleanups to pave the way. )

 - then change the lowlevel pte code (PAT) to reduce/eliminate the need 
   to have runtime MTRR use

 - then structural improvements/cleanups of the highlevel cpa code

 - then the cflush (optional) performance feature ontop of it.

 - then gigabyte-largepages/TLBs support [new CPU feature that further
   complicates page-attribute management]

All in an easy-to-revert fashion. We _will_ regress here, and this stuff 
is very hard to debug.

	Ingo

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: CPA patchset
  2008-01-10 12:22         ` Ingo Molnar
@ 2008-01-10 12:39           ` Andi Kleen
  0 siblings, 0 replies; 51+ messages in thread
From: Andi Kleen @ 2008-01-10 12:39 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andi Kleen, linux-kernel, Thomas Gleixner, H. Peter Anvin,
	Venki Pallipadi, suresh.b.siddha, Arjan van de Ven, Dave Jones

On Thu, Jan 10, 2008 at 01:22:04PM +0100, Ingo Molnar wrote:
> 
> * Andi Kleen <ak@suse.de> wrote:
> 
> > > What is very real though are the hard limitations of MTRRs. So i'd 
> > > rather first like to see a clean PAT approach (which all other 
> > > modern OSs have already migrated to in the past 10 years)
> > 
> > That's mostly orthogonal. Don't know why you bring it up now?
> 
> because the PAT (Page Attribute Table support) patchset and the CPA 
> (change_page_attr()) patchset are are not orthogonal at all - as their 
> name already signals: because they change the implementation/effects of 
> the same interface(s). [just at different levels].
> 
> Both patchsets change how the kernel pagetable caching is handled. PAT 
> changes the kernel pte details and unshackles us from MTRR reliance and 
> thus solves real problems on real boxes:
> 
>  55 files changed, 1288 insertions(+), 237 deletions(-)
> 
> CPA moves change_page_attr() from invwb flushing to cflush flushing, for 
> a speedup/latency-win, plus a whole bunch of intermingled fixes and 
> improvements to page attribute modification:
> 
>  26 files changed, 882 insertions(+), 423 deletions(-)
> 
> so in terms of risk management, the "perfect patch order" is:
> 
>  - minimal_set of correctness fixes to the highlevel cpa code.

The two clear bug fix patches are refcount and flush order.

refcnt could be moved earlier; flush order would be quite painful
because there are quite a lot of patches dependent on it.

I could move ref count earlier, but I would prefer not to because
of the significant work it would be for me. 

Since it is all already bisectable I'm also not sure what debugging
advantages the reordering would bring.

It's already bisectable (I have not booted all intermediate steps,
but several of them, and I believe all of them compile) and it is in small
pieces for that reason.

If it's really broken it would need to be reverted and then the ref count 
stuff would go too.  But I hope that won't be needed. 

And even losing the reference count fixes wouldn't be catastrophic -- in the
worst case you lose some minor performance because kernel mappings are 
unnecessarily split to 4K pages, but it's not a correctness fix. 

So while the reordering would be possible, it would imho not
bring many advantages, and I must admit I'm not too motivated
to do another time-consuming reordering for a relatively weak reason.

If it was a 5 patch series I would probably not complain too much
about this, but it's 25+ patches.

>  - ( then any provably NOP cleanups to pave the way. )
> 
>  - then change the lowlevel pte code (PAT) to reduce/eliminate the need 
>    to have runtime MTRR use

That's a completely different area really. Most of the PAT code
has nothing to do with change_page_attr(). Please think that through
again after reading both patchkits.

As far as I can see, making it wait for PAT would mean delaying
it for a longer time, which would be a pity.


>  - then structural improvements/cleanups of the highlevel cpa code
> 
>  - then the cflush (optional) performance feature ontop of it.

There are actually a lot more performance features in there (self-snoop,
minimal TLB flushing, some other stuff). Most of it is related to
that in fact.
 
> 
>  - then gigabyte-largepages/TLBs support [new CPU feature that further
>    complicates page-attribute management]

That's already at the end.

> 
> All in an easy-to-revert fashion. We _will_ regress here, and this stuff 

That's already the case.

-Andi


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: CPA patchset
  2008-01-10 11:12           ` Andi Kleen
@ 2008-01-11  7:19             ` Ingo Molnar
  2008-01-11  7:33               ` Ingo Molnar
                                 ` (2 more replies)
  0 siblings, 3 replies; 51+ messages in thread
From: Ingo Molnar @ 2008-01-11  7:19 UTC (permalink / raw)
  To: Andi Kleen
  Cc: linux-kernel, Thomas Gleixner, H. Peter Anvin, Venki Pallipadi,
	suresh.b.siddha, Arjan van de Ven, Dave Jones


* Andi Kleen <ak@suse.de> wrote:

> > > > but that's not too smart: why dont they use WB plus cflush 
> > > > instead?
> > > 
> > > Because they need to access it WC for performance.
> > 
> > I think you have it fundamentally backwards: the best for 
> > performance is WB + cflush. What would WC offer for performance that 
> > cflush cannot do?
> 
> Cached requires the cache line to be read first before you can write 
> it.

nonsense, and you should know it. It is perfectly possible to construct
fully written cachelines, without reading the cacheline first. MOVDQ is
SSE1, so it is in basically every CPU today - and it is 16-byte aligned and
can generate full cacheline writes, _without_ filling in the cacheline
first. Bulk ops (string ops, etc.) will do full cacheline writes too,
without filling in the cacheline. Especially with high-performance 3D
ops we do _NOT_ need any funky reads from anywhere, because 3D software
can stream a lot of writes out: we construct a full frame or a portion
of a frame, or upload vertices or shader scripts, textures, etc.

( also, _even_ when there is a cache fill pending on for a partially
  written cacheline, that might go on in parallel and it is not 
  necessarily holding up the CPU unless it has an actual data dependency 
  on that. )

but that's totally besides the point anyway. WC or WB accesses, if a 3D 
app or a driver does high-freq change_page_attr() calls, it will _lose_ 
the performance game:

> > also, it's irrelevant to change_page_attr() call frequency. Just map 
> > in everything from the card and use it. In graphics, if you remap 
> > anything on the fly and it's not a slowpath you've lost the 
> > performance game even before you began it.
> 
> The typical case would be lots of user space DRI clients supplying 
> their own buffers on the fly. There's not really a fixed pool in this 
> case, but it all varies dynamically. In some scenarios that could 
> happen quite often.

in what scenarios? Please give me in-tree examples of such high-freq 
change_page_attr() cases, where the driver authors would like to call it 
with high frequency but are unable to do it and see performance problems 
due to the WBINVD.

	Ingo

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: CPA patchset
  2008-01-11  7:19             ` Ingo Molnar
@ 2008-01-11  7:33               ` Ingo Molnar
  2008-01-11 11:28                 ` Andi Kleen
  2008-01-11 11:26               ` Andi Kleen
  2008-01-11 17:02               ` dean gaudet
  2 siblings, 1 reply; 51+ messages in thread
From: Ingo Molnar @ 2008-01-11  7:33 UTC (permalink / raw)
  To: Andi Kleen
  Cc: linux-kernel, Thomas Gleixner, H. Peter Anvin, Venki Pallipadi,
	suresh.b.siddha, Arjan van de Ven, Dave Jones


* Ingo Molnar <mingo@elte.hu> wrote:

> > > I think you have it fundamentally backwards: the best for 
> > > performance is WB + cflush. What would WC offer for performance 
> > > that cflush cannot do?
> > 
> > Cached requires the cache line to be read first before you can write 
> > it.
> 
> nonsense, and you should know it. It is perfectly possible to 
> construct fully written cachelines, without reading the cacheline 
> first. MOVDQ is SSE1 so on basically in every CPU today - and it is 16 
> byte aligned and can generate full cacheline writes, _without_ filling 
> in the cacheline first. Bulk ops (string ops, etc.) will do full 
> cacheline writes too, without filling in the cacheline. Especially 
> with high performance 3D ops we do _NOT_ need any funky reads from 
> anywhere because 3D software can stream a lot of writes out: we 
> construct a full frame or a portion of a frame, or upload vertices or 
> shader scripts, textures, etc.
> 
> ( also, _even_ when there is a cache fill pending on for a partially
>   written cacheline, that might go on in parallel and it is not 
>   necessarily holding up the CPU unless it has an actual data dependency 
>   on that. )

btw., that's not to suggest that WC does not have its place - but it's 
not where you think it is.

Write-Combining can be very useful for devices that are behind a slow or
a high-latency transport, such as PCI, and which are mapped UnCached
today - such as networking card (or storage card) descriptor rings. New
entries in the TX ring are typically constructed on the CPU side without
knowing anything about their contents (and hence there is no need to read
them; the drivers typically shadow the ring on the host side already, as
is required for good UC performance today), so the new descriptors can be
streamed out efficiently with WC, instead of generating lots of small UC
accesses. But these descriptor rings have a constant size, they have a
fixed place in the aperture and are ioremap()ed once during device init
and that's it. No high-frequency change_page_attr() usage is done
or desired here.

	Ingo

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: CPA patchset
  2008-01-11  7:19             ` Ingo Molnar
  2008-01-11  7:33               ` Ingo Molnar
@ 2008-01-11 11:26               ` Andi Kleen
  2008-01-11 17:02               ` dean gaudet
  2 siblings, 0 replies; 51+ messages in thread
From: Andi Kleen @ 2008-01-11 11:26 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andi Kleen, linux-kernel, Thomas Gleixner, H. Peter Anvin,
	Venki Pallipadi, suresh.b.siddha, Arjan van de Ven, Dave Jones

> It is perfectly possible to construct 
> fully written cachelines, without reading the cacheline first. MOVDQ is 

Only if you write an aligned full 64 (or 128) byte area - and even then you can
have occasional reads, which can be either painfully slow or even incorrect.

> but that's totally besides the point anyway. WC or WB accesses, if a 3D 
> app or a driver does high-freq change_page_attr() calls, it will _lose_ 
> the performance game:

Yes, high frequency as in doing it in fast paths is not a good idea, but
reasonably low frequency (as in acceptable process exit latencies, for
example) is something to aim for. Right now, with WBINVD and other problems,
it is too slow.

> > > in everything from the card and use it. In graphics, if you remap 
> > > anything on the fly and it's not a slowpath you've lost the 
> > > performance game even before you began it.
> > 
> > The typical case would be lots of user space DRI clients supplying 
> > their own buffers on the fly. There's not really a fixed pool in this 
> > case, but it all varies dynamically. In some scenarios that could 
> > happen quite often.
> 
> in what scenarios? Please give me in-tree examples of such high-freq 
> change_page_attr() cases, where the driver authors would like to call it 
> with high frequency but are unable to do it and see performance problems 
> due to the WBINVD.

Some workloads do regular mapping into the GART aperture, but it is
not too critical yet.

But it is not too widely used because it is too slow; I've got
requests from various parties over the years for a more efficient c_p_a().
It's a chicken'n'egg problem -- you're asking for users, but the users
don't use it yet because it's too slow.

-Andi

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: CPA patchset
  2008-01-11  7:33               ` Ingo Molnar
@ 2008-01-11 11:28                 ` Andi Kleen
  0 siblings, 0 replies; 51+ messages in thread
From: Andi Kleen @ 2008-01-11 11:28 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andi Kleen, linux-kernel, Thomas Gleixner, H. Peter Anvin,
	Venki Pallipadi, suresh.b.siddha, Arjan van de Ven, Dave Jones

> Write-Combining can be very useful for devices that are behind a slow or 
> a high-latency transport, such as PCI, and which are mapped UnCached 

That is what I wrote! If you meant the same thing, we must have been
miscommunicating spectacularly.

-Andi

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: CPA patchset
  2008-01-11  7:19             ` Ingo Molnar
  2008-01-11  7:33               ` Ingo Molnar
  2008-01-11 11:26               ` Andi Kleen
@ 2008-01-11 17:02               ` dean gaudet
  2008-01-11 17:18                 ` Arjan van de Ven
  2008-01-11 17:56                 ` dean gaudet
  2 siblings, 2 replies; 51+ messages in thread
From: dean gaudet @ 2008-01-11 17:02 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andi Kleen, linux-kernel, Thomas Gleixner, H. Peter Anvin,
	Venki Pallipadi, suresh.b.siddha, Arjan van de Ven, Dave Jones

On Fri, 11 Jan 2008, Ingo Molnar wrote:

> * Andi Kleen <ak@suse.de> wrote:
> 
> > Cached requires the cache line to be read first before you can write 
> > it.
> 
> nonsense, and you should know it. It is perfectly possible to construct 
> fully written cachelines, without reading the cacheline first. MOVDQ is 
> SSE1 so on basically in every CPU today - and it is 16 byte aligned and 
> can generate full cacheline writes, _without_ filling in the cacheline 
> first.

did you mean to write MOVNTPS above?


> Bulk ops (string ops, etc.) will do full cacheline writes too, 
> without filling in the cacheline.

on intel with fast strings enabled, yes.  mind you, intel gives hints in
the documentation that these operations don't respect coherence... and i
asked about this when they posted their memory ordering paper but got no
response.

-dean

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: CPA patchset
  2008-01-11 17:02               ` dean gaudet
@ 2008-01-11 17:18                 ` Arjan van de Ven
  2008-01-11 17:56                 ` dean gaudet
  1 sibling, 0 replies; 51+ messages in thread
From: Arjan van de Ven @ 2008-01-11 17:18 UTC (permalink / raw)
  To: dean gaudet
  Cc: Ingo Molnar, Andi Kleen, linux-kernel, Thomas Gleixner,
	H. Peter Anvin, Venki Pallipadi, suresh.b.siddha, Dave Jones

On Fri, 11 Jan 2008 09:02:46 -0800 (PST)
dean gaudet <dean@arctic.org> wrote:

> > Bulk ops (string ops, etc.) will do full cacheline writes too, 
> > without filling in the cacheline.
> 
> on intel with fast strings enabled yes.  mind you intel gives hints in
> the documentation these operations don't respect coherence... and i
> asked about this when they posted their memory ordering paper but got
> no response.

I know there's an answer on the way; it just needs to get checked by
architects for cpus we shipped in like the last 10 years ;)

-- 
If you want to reach me at my work email, use arjan@linux.intel.com
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: CPA patchset
  2008-01-11 17:02               ` dean gaudet
  2008-01-11 17:18                 ` Arjan van de Ven
@ 2008-01-11 17:56                 ` dean gaudet
  1 sibling, 0 replies; 51+ messages in thread
From: dean gaudet @ 2008-01-11 17:56 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andi Kleen, linux-kernel, Thomas Gleixner, H. Peter Anvin,
	Venki Pallipadi, suresh.b.siddha, Arjan van de Ven, Dave Jones

On Fri, 11 Jan 2008, dean gaudet wrote:

> On Fri, 11 Jan 2008, Ingo Molnar wrote:
> 
> > * Andi Kleen <ak@suse.de> wrote:
> > 
> > > Cached requires the cache line to be read first before you can write 
> > > it.
> > 
> > nonsense, and you should know it. It is perfectly possible to construct 
> > fully written cachelines, without reading the cacheline first. MOVDQ is 
> > SSE1 so on basically in every CPU today - and it is 16 byte aligned and 
> > can generate full cacheline writes, _without_ filling in the cacheline 
> > first.
> 
> did you mean to write MOVNTPS above?

btw in case you were thinking of a normal store to WB rather than a
non-temporal store... i ran a microbenchmark streaming stores to every 16
bytes of a 16MiB region aligned to 4096 bytes on a xeon 53xx series CPU
(4MiB L2) + 5000X northbridge, and the avg latency of MOVNTPS is 12 cycles
whereas the avg latency of MOVAPS is 20 cycles.

the inner loop is unrolled 16 times so there are literally 4 cache lines 
worth of stores being stuffed into the store queue as fast as possible... 
and there's no coalescing for normal stores even on this modern CPU.
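
for reference, a simplified sketch of such an inner loop with SSE
intrinsics (4x unrolled rather than 16x, and not the actual benchmark
code; buf is assumed 16-byte aligned and nfloats a multiple of 16):

	#include <xmmintrin.h>	/* SSE: _mm_store_ps / _mm_stream_ps */

	/* one 16-byte store per 16 bytes of buf, normal vs. non-temporal */
	static void fill(float *buf, unsigned long nfloats, int streaming)
	{
		__m128 v = _mm_set1_ps(1.0f);
		unsigned long i;

		for (i = 0; i < nfloats; i += 16) {
			if (streaming) {
				_mm_stream_ps(buf + i + 0,  v);
				_mm_stream_ps(buf + i + 4,  v);
				_mm_stream_ps(buf + i + 8,  v);
				_mm_stream_ps(buf + i + 12, v);
			} else {
				_mm_store_ps(buf + i + 0,  v);
				_mm_store_ps(buf + i + 4,  v);
				_mm_store_ps(buf + i + 8,  v);
				_mm_store_ps(buf + i + 12, v);
			}
		}
		_mm_sfence();	/* make the non-temporal stores globally visible */
	}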

i'm certain i'll see the same thing on AMD... it's a very hard thing to do 
in hardware without the non-temporal hint.

-dean



^ permalink raw reply	[flat|nested] 51+ messages in thread

end of thread, other threads:[~2008-01-11 17:57 UTC | newest]

Thread overview: 51+ messages
2008-01-03 15:24 [PATCH CPA] [1/28] Shrink __PAGE_KERNEL/__PAGE_KERNEL_EXEC on non PAE kernels Andi Kleen
2008-01-03 15:24 ` [PATCH CPA] [2/28] CPA: Do a simple self test at boot Andi Kleen
2008-01-03 15:24 ` [PATCH CPA] [3/28] Add pte accessors for the global bit Andi Kleen
2008-01-03 15:24 ` [PATCH CPA] [4/28] Add pte_clrhuge on i386 Andi Kleen
2008-01-05  6:48   ` Jeremy Fitzhardinge
2008-01-03 15:24 ` [PATCH CPA] [5/28] Add pte_pgprot " Andi Kleen
2008-01-03 15:24 ` [PATCH CPA] [6/28] CPA: Undo white space changes Andi Kleen
2008-01-03 15:24 ` [PATCH CPA] [7/28] Extract page table dumping code from i386 fault handler into dump_pagetable() Andi Kleen
2008-01-03 15:24 ` [PATCH CPA] [8/28] CPA: Return the page table level in lookup_address() Andi Kleen
2008-01-03 15:24 ` [PATCH CPA] [9/28] CPA: Add simple self test at boot Andi Kleen
2008-01-03 15:24 ` [PATCH CPA] [10/28] CPA: Change kernel_map_pages to not use c_p_a() Andi Kleen
2008-01-03 15:24 ` [PATCH CPA] [11/28] CPA: Change 32bit back to init_mm semaphore locking Andi Kleen
2008-01-03 15:24 ` [PATCH CPA] [12/28] CPA: CLFLUSH support in change_page_attr() Andi Kleen
2008-01-03 15:24 ` [PATCH CPA] [13/28] CPA: Use macros to modify the PG_arch_1 page flags in change_page_attr Andi Kleen
2008-01-03 15:24 ` [PATCH CPA] [14/28] CPA: Use page granuality TLB flushing " Andi Kleen
2008-01-03 15:24 ` [PATCH CPA] [15/28] CPA: Don't flush the caches when the CPU supports self-snoop Andi Kleen
2008-01-03 15:24 ` [PATCH CPA] [16/28] CPA: Use wbinvd() macro instead of inline assembly in 64bit c_p_a() Andi Kleen
2008-01-03 15:24 ` [PATCH CPA] [17/28] CPA: Reorder TLB / cache flushes to follow Intel recommendation Andi Kleen
2008-01-03 15:24 ` [PATCH CPA] [18/28] CPA: Make change_page_attr() more robust against use of PAT bits Andi Kleen
2008-01-03 15:24 ` [PATCH CPA] [19/28] CPA: Limit cache flushing to pages that really change caching Andi Kleen
2008-01-03 15:24 ` [PATCH CPA] [20/28] CPA: Fix inaccurate comments in 64bit change_page_attr() Andi Kleen
2008-01-03 15:24 ` [PATCH CPA] [21/28] CPA: Dump pagetable when inconsistency is detected Andi Kleen
2008-01-03 15:24 ` [PATCH CPA] [22/28] CPA: Only queue actually unused page table pages for freeing Andi Kleen
2008-01-03 15:24 ` [PATCH CPA] [23/28] CPA: Remove unnecessary masking of address Andi Kleen
2008-01-03 15:24 ` [PATCH CPA] [24/28] CPA: Only unmap kernel init pages in text mapping when CONFIG_DEBUG_RODATA is set Andi Kleen
2008-01-03 15:24 ` [PATCH CPA] [25/28] CPA: Always do full TLB flush when splitting large pages Andi Kleen
2008-01-03 15:24 ` [PATCH CPA] [26/28] CPA: Fix reference counting when changing already changed pages Andi Kleen
2008-01-03 15:24 ` [PATCH CPA] [27/28] CPA: Change comments of external interfaces to kerneldoc format Andi Kleen
2008-01-03 15:24 ` [PATCH CPA] [28/28] Make kernel_text test match boot mapping initialization Andi Kleen
2008-01-10  9:31 ` CPA patchset Ingo Molnar
2008-01-10  9:44   ` Dave Airlie
2008-01-10  9:55     ` Andi Kleen
2008-01-10 10:20       ` Dave Airlie
2008-01-10 10:50         ` Andi Kleen
2008-01-10 10:02     ` Ingo Molnar
2008-01-10  9:53   ` Andi Kleen
2008-01-10 10:04     ` Ingo Molnar
2008-01-10 10:07       ` Andi Kleen
2008-01-10 10:57         ` Ingo Molnar
2008-01-10 11:12           ` Andi Kleen
2008-01-11  7:19             ` Ingo Molnar
2008-01-11  7:33               ` Ingo Molnar
2008-01-11 11:28                 ` Andi Kleen
2008-01-11 11:26               ` Andi Kleen
2008-01-11 17:02               ` dean gaudet
2008-01-11 17:18                 ` Arjan van de Ven
2008-01-11 17:56                 ` dean gaudet
2008-01-10 10:43     ` Ingo Molnar
2008-01-10 11:07       ` Andi Kleen
2008-01-10 12:22         ` Ingo Molnar
2008-01-10 12:39           ` Andi Kleen
