* [PATCH 0/5] [v3] fix illegal use of __pa() in KVM code
From: Dave Hansen @ 2013-01-22 21:24 UTC
  To: linux-kernel
  Cc: linux-mm, Gleb Natapov, H. Peter Anvin, x86, Marcelo Tosatti,
	Rik van Riel, Dave Hansen

This series fixes a hard-to-debug early boot hang on 32-bit
NUMA systems.  It adds coverage to the debugging code,
adds some helpers, and eventually fixes the original bug I
was hitting.

[v3]
 * Remove unnecessary initializations in slow_virt_to_phys(),
   and convert 'level' to use the new enum pg_level.
 * Add some text to slow_virt_to_phys() to make it clear
   that we are not using it in any remotely fast paths.
[v2]
 * Moved DEBUG_VIRTUAL patch earlier in the series (it has no
   dependencies on anything else and stands on its own).
 * Created page_level_*() helpers to replace a nasty switch()
   at hpa's suggestion
 



* [PATCH 1/5] make DEBUG_VIRTUAL work earlier in boot
From: Dave Hansen @ 2013-01-22 21:24 UTC
  To: linux-kernel
  Cc: linux-mm, Gleb Natapov, H. Peter Anvin, x86, Marcelo Tosatti,
	Rik van Riel, Dave Hansen


The KVM code has some repeated bugs in it around use of __pa() on
per-cpu data.  Those data are not in an area on which using
__pa() is valid.  However, they are also called early enough in
boot that __vmalloc_start_set is not set, and thus the
CONFIG_DEBUG_VIRTUAL debugging does not catch them.

This adds a check to also verify __pa() calls against max_low_pfn,
which we can use earlier in boot than is_vmalloc_addr().  However,
if we are super-early in boot, max_low_pfn=0 and this will trip
on every call, so also make sure that max_low_pfn is set before
we try to use it.

With this patch applied, CONFIG_DEBUG_VIRTUAL will actually
catch the bug I was chasing (and fix later in this series).

I'd love to find a generic way so that any __pa() call on percpu
areas could do a BUG_ON(), but there don't appear to be any nice
and easy ways to check if an address is a percpu one.  Anybody
have ideas on a way to do this?

Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com>
---

 linux-2.6.git-dave/arch/x86/mm/numa.c     |    2 +-
 linux-2.6.git-dave/arch/x86/mm/pat.c      |    4 ++--
 linux-2.6.git-dave/arch/x86/mm/physaddr.c |    9 ++++++++-
 3 files changed, 11 insertions(+), 4 deletions(-)

diff -puN arch/x86/mm/numa.c~make-DEBUG_VIRTUAL-work-earlier-in-boot arch/x86/mm/numa.c
--- linux-2.6.git/arch/x86/mm/numa.c~make-DEBUG_VIRTUAL-work-earlier-in-boot	2013-01-22 13:17:15.128306679 -0800
+++ linux-2.6.git-dave/arch/x86/mm/numa.c	2013-01-22 13:17:15.136306745 -0800
@@ -219,7 +219,7 @@ static void __init setup_node_data(int n
 	 */
 	nd = alloc_remap(nid, nd_size);
 	if (nd) {
-		nd_pa = __pa(nd);
+		nd_pa = __phys_addr_nodebug(nd);
 		remapped = true;
 	} else {
 		nd_pa = memblock_alloc_nid(nd_size, SMP_CACHE_BYTES, nid);
diff -puN arch/x86/mm/pat.c~make-DEBUG_VIRTUAL-work-earlier-in-boot arch/x86/mm/pat.c
--- linux-2.6.git/arch/x86/mm/pat.c~make-DEBUG_VIRTUAL-work-earlier-in-boot	2013-01-22 13:17:15.128306679 -0800
+++ linux-2.6.git-dave/arch/x86/mm/pat.c	2013-01-22 13:17:15.136306745 -0800
@@ -560,10 +560,10 @@ int kernel_map_sync_memtype(u64 base, un
 {
 	unsigned long id_sz;
 
-	if (base >= __pa(high_memory))
+	if (base > __pa(high_memory-1))
 		return 0;
 
-	id_sz = (__pa(high_memory) < base + size) ?
+	id_sz = (__pa(high_memory-1) <= base + size) ?
 				__pa(high_memory) - base :
 				size;
 
diff -puN arch/x86/mm/physaddr.c~make-DEBUG_VIRTUAL-work-earlier-in-boot arch/x86/mm/physaddr.c
--- linux-2.6.git/arch/x86/mm/physaddr.c~make-DEBUG_VIRTUAL-work-earlier-in-boot	2013-01-22 13:17:15.132306712 -0800
+++ linux-2.6.git-dave/arch/x86/mm/physaddr.c	2013-01-22 13:17:15.136306745 -0800
@@ -1,3 +1,4 @@
+#include <linux/bootmem.h>
 #include <linux/mmdebug.h>
 #include <linux/module.h>
 #include <linux/mm.h>
@@ -47,10 +48,16 @@ EXPORT_SYMBOL(__virt_addr_valid);
 #ifdef CONFIG_DEBUG_VIRTUAL
 unsigned long __phys_addr(unsigned long x)
 {
+	unsigned long phys_addr = x - PAGE_OFFSET;
 	/* VMALLOC_* aren't constants  */
 	VIRTUAL_BUG_ON(x < PAGE_OFFSET);
 	VIRTUAL_BUG_ON(__vmalloc_start_set && is_vmalloc_addr((void *) x));
-	return x - PAGE_OFFSET;
+	/* max_low_pfn is set early, but not _that_ early */
+	if (max_low_pfn) {
+		VIRTUAL_BUG_ON((phys_addr >> PAGE_SHIFT) > max_low_pfn);
+		BUG_ON(slow_virt_to_phys((void *)x) != phys_addr);
+	}
+	return phys_addr;
 }
 EXPORT_SYMBOL(__phys_addr);
 #endif
_
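
For illustration, a minimal sketch of what the strengthened check
catches (assuming a 32-bit kernel with max_low_pfn already set; the
kmalloc()/vmalloc() calls here are hypothetical):

    void *lin = kmalloc(64, GFP_KERNEL); /* inside the linear map  */
    void *vm  = vmalloc(64);             /* outside the linear map */

    __pa(lin); /* ok: (addr - PAGE_OFFSET) >> PAGE_SHIFT stays at or
                * below max_low_pfn and matches the pagetables */
    __pa(vm);  /* now trips VIRTUAL_BUG_ON() even before
                * __vmalloc_start_set is set, since the computed pfn
                * lands above max_low_pfn */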



* [PATCH 2/5] pagetable level size/shift/mask helpers
From: Dave Hansen @ 2013-01-22 21:24 UTC
  To: linux-kernel
  Cc: linux-mm, Gleb Natapov, H. Peter Anvin, x86, Marcelo Tosatti,
	Rik van Riel, Dave Hansen


I plan to use lookup_address() to walk the kernel pagetables
in a later patch.  It returns a "pte" and the level in the
pagetables where the "pte" was found.  The level is just an
enum and needs to be converted to a useful value in order to
do address calculations with it.  These helpers will be used
in at least two places.

This also gives the anonymous enum a real name so that no one
gets confused about what they should be passing in to these
helpers.

"PTE_SHIFT" was chosen for naming consistency with the other
pagetable levels (PGD/PUD/PMD_SHIFT).

Cc: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com>
---

 linux-2.6.git-dave/arch/x86/include/asm/pgtable.h       |   14 ++++++++++++++
 linux-2.6.git-dave/arch/x86/include/asm/pgtable_types.h |    2 +-
 2 files changed, 15 insertions(+), 1 deletion(-)

diff -puN arch/x86/include/asm/pgtable.h~pagetable-level-size-helpers arch/x86/include/asm/pgtable.h
--- linux-2.6.git/arch/x86/include/asm/pgtable.h~pagetable-level-size-helpers	2013-01-22 13:17:15.464309478 -0800
+++ linux-2.6.git-dave/arch/x86/include/asm/pgtable.h	2013-01-22 13:17:15.472309544 -0800
@@ -390,6 +390,7 @@ pte_t *populate_extra_pte(unsigned long
 
 #ifndef __ASSEMBLY__
 #include <linux/mm_types.h>
+#include <linux/log2.h>
 
 static inline int pte_none(pte_t pte)
 {
@@ -781,6 +782,19 @@ static inline void clone_pgd_range(pgd_t
        memcpy(dst, src, count * sizeof(pgd_t));
 }
 
+#define PTE_SHIFT ilog2(PTRS_PER_PTE)
+static inline int page_level_shift(enum pg_level level)
+{
+	return (PAGE_SHIFT - PTE_SHIFT) + level * PTE_SHIFT;
+}
+static inline unsigned long page_level_size(enum pg_level level)
+{
+	return 1UL << page_level_shift(level);
+}
+static inline unsigned long page_level_mask(enum pg_level level)
+{
+	return ~(page_level_size(level) - 1);
+}
 
 #include <asm-generic/pgtable.h>
 #endif	/* __ASSEMBLY__ */
diff -puN arch/x86/include/asm/pgtable_types.h~pagetable-level-size-helpers arch/x86/include/asm/pgtable_types.h
--- linux-2.6.git/arch/x86/include/asm/pgtable_types.h~pagetable-level-size-helpers	2013-01-22 13:17:15.468309511 -0800
+++ linux-2.6.git-dave/arch/x86/include/asm/pgtable_types.h	2013-01-22 13:17:15.472309544 -0800
@@ -331,7 +331,7 @@ extern void native_pagetable_init(void);
 struct seq_file;
 extern void arch_report_meminfo(struct seq_file *m);
 
-enum {
+enum pg_level {
 	PG_LEVEL_NONE,
 	PG_LEVEL_4K,
 	PG_LEVEL_2M,
_
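
As a sanity check, here is what the helpers work out to (assuming the
usual 512-entry tables, i.e. PTE_SHIFT == 9, as on 64-bit or PAE):

    /* page_level_shift(l) == (PAGE_SHIFT - PTE_SHIFT) + l * PTE_SHIFT */
    page_level_shift(PG_LEVEL_4K); /* (12 - 9) + 1 * 9 == 12 */
    page_level_shift(PG_LEVEL_2M); /* (12 - 9) + 2 * 9 == 21 */
    page_level_shift(PG_LEVEL_1G); /* (12 - 9) + 3 * 9 == 30 */

    page_level_size(PG_LEVEL_2M); /* 1UL << 21 == 2M */
    page_level_mask(PG_LEVEL_2M); /* ~(2M - 1): clears the in-page bits */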



* [PATCH 3/5] use new pagetable helpers in try_preserve_large_page()
From: Dave Hansen @ 2013-01-22 21:24 UTC
  To: linux-kernel
  Cc: linux-mm, Gleb Natapov, H. Peter Anvin, x86, Marcelo Tosatti,
	Rik van Riel, Dave Hansen


try_preserve_large_page() can be slightly simplified by using
the new page_level_*() helpers.  This also moves the 'level'
over to the new pg_level enum type.

Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com>
---

 linux-2.6.git-dave/arch/x86/mm/pageattr.c |   11 ++++-------
 1 file changed, 4 insertions(+), 7 deletions(-)

diff -puN arch/x86/mm/pageattr.c~use-new-pagetable-helpers arch/x86/mm/pageattr.c
--- linux-2.6.git/arch/x86/mm/pageattr.c~use-new-pagetable-helpers	2013-01-22 13:17:15.792312210 -0800
+++ linux-2.6.git-dave/arch/x86/mm/pageattr.c	2013-01-22 13:17:15.796312243 -0800
@@ -396,7 +396,7 @@ try_preserve_large_page(pte_t *kpte, uns
 	pte_t new_pte, old_pte, *tmp;
 	pgprot_t old_prot, new_prot, req_prot;
 	int i, do_split = 1;
-	unsigned int level;
+	enum pg_level level;
 
 	if (cpa->force_split)
 		return 1;
@@ -412,15 +412,12 @@ try_preserve_large_page(pte_t *kpte, uns
 
 	switch (level) {
 	case PG_LEVEL_2M:
-		psize = PMD_PAGE_SIZE;
-		pmask = PMD_PAGE_MASK;
-		break;
 #ifdef CONFIG_X86_64
 	case PG_LEVEL_1G:
-		psize = PUD_PAGE_SIZE;
-		pmask = PUD_PAGE_MASK;
-		break;
 #endif
+		psize = page_level_size(level);
+		pmask = page_level_mask(level);
+		break;
 	default:
 		do_split = -EINVAL;
 		goto out_unlock;
_
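
The combined fallthrough works because the helpers reproduce the old
per-case constants; roughly (x86_64 values, assuming PTE_SHIFT == 9):

    page_level_size(PG_LEVEL_2M); /* == PMD_PAGE_SIZE == 1UL << 21 */
    page_level_mask(PG_LEVEL_2M); /* == PMD_PAGE_MASK */
    page_level_size(PG_LEVEL_1G); /* == PUD_PAGE_SIZE == 1UL << 30 */
    page_level_mask(PG_LEVEL_1G); /* == PUD_PAGE_MASK */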



* [PATCH 4/5] create slow_virt_to_phys()
From: Dave Hansen @ 2013-01-22 21:24 UTC
  To: linux-kernel
  Cc: linux-mm, Gleb Natapov, H. Peter Anvin, x86, Marcelo Tosatti,
	Rik van Riel, Dave Hansen


This is necessary because __pa() does not work on some kinds of
memory, like vmalloc() or the alloc_remap() areas on 32-bit
NUMA systems.  We have some functions to do conversions _like_
this in the vmalloc() code (like vmalloc_to_page()), but they
do not work on sizes other than 4k pages.  We would potentially
need to be able to handle all the page sizes that we use for
the kernel linear mapping (4k, 2M, 1G).

In practice, on 32-bit NUMA systems, the percpu areas get stuck
in the alloc_remap() area.  Any __pa() call on them will break
and basically return garbage.

This patch introduces a new function slow_virt_to_phys(), which
walks the kernel page tables on x86 and should do precisely
the same logical thing as __pa(), but actually work on a wider
range of memory.  It should work on the normal linear mapping,
vmalloc(), kmap(), etc...

Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com>
Acked-by: Rik van Riel <riel@redhat.com>
---

 linux-2.6.git-dave/arch/x86/include/asm/pgtable_types.h |    1 
 linux-2.6.git-dave/arch/x86/mm/pageattr.c               |   31 ++++++++++++++++
 2 files changed, 32 insertions(+)

diff -puN arch/x86/include/asm/pgtable_types.h~create-slow_virt_to_phys arch/x86/include/asm/pgtable_types.h
--- linux-2.6.git/arch/x86/include/asm/pgtable_types.h~create-slow_virt_to_phys	2013-01-22 13:17:16.100314776 -0800
+++ linux-2.6.git-dave/arch/x86/include/asm/pgtable_types.h	2013-01-22 13:17:16.104314809 -0800
@@ -352,6 +352,7 @@ static inline void update_page_count(int
  * as a pte too.
  */
 extern pte_t *lookup_address(unsigned long address, unsigned int *level);
+extern phys_addr_t slow_virt_to_phys(void *__address);
 
 #endif	/* !__ASSEMBLY__ */
 
diff -puN arch/x86/mm/pageattr.c~create-slow_virt_to_phys arch/x86/mm/pageattr.c
--- linux-2.6.git/arch/x86/mm/pageattr.c~create-slow_virt_to_phys	2013-01-22 13:17:16.100314776 -0800
+++ linux-2.6.git-dave/arch/x86/mm/pageattr.c	2013-01-22 13:17:16.104314809 -0800
@@ -364,6 +364,37 @@ pte_t *lookup_address(unsigned long addr
 EXPORT_SYMBOL_GPL(lookup_address);
 
 /*
+ * This is necessary because __pa() does not work on some
+ * kinds of memory, like vmalloc() or the alloc_remap()
+ * areas on 32-bit NUMA systems.  The percpu areas can
+ * end up in this kind of memory, for instance.
+ *
+ * This could be optimized, but it is only intended to be
+ * used at initialization time, and keeping it
+ * unoptimized should increase the testing coverage for
+ * the more obscure platforms.
+ */
+phys_addr_t slow_virt_to_phys(void *__virt_addr)
+{
+	unsigned long virt_addr = (unsigned long)__virt_addr;
+	phys_addr_t phys_addr;
+	unsigned long offset;
+	enum pg_level level;
+	unsigned long psize;
+	unsigned long pmask;
+	pte_t *pte;
+
+	pte = lookup_address(virt_addr, &level);
+	BUG_ON(!pte);
+	psize = page_level_size(level);
+	pmask = page_level_mask(level);
+	offset = virt_addr & ~pmask;
+	phys_addr = pte_pfn(*pte) << PAGE_SHIFT;
+	return (phys_addr | offset);
+}
+EXPORT_SYMBOL_GPL(slow_virt_to_phys);
+
+/*
  * Set the new pmd in all the pgds we know about:
  */
 static void __set_pmd_pte(pte_t *kpte, unsigned long address, pte_t pte)
_
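
A rough usage sketch (the callers here are hypothetical): the result
agrees with __pa() wherever __pa() is legal, but also covers mappings
that __pa() must not touch:

    void *lin = kmalloc(64, GFP_KERNEL); /* linear map: __pa() is legal */
    void *vm  = vmalloc(64);             /* vmalloc: __pa() is not      */
    phys_addr_t pa;

    BUG_ON(slow_virt_to_phys(lin) != __pa(lin)); /* same answer here */
    pa = slow_virt_to_phys(vm); /* a valid physical address, found by
                                 * walking the kernel pagetables */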



* [PATCH 5/5] fix kvm's use of __pa() on percpu areas
From: Dave Hansen @ 2013-01-22 21:24 UTC
  To: linux-kernel
  Cc: linux-mm, Gleb Natapov, H. Peter Anvin, x86, Marcelo Tosatti,
	Rik van Riel, Dave Hansen


In short, it is illegal to call __pa() on an address holding
a percpu variable.  This replaces those __pa() calls with
slow_virt_to_phys().  All of the cases in this patch are
in boot time (or CPU hotplug time at worst) code, so the
slow pagetable walking in slow_virt_to_phys() is not expected
to have a performance impact.

The times when this actually matters are pretty obscure
(certain 32-bit NUMA systems), but it _does_ happen.  It is
important to keep KVM guests working on these systems because
the real hardware is getting harder and harder to find.

This bug first manifested as a plain hang at boot
after this message:

	CPU 0 irqstacks, hard=f3018000 soft=f301a000

or, sometimes, it would actually make it out to the console:

[    0.000000] BUG: unable to handle kernel paging request at ffffffff

I eventually traced it down to the KVM async pagefault code.
This can be worked around by disabling that code either at
compile-time, or on the kernel command-line.

The kvm async pagefault code was injecting page faults into
the guest, which the guest misinterpreted because its
"reason" was not being properly sent from the host.

The guest passes the physical address of a per-cpu async page
fault structure via an MSR to the host.  Since __pa() is
broken on percpu data, the physical address it sent was
basically bogus and the host went scribbling on random data.
The guest never saw the real reason for the page fault (it
was injected by the host), assumed that the kernel had taken
a _real_ page fault, and panic()'d.  The behavior varied,
though, depending on what got corrupted by the bad write.

Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com>
Acked-by: Rik van Riel <riel@redhat.com>
---

 linux-2.6.git-dave/arch/x86/kernel/kvm.c      |    9 +++++----
 linux-2.6.git-dave/arch/x86/kernel/kvmclock.c |    4 ++--
 2 files changed, 7 insertions(+), 6 deletions(-)

diff -puN arch/x86/kernel/kvm.c~fix-kvm-__pa-use-on-percpu-areas arch/x86/kernel/kvm.c
--- linux-2.6.git/arch/x86/kernel/kvm.c~fix-kvm-__pa-use-on-percpu-areas	2013-01-22 13:17:16.424317475 -0800
+++ linux-2.6.git-dave/arch/x86/kernel/kvm.c	2013-01-22 13:17:16.432317541 -0800
@@ -289,9 +289,9 @@ static void kvm_register_steal_time(void
 
 	memset(st, 0, sizeof(*st));
 
-	wrmsrl(MSR_KVM_STEAL_TIME, (__pa(st) | KVM_MSR_ENABLED));
+	wrmsrl(MSR_KVM_STEAL_TIME, (slow_virt_to_phys(st) | KVM_MSR_ENABLED));
 	printk(KERN_INFO "kvm-stealtime: cpu %d, msr %lx\n",
-		cpu, __pa(st));
+		cpu, slow_virt_to_phys(st));
 }
 
 static DEFINE_PER_CPU(unsigned long, kvm_apic_eoi) = KVM_PV_EOI_DISABLED;
@@ -316,7 +316,7 @@ void __cpuinit kvm_guest_cpu_init(void)
 		return;
 
 	if (kvm_para_has_feature(KVM_FEATURE_ASYNC_PF) && kvmapf) {
-		u64 pa = __pa(&__get_cpu_var(apf_reason));
+		u64 pa = slow_virt_to_phys(&__get_cpu_var(apf_reason));
 
 #ifdef CONFIG_PREEMPT
 		pa |= KVM_ASYNC_PF_SEND_ALWAYS;
@@ -332,7 +332,8 @@ void __cpuinit kvm_guest_cpu_init(void)
 		/* Size alignment is implied but just to make it explicit. */
 		BUILD_BUG_ON(__alignof__(kvm_apic_eoi) < 4);
 		__get_cpu_var(kvm_apic_eoi) = 0;
-		pa = __pa(&__get_cpu_var(kvm_apic_eoi)) | KVM_MSR_ENABLED;
+		pa = slow_virt_to_phys(&__get_cpu_var(kvm_apic_eoi))
+			| KVM_MSR_ENABLED;
 		wrmsrl(MSR_KVM_PV_EOI_EN, pa);
 	}
 
diff -puN arch/x86/kernel/kvmclock.c~fix-kvm-__pa-use-on-percpu-areas arch/x86/kernel/kvmclock.c
--- linux-2.6.git/arch/x86/kernel/kvmclock.c~fix-kvm-__pa-use-on-percpu-areas	2013-01-22 13:17:16.428317508 -0800
+++ linux-2.6.git-dave/arch/x86/kernel/kvmclock.c	2013-01-22 13:17:16.432317541 -0800
@@ -162,8 +162,8 @@ int kvm_register_clock(char *txt)
 	int low, high, ret;
 	struct pvclock_vcpu_time_info *src = &hv_clock[cpu].pvti;
 
-	low = (int)__pa(src) | 1;
-	high = ((u64)__pa(src) >> 32);
+	low = (int)slow_virt_to_phys(src) | 1;
+	high = ((u64)slow_virt_to_phys(src) >> 32);
 	ret = native_write_msr_safe(msr_kvm_system_time, low, high);
 	printk(KERN_INFO "kvm-clock: cpu %d, msr %x:%x, %s\n",
 	       cpu, high, low, txt);
_
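
For reference, the kvmclock conversion preserves the existing MSR
payload layout; a sketch of the encoding (variable names mirror
kvm_register_clock() above):

    /* msr_kvm_system_time payload: physical address of the pvclock
     * structure with the enable flag in bit 0, split across the
     * low and high 32-bit halves of the MSR write. */
    phys_addr_t pa = slow_virt_to_phys(src);
    int low  = (int)pa | 1;   /* address bits 31..0, enable bit set */
    int high = (u64)pa >> 32; /* address bits 63..32 */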



* Re: [PATCH 5/5] fix kvm's use of __pa() on percpu areas
From: Marcelo Tosatti @ 2013-01-23  0:08 UTC
  To: Dave Hansen
  Cc: linux-kernel, linux-mm, Gleb Natapov, H. Peter Anvin, x86, Rik van Riel

On Tue, Jan 22, 2013 at 01:24:35PM -0800, Dave Hansen wrote:
> 
> In short, it is illegal to call __pa() on an address holding
> a percpu variable.  This replaces those __pa() calls with
> slow_virt_to_phys().  All of the cases in this patch are
> in boot time (or CPU hotplug time at worst) code, so the
> slow pagetable walking in slow_virt_to_phys() is not expected
> to have a performance impact.
> 
> The times when this actually matters are pretty obscure
> (certain 32-bit NUMA systems), but it _does_ happen.  It is
> important to keep KVM guests working on these systems because
> the real hardware is getting harder and harder to find.
> 
> This bug first manifested as a plain hang at boot
> after this message:
> 
> 	CPU 0 irqstacks, hard=f3018000 soft=f301a000
> 
> or, sometimes, it would actually make it out to the console:
> 
> [    0.000000] BUG: unable to handle kernel paging request at ffffffff
> 
> I eventually traced it down to the KVM async pagefault code.
> This can be worked around by disabling that code either at
> compile-time, or on the kernel command-line.
> 
> The kvm async pagefault code was injecting page faults into
> the guest, which the guest misinterpreted because its
> "reason" was not being properly sent from the host.
> 
> The guest passes the physical address of a per-cpu async page
> fault structure via an MSR to the host.  Since __pa() is
> broken on percpu data, the physical address it sent was
> basically bogus and the host went scribbling on random data.
> The guest never saw the real reason for the page fault (it
> was injected by the host), assumed that the kernel had taken
> a _real_ page fault, and panic()'d.  The behavior varied,
> though, depending on what got corrupted by the bad write.
> 
> Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com>
> Acked-by: Rik van Riel <riel@redhat.com>
> ---
> 
>  linux-2.6.git-dave/arch/x86/kernel/kvm.c      |    9 +++++----
>  linux-2.6.git-dave/arch/x86/kernel/kvmclock.c |    4 ++--
>  2 files changed, 7 insertions(+), 6 deletions(-)
 

Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com>



* Re: [PATCH 0/5] [v3] fix illegal use of __pa() in KVM code
From: Dave Hansen @ 2013-01-25 23:15 UTC
  Cc: linux-kernel, linux-mm, Gleb Natapov, H.Peter Anvin, x86,
	Marcelo Tosatti, Rik van Riel

On 01/22/2013 01:24 PM, Dave Hansen wrote:
> This series fixes a hard-to-debug early boot hang on 32-bit
> NUMA systems.  It adds coverage to the debugging code,
> adds some helpers, and eventually fixes the original bug I
> was hitting.

I got one more reviewed-by on this set, but otherwise no more comments.
Could these get pulled into the x86 tree for merging in 3.9?



* [tip:x86/mm] x86, mm: Make DEBUG_VIRTUAL work earlier in boot
From: tip-bot for Dave Hansen @ 2013-01-26  1:51 UTC
  To: linux-tip-commits; +Cc: linux-kernel, hpa, mingo, dave, tglx, hpa

Commit-ID:  a25b9316841c5afa226f8f70a457861b35276a92
Gitweb:     http://git.kernel.org/tip/a25b9316841c5afa226f8f70a457861b35276a92
Author:     Dave Hansen <dave@linux.vnet.ibm.com>
AuthorDate: Tue, 22 Jan 2013 13:24:30 -0800
Committer:  H. Peter Anvin <hpa@linux.intel.com>
CommitDate: Fri, 25 Jan 2013 16:33:22 -0800

x86, mm: Make DEBUG_VIRTUAL work earlier in boot

The KVM code has some repeated bugs in it around use of __pa() on
per-cpu data.  Those data are not in an area on which using
__pa() is valid.  However, they are also called early enough in
boot that __vmalloc_start_set is not set, and thus the
CONFIG_DEBUG_VIRTUAL debugging does not catch them.

This adds a check to also verify __pa() calls against max_low_pfn,
which we can use earlier in boot than is_vmalloc_addr().  However,
if we are super-early in boot, max_low_pfn=0 and this will trip
on every call, so also make sure that max_low_pfn is set before
we try to use it.

With this patch applied, CONFIG_DEBUG_VIRTUAL will actually
catch the bug I was chasing (and fix later in this series).

I'd love to find a generic way so that any __pa() call on percpu
areas could do a BUG_ON(), but there don't appear to be any nice
and easy ways to check if an address is a percpu one.  Anybody
have ideas on a way to do this?

Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com>
Link: http://lkml.kernel.org/r/20130122212430.F46F8159@kernel.stglabs.ibm.com
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
---
 arch/x86/mm/numa.c     | 2 +-
 arch/x86/mm/pat.c      | 4 ++--
 arch/x86/mm/physaddr.c | 9 ++++++++-
 3 files changed, 11 insertions(+), 4 deletions(-)

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 2d125be..76604eb 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -219,7 +219,7 @@ static void __init setup_node_data(int nid, u64 start, u64 end)
 	 */
 	nd = alloc_remap(nid, nd_size);
 	if (nd) {
-		nd_pa = __pa(nd);
+		nd_pa = __phys_addr_nodebug(nd);
 		remapped = true;
 	} else {
 		nd_pa = memblock_alloc_nid(nd_size, SMP_CACHE_BYTES, nid);
diff --git a/arch/x86/mm/pat.c b/arch/x86/mm/pat.c
index 0eb572e..2610bd9 100644
--- a/arch/x86/mm/pat.c
+++ b/arch/x86/mm/pat.c
@@ -560,10 +560,10 @@ int kernel_map_sync_memtype(u64 base, unsigned long size, unsigned long flags)
 {
 	unsigned long id_sz;
 
-	if (base >= __pa(high_memory))
+	if (base > __pa(high_memory-1))
 		return 0;
 
-	id_sz = (__pa(high_memory) < base + size) ?
+	id_sz = (__pa(high_memory-1) <= base + size) ?
 				__pa(high_memory) - base :
 				size;
 
diff --git a/arch/x86/mm/physaddr.c b/arch/x86/mm/physaddr.c
index c73fedd..e666cbb 100644
--- a/arch/x86/mm/physaddr.c
+++ b/arch/x86/mm/physaddr.c
@@ -1,3 +1,4 @@
+#include <linux/bootmem.h>
 #include <linux/mmdebug.h>
 #include <linux/module.h>
 #include <linux/mm.h>
@@ -68,10 +69,16 @@ EXPORT_SYMBOL(__virt_addr_valid);
 #ifdef CONFIG_DEBUG_VIRTUAL
 unsigned long __phys_addr(unsigned long x)
 {
+	unsigned long phys_addr = x - PAGE_OFFSET;
 	/* VMALLOC_* aren't constants  */
 	VIRTUAL_BUG_ON(x < PAGE_OFFSET);
 	VIRTUAL_BUG_ON(__vmalloc_start_set && is_vmalloc_addr((void *) x));
-	return x - PAGE_OFFSET;
+	/* max_low_pfn is set early, but not _that_ early */
+	if (max_low_pfn) {
+		VIRTUAL_BUG_ON((phys_addr >> PAGE_SHIFT) > max_low_pfn);
+		BUG_ON(slow_virt_to_phys((void *)x) != phys_addr);
+	}
+	return phys_addr;
 }
 EXPORT_SYMBOL(__phys_addr);
 #endif


* [tip:x86/mm] x86, mm: Pagetable level size/shift/mask helpers
From: tip-bot for Dave Hansen @ 2013-01-26  1:52 UTC
  To: linux-tip-commits; +Cc: linux-kernel, hpa, mingo, dave, tglx, hpa

Commit-ID:  4cbeb51b860c57ba8b2ae50c4016ee7a41f5fbd5
Gitweb:     http://git.kernel.org/tip/4cbeb51b860c57ba8b2ae50c4016ee7a41f5fbd5
Author:     Dave Hansen <dave@linux.vnet.ibm.com>
AuthorDate: Tue, 22 Jan 2013 13:24:31 -0800
Committer:  H. Peter Anvin <hpa@linux.intel.com>
CommitDate: Fri, 25 Jan 2013 16:33:22 -0800

x86, mm: Pagetable level size/shift/mask helpers

I plan to use lookup_address() to walk the kernel pagetables
in a later patch.  It returns a "pte" and the level in the
pagetables where the "pte" was found.  The level is just an
enum and needs to be converted to a useful value in order to
do address calculations with it.  These helpers will be used
in at least two places.

This also gives the anonymous enum a real name so that no one
gets confused about what they should be passing in to these
helpers.

"PTE_SHIFT" was chosen for naming consistency with the other
pagetable levels (PGD/PUD/PMD_SHIFT).

Cc: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com>
Link: http://lkml.kernel.org/r/20130122212431.405D3A8C@kernel.stglabs.ibm.com
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
---
 arch/x86/include/asm/pgtable.h       | 14 ++++++++++++++
 arch/x86/include/asm/pgtable_types.h |  2 +-
 2 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 5199db2..bc28e6f 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -390,6 +390,7 @@ pte_t *populate_extra_pte(unsigned long vaddr);
 
 #ifndef __ASSEMBLY__
 #include <linux/mm_types.h>
+#include <linux/log2.h>
 
 static inline int pte_none(pte_t pte)
 {
@@ -781,6 +782,19 @@ static inline void clone_pgd_range(pgd_t *dst, pgd_t *src, int count)
        memcpy(dst, src, count * sizeof(pgd_t));
 }
 
+#define PTE_SHIFT ilog2(PTRS_PER_PTE)
+static inline int page_level_shift(enum pg_level level)
+{
+	return (PAGE_SHIFT - PTE_SHIFT) + level * PTE_SHIFT;
+}
+static inline unsigned long page_level_size(enum pg_level level)
+{
+	return 1UL << page_level_shift(level);
+}
+static inline unsigned long page_level_mask(enum pg_level level)
+{
+	return ~(page_level_size(level) - 1);
+}
 
 #include <asm-generic/pgtable.h>
 #endif	/* __ASSEMBLY__ */
diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index 3c32db8..6c297e7 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -331,7 +331,7 @@ extern void native_pagetable_init(void);
 struct seq_file;
 extern void arch_report_meminfo(struct seq_file *m);
 
-enum {
+enum pg_level {
 	PG_LEVEL_NONE,
 	PG_LEVEL_4K,
 	PG_LEVEL_2M,


* [tip:x86/mm] x86, mm: Use new pagetable helpers in try_preserve_large_page()
From: tip-bot for Dave Hansen @ 2013-01-26  1:53 UTC
  To: linux-tip-commits; +Cc: linux-kernel, hpa, mingo, dave, tglx, hpa

Commit-ID:  f3c4fbb68e93b10c781c0cc462a9d80770244da6
Gitweb:     http://git.kernel.org/tip/f3c4fbb68e93b10c781c0cc462a9d80770244da6
Author:     Dave Hansen <dave@linux.vnet.ibm.com>
AuthorDate: Tue, 22 Jan 2013 13:24:32 -0800
Committer:  H. Peter Anvin <hpa@linux.intel.com>
CommitDate: Fri, 25 Jan 2013 16:33:23 -0800

x86, mm: Use new pagetable helpers in try_preserve_large_page()

try_preserve_large_page() can be slightly simplified by using
the new page_level_*() helpers.  This also moves the 'level'
over to the new pg_level enum type.

Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com>
Link: http://lkml.kernel.org/r/20130122212432.14F3D993@kernel.stglabs.ibm.com
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
---
 arch/x86/mm/pageattr.c | 11 ++++-------
 1 file changed, 4 insertions(+), 7 deletions(-)

diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c
index 40f92f3..2a5c9ab 100644
--- a/arch/x86/mm/pageattr.c
+++ b/arch/x86/mm/pageattr.c
@@ -396,7 +396,7 @@ try_preserve_large_page(pte_t *kpte, unsigned long address,
 	pte_t new_pte, old_pte, *tmp;
 	pgprot_t old_prot, new_prot, req_prot;
 	int i, do_split = 1;
-	unsigned int level;
+	enum pg_level level;
 
 	if (cpa->force_split)
 		return 1;
@@ -412,15 +412,12 @@ try_preserve_large_page(pte_t *kpte, unsigned long address,
 
 	switch (level) {
 	case PG_LEVEL_2M:
-		psize = PMD_PAGE_SIZE;
-		pmask = PMD_PAGE_MASK;
-		break;
 #ifdef CONFIG_X86_64
 	case PG_LEVEL_1G:
-		psize = PUD_PAGE_SIZE;
-		pmask = PUD_PAGE_MASK;
-		break;
 #endif
+		psize = page_level_size(level);
+		pmask = page_level_mask(level);
+		break;
 	default:
 		do_split = -EINVAL;
 		goto out_unlock;


* [tip:x86/mm] x86, mm: Create slow_virt_to_phys()
From: tip-bot for Dave Hansen @ 2013-01-26  1:54 UTC
  To: linux-tip-commits; +Cc: riel, linux-kernel, hpa, mingo, dave, tglx, hpa

Commit-ID:  d765653445129b7c476758040e3079480775f80a
Gitweb:     http://git.kernel.org/tip/d765653445129b7c476758040e3079480775f80a
Author:     Dave Hansen <dave@linux.vnet.ibm.com>
AuthorDate: Tue, 22 Jan 2013 13:24:33 -0800
Committer:  H. Peter Anvin <hpa@linux.intel.com>
CommitDate: Fri, 25 Jan 2013 16:33:23 -0800

x86, mm: Create slow_virt_to_phys()

This is necessary because __pa() does not work on some kinds of
memory, like vmalloc() or the alloc_remap() areas on 32-bit
NUMA systems.  We have some functions to do conversions _like_
this in the vmalloc() code (like vmalloc_to_page()), but they
do not work on sizes other than 4k pages.  We would potentially
need to be able to handle all the page sizes that we use for
the kernel linear mapping (4k, 2M, 1G).

In practice, on 32-bit NUMA systems, the percpu areas get stuck
in the alloc_remap() area.  Any __pa() call on them will break
and basically return garbage.

This patch introduces a new function slow_virt_to_phys(), which
walks the kernel page tables on x86 and should do precisely
the same logical thing as __pa(), but actually work on a wider
range of memory.  It should work on the normal linear mapping,
vmalloc(), kmap(), etc...

Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com>
Link: http://lkml.kernel.org/r/20130122212433.4D1FCA62@kernel.stglabs.ibm.com
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
---
 arch/x86/include/asm/pgtable_types.h |  1 +
 arch/x86/mm/pageattr.c               | 31 +++++++++++++++++++++++++++++++
 2 files changed, 32 insertions(+)

diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index 6c297e7..9f82690 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -352,6 +352,7 @@ static inline void update_page_count(int level, unsigned long pages) { }
  * as a pte too.
  */
 extern pte_t *lookup_address(unsigned long address, unsigned int *level);
+extern phys_addr_t slow_virt_to_phys(void *__address);
 
 #endif	/* !__ASSEMBLY__ */
 
diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c
index 2a5c9ab..6d13d2a 100644
--- a/arch/x86/mm/pageattr.c
+++ b/arch/x86/mm/pageattr.c
@@ -364,6 +364,37 @@ pte_t *lookup_address(unsigned long address, unsigned int *level)
 EXPORT_SYMBOL_GPL(lookup_address);
 
 /*
+ * This is necessary because __pa() does not work on some
+ * kinds of memory, like vmalloc() or the alloc_remap()
+ * areas on 32-bit NUMA systems.  The percpu areas can
+ * end up in this kind of memory, for instance.
+ *
+ * This could be optimized, but it is only intended to be
+ * used at initialization time, and keeping it
+ * unoptimized should increase the testing coverage for
+ * the more obscure platforms.
+ */
+phys_addr_t slow_virt_to_phys(void *__virt_addr)
+{
+	unsigned long virt_addr = (unsigned long)__virt_addr;
+	phys_addr_t phys_addr;
+	unsigned long offset;
+	enum pg_level level;
+	unsigned long psize;
+	unsigned long pmask;
+	pte_t *pte;
+
+	pte = lookup_address(virt_addr, &level);
+	BUG_ON(!pte);
+	psize = page_level_size(level);
+	pmask = page_level_mask(level);
+	offset = virt_addr & ~pmask;
+	phys_addr = pte_pfn(*pte) << PAGE_SHIFT;
+	return (phys_addr | offset);
+}
+EXPORT_SYMBOL_GPL(slow_virt_to_phys);
+
+/*
  * Set the new pmd in all the pgds we know about:
  */
 static void __set_pmd_pte(pte_t *kpte, unsigned long address, pte_t pte)


* [tip:x86/mm] x86, kvm: Fix kvm's use of __pa() on percpu areas
From: tip-bot for Dave Hansen @ 2013-01-26  1:56 UTC
  To: linux-tip-commits
  Cc: riel, linux-kernel, hpa, mingo, mtosatti, dave, tglx, hpa

Commit-ID:  5dfd486c4750c9278c63fa96e6e85bdd2fb58e9d
Gitweb:     http://git.kernel.org/tip/5dfd486c4750c9278c63fa96e6e85bdd2fb58e9d
Author:     Dave Hansen <dave@linux.vnet.ibm.com>
AuthorDate: Tue, 22 Jan 2013 13:24:35 -0800
Committer:  H. Peter Anvin <hpa@linux.intel.com>
CommitDate: Fri, 25 Jan 2013 16:34:55 -0800

x86, kvm: Fix kvm's use of __pa() on percpu areas

In short, it is illegal to call __pa() on an address holding
a percpu variable.  This replaces those __pa() calls with
slow_virt_to_phys().  All of the cases in this patch are
in boot time (or CPU hotplug time at worst) code, so the
slow pagetable walking in slow_virt_to_phys() is not expected
to have a performance impact.

The times when this actually matters are pretty obscure
(certain 32-bit NUMA systems), but it _does_ happen.  It is
important to keep KVM guests working on these systems because
the real hardware is getting harder and harder to find.

This bug first manifested as a plain hang at boot
after this message:

	CPU 0 irqstacks, hard=f3018000 soft=f301a000

or, sometimes, it would actually make it out to the console:

[    0.000000] BUG: unable to handle kernel paging request at ffffffff

I eventually traced it down to the KVM async pagefault code.
This can be worked around by disabling that code either at
compile-time, or on the kernel command-line.

The kvm async pagefault code was injecting page faults into
the guest, which the guest misinterpreted because its
"reason" was not being properly sent from the host.

The guest passes the physical address of a per-cpu async page
fault structure to the host via an MSR.  Since __pa() is
broken on percpu data, the physical address it sent was
basically bogus and the host went scribbling on random data.
The guest never saw the real reason for the page fault (it
was injected by the host), assumed that the kernel had taken
a _real_ page fault, and panic()'d.  The behavior varied,
though, depending on what got corrupted by the bad write.

Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com>
Link: http://lkml.kernel.org/r/20130122212435.4905663F@kernel.stglabs.ibm.com
Acked-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
---
 arch/x86/kernel/kvm.c      | 9 +++++----
 arch/x86/kernel/kvmclock.c | 4 ++--
 2 files changed, 7 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 9c2bd8b..aa7e58b 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -297,9 +297,9 @@ static void kvm_register_steal_time(void)
 
 	memset(st, 0, sizeof(*st));
 
-	wrmsrl(MSR_KVM_STEAL_TIME, (__pa(st) | KVM_MSR_ENABLED));
+	wrmsrl(MSR_KVM_STEAL_TIME, (slow_virt_to_phys(st) | KVM_MSR_ENABLED));
 	printk(KERN_INFO "kvm-stealtime: cpu %d, msr %lx\n",
-		cpu, __pa(st));
+		cpu, slow_virt_to_phys(st));
 }
 
 static DEFINE_PER_CPU(unsigned long, kvm_apic_eoi) = KVM_PV_EOI_DISABLED;
@@ -324,7 +324,7 @@ void __cpuinit kvm_guest_cpu_init(void)
 		return;
 
 	if (kvm_para_has_feature(KVM_FEATURE_ASYNC_PF) && kvmapf) {
-		u64 pa = __pa(&__get_cpu_var(apf_reason));
+		u64 pa = slow_virt_to_phys(&__get_cpu_var(apf_reason));
 
 #ifdef CONFIG_PREEMPT
 		pa |= KVM_ASYNC_PF_SEND_ALWAYS;
@@ -340,7 +340,8 @@ void __cpuinit kvm_guest_cpu_init(void)
 		/* Size alignment is implied but just to make it explicit. */
 		BUILD_BUG_ON(__alignof__(kvm_apic_eoi) < 4);
 		__get_cpu_var(kvm_apic_eoi) = 0;
-		pa = __pa(&__get_cpu_var(kvm_apic_eoi)) | KVM_MSR_ENABLED;
+		pa = slow_virt_to_phys(&__get_cpu_var(kvm_apic_eoi))
+			| KVM_MSR_ENABLED;
 		wrmsrl(MSR_KVM_PV_EOI_EN, pa);
 	}
 
diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index 220a360..9f966dc 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -162,8 +162,8 @@ int kvm_register_clock(char *txt)
 	int low, high, ret;
 	struct pvclock_vcpu_time_info *src = &hv_clock[cpu].pvti;
 
-	low = (int)__pa(src) | 1;
-	high = ((u64)__pa(src) >> 32);
+	low = (int)slow_virt_to_phys(src) | 1;
+	high = ((u64)slow_virt_to_phys(src) >> 32);
 	ret = native_write_msr_safe(msr_kvm_system_time, low, high);
 	printk(KERN_INFO "kvm-clock: cpu %d, msr %x:%x, %s\n",
 	       cpu, high, low, txt);
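
To make the underlying bug concrete: on x86, __pa() is (modulo the
DEBUG_VIRTUAL checks) a constant-offset translation that is only valid
for addresses inside the kernel linear mapping.  A hedged sketch of the
difference ('__pa_sketch' is an illustrative name, not the real macro;
'apf_reason' is the per-cpu variable touched in the hunk above):

	/* Roughly what __pa() computes on 32-bit: */
	#define __pa_sketch(x)	((unsigned long)(x) - PAGE_OFFSET)

	/*
	 * A percpu area placed by alloc_remap() does not sit at
	 * PAGE_OFFSET + phys, so the subtraction yields garbage,
	 * while a page-table walk still finds the real mapping:
	 */
	u64 bogus = __pa_sketch(&__get_cpu_var(apf_reason));
	u64 pa    = slow_virt_to_phys(&__get_cpu_var(apf_reason));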

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [PATCH 0/5] [v3] fix illegal use of __pa() in KVM code
  2013-01-22 21:24 ` Dave Hansen
@ 2013-02-24 21:28   ` Peter Hurley
  -1 siblings, 0 replies; 35+ messages in thread
From: Peter Hurley @ 2013-02-24 21:28 UTC (permalink / raw)
  To: Dave Hansen
  Cc: linux-kernel, linux-mm, Gleb Natapov, H. Peter Anvin, x86,
	Marcelo Tosatti, Rik van Riel


On Tue, 2013-01-22 at 13:24 -0800, Dave Hansen wrote:
> This series fixes a hard-to-debug early boot hang on 32-bit
> NUMA systems.  It adds coverage to the debugging code,
> adds some helpers, and eventually fixes the original bug I
> was hitting.

Hi Dave,

Now that alloc_remap() has been (or is being) removed, is most or all
of this being reverted?

I ask because I was fixing a different bug in KVM's para-virt clock and
saw part of this series (from [PATCH 5/5] fix kvm's use of __pa() on
percpu areas):

diff -puN arch/x86/kernel/kvmclock.c~fix-kvm-__pa-use-on-percpu-areas arch/x86/kernel/kvmclock.c
--- linux-2.6.git/arch/x86/kernel/kvmclock.c~fix-kvm-__pa-use-on-percpu-areas   2013-01-22 13:17:16.428317508 -0800
+++ linux-2.6.git-dave/arch/x86/kernel/kvmclock.c       2013-01-22 13:17:16.432317541 -0800
@@ -162,8 +162,8 @@ int kvm_register_clock(char *txt)
        int low, high, ret;
        struct pvclock_vcpu_time_info *src = &hv_clock[cpu].pvti;
 
-       low = (int)__pa(src) | 1;
-       high = ((u64)__pa(src) >> 32);
+       low = (int)slow_virt_to_phys(src) | 1;
+       high = ((u64)slow_virt_to_phys(src) >> 32);
        ret = native_write_msr_safe(msr_kvm_system_time, low, high);
        printk(KERN_INFO "kvm-clock: cpu %d, msr %x:%x, %s\n",
               cpu, high, low, txt);

which confused me because hv_clock is the __va of allocated physical
memory, not a per-cpu variable.

	mem = memblock_alloc(size, PAGE_SIZE);
	if (!mem)
		return;
	hv_clock = __va(mem);

So in short, my questions are:
1) is the slow_virt_to_phys() necessary anymore?
2) if yes, does it apply to the code above?
3) if yes, would you explain in more detail what the 32-bit NUMA mm is
doing, esp. wrt. when __va(__pa) is not identical across all cpus?

Regards,
Peter Hurley



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 0/5] [v3] fix illegal use of __pa() in KVM code
  2013-02-24 21:28   ` Peter Hurley
@ 2013-02-25  8:30     ` Gleb Natapov
  -1 siblings, 0 replies; 35+ messages in thread
From: Gleb Natapov @ 2013-02-25  8:30 UTC (permalink / raw)
  To: Peter Hurley
  Cc: Dave Hansen, linux-kernel, linux-mm, H. Peter Anvin, x86,
	Marcelo Tosatti, Rik van Riel

On Sun, Feb 24, 2013 at 04:28:58PM -0500, Peter Hurley wrote:
> 
> On Tue, 2013-01-22 at 13:24 -0800, Dave Hansen wrote:
> > This series fixes a hard-to-debug early boot hang on 32-bit
> > NUMA systems.  It adds coverage to the debugging code,
> > adds some helpers, and eventually fixes the original bug I
> > was hitting.
> 
> Hi Dave,
> 
> Now that the alloc_remap() has been/is being removed, is most/all of
> this being reverted?
> 
> I ask because I was fixing a different bug in KVM's para-virt clock and
> saw part of this series (from [PATCH 5/5] fix kvm's use of __pa() on
> percpu areas):
> 
> diff -puN arch/x86/kernel/kvmclock.c~fix-kvm-__pa-use-on-percpu-areas arch/x86/kernel/kvmclock.c
> --- linux-2.6.git/arch/x86/kernel/kvmclock.c~fix-kvm-__pa-use-on-percpu-areas   2013-01-22 13:17:16.428317508 -0800
> +++ linux-2.6.git-dave/arch/x86/kernel/kvmclock.c       2013-01-22 13:17:16.432317541 -0800
> @@ -162,8 +162,8 @@ int kvm_register_clock(char *txt)
>         int low, high, ret;
>         struct pvclock_vcpu_time_info *src = &hv_clock[cpu].pvti;
>  
> -       low = (int)__pa(src) | 1;
> -       high = ((u64)__pa(src) >> 32);
> +       low = (int)slow_virt_to_phys(src) | 1;
> +       high = ((u64)slow_virt_to_phys(src) >> 32);
>         ret = native_write_msr_safe(msr_kvm_system_time, low, high);
>         printk(KERN_INFO "kvm-clock: cpu %d, msr %x:%x, %s\n",
>                cpu, high, low, txt);
> 
I think this hunk was converted by mistake since the code does not use
per cpu allocations, but since __pa() and slow_virt_to_phys() should
return the same thing here it does not matter much.
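
To spell that identity out (a sketch only; 'size' is as in the kvmclock
setup code quoted below): memblock hands back low, linear-mapped memory,
so the linear-map translation and the page-table walk agree:

	phys_addr_t mem = memblock_alloc(size, PAGE_SIZE);
	void *hv_clock  = __va(mem);	/* low, linear-mapped memory */

	/* __va()/__pa() are inverses on the linear map, and the
	 * page-table walk sees the same mapping, so both hold: */
	BUG_ON(__pa(hv_clock) != mem);
	BUG_ON(slow_virt_to_phys(hv_clock) != mem);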

> which confused me because hv_clock is the __va of allocated physical
> memory, not a per-cpu variable.
> 
> 	mem = memblock_alloc(size, PAGE_SIZE);
> 	if (!mem)
> 		return;
> 	hv_clock = __va(mem);
> 
> So in short, my questions are:
> 1) is the slow_virt_to_phys() necessary anymore?
> 2) if yes, does it apply to the code above?
> 3) if yes, would you explain in more detail what the 32-bit NUMA mm is
> doing, esp. wrt. when __va(__pa) is not identical across all cpus?
> 
> Regards,
> Peter Hurley
> 

--
			Gleb.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 0/5] [v3] fix illegal use of __pa() in KVM code
  2013-02-24 21:28   ` Peter Hurley
@ 2013-02-25 14:42     ` Dave Hansen
  -1 siblings, 0 replies; 35+ messages in thread
From: Dave Hansen @ 2013-02-25 14:42 UTC (permalink / raw)
  To: Peter Hurley
  Cc: linux-kernel, linux-mm, Gleb Natapov, H. Peter Anvin, x86,
	Marcelo Tosatti, Rik van Riel

On 02/24/2013 01:28 PM, Peter Hurley wrote:
> Now that the alloc_remap() has been/is being removed, is most/all of
> this being reverted?

I _believe_ alloc_remap() is the only case where we actually remapped
low memory.  However, there is still other code that does __pa()
translations for percpu areas: per_cpu_ptr_to_phys().  I _think_ it's
still theoretically possible to get some percpu data in the vmalloc() area.
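
For reference, a minimal sketch of that helper's use (per_cpu_ptr_to_phys()
lives in mm/percpu.c; 'pv_time' is a hypothetical per-cpu variable):

	#include <linux/percpu.h>

	static DEFINE_PER_CPU(struct pvclock_vcpu_time_info, pv_time);

	static phys_addr_t pv_time_phys(int cpu)
	{
		/* Unlike a bare __pa(), this handles percpu chunks that
		 * live in the vmalloc() area as well as in the linear map: */
		return per_cpu_ptr_to_phys(per_cpu_ptr(&pv_time, cpu));
	}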

> So in short, my questions are:
> 1) is the slow_virt_to_phys() necessary anymore?

kvm_vcpu_arch has a

        struct pvclock_vcpu_time_info hv_clock;

and I believe I mistook the two 'hv_clock's for each other.  However,
this doesn't hurt anything, and the performance difference is probably
horribly tiny.


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 0/5] [v3] fix illegal use of __pa() in KVM code
  2013-02-25 14:42     ` Dave Hansen
@ 2013-02-26 12:45       ` Peter Hurley
  -1 siblings, 0 replies; 35+ messages in thread
From: Peter Hurley @ 2013-02-26 12:45 UTC (permalink / raw)
  To: Dave Hansen
  Cc: linux-kernel, linux-mm, Gleb Natapov, H. Peter Anvin, x86,
	Marcelo Tosatti, Rik van Riel

On Mon, 2013-02-25 at 06:42 -0800, Dave Hansen wrote:
> On 02/24/2013 01:28 PM, Peter Hurley wrote:
> > Now that the alloc_remap() has been/is being removed, is most/all of
> > this being reverted?
> 
> I _believe_ alloc_remap() is the only case where we actually remapped
> low memory.  However, there is still other code that does __pa()
> translations for percpu areas: per_cpu_ptr_to_phys().  I _think_ it's
> still theoretically possible to get some percpu data in the vmalloc() area.
> 
> > So in short, my questions are:
> > 1) is the slow_virt_to_phys() necessary anymore?

Ah, yep. Thanks for pointing out per_cpu_ptr_to_phys().

> kvm_vcpu_arch has a
> 
>         struct pvclock_vcpu_time_info hv_clock;
> 
> and I believe I mistook the two 'hv_clock's for each other.  However,
> this doesn't hurt anything, and the performance difference is probably
> horribly tiny.

OK.  It was confusing because the fixmap of that same phys memblock done
by pvclock was broken, and I couldn't understand why the hv_clock memblock
needed to be looked up per cpu.  Mystery solved.

Regards,
Peter Hurley



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 4/5] create slow_virt_to_phys()
  2013-01-21 18:08     ` H. Peter Anvin
@ 2013-01-21 18:18       ` Dave Hansen
  -1 siblings, 0 replies; 35+ messages in thread
From: Dave Hansen @ 2013-01-21 18:18 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: linux-kernel, linux-mm, Gleb Natapov, x86, Marcelo Tosatti, Rik van Riel

On 01/21/2013 10:08 AM, H. Peter Anvin wrote:
> Why are you initializing psize/pmask?

It's an artifact from the switch() that was there before.  I'll clean it up.
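
For reference, the cleaned-up declarations as they eventually appeared
in the v3 hunk earlier on this page:

	enum pg_level level;
	unsigned long psize;
	unsigned long pmask;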


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 4/5] create slow_virt_to_phys()
  2013-01-21 17:52   ` Dave Hansen
@ 2013-01-21 18:08     ` H. Peter Anvin
  -1 siblings, 0 replies; 35+ messages in thread
From: H. Peter Anvin @ 2013-01-21 18:08 UTC (permalink / raw)
  To: Dave Hansen, linux-kernel
  Cc: linux-mm, Gleb Natapov, x86, Marcelo Tosatti, Rik van Riel

Why are you initializing psize/pmask?

Dave Hansen <dave@linux.vnet.ibm.com> wrote:

>
>This is necessary because __pa() does not work on some kinds of
>memory, like vmalloc() or the alloc_remap() areas on 32-bit
>NUMA systems.  We have some functions to do conversions _like_
>this in the vmalloc() code (like vmalloc_to_page()), but they
>do not work on sizes other than 4k pages.  We would potentially
>need to be able to handle all the page sizes that we use for
>the kernel linear mapping (4k, 2M, 1G).
>
>In practice, on 32-bit NUMA systems, the percpu areas get stuck
>in the alloc_remap() area.  Any __pa() call on them will break
>and basically return garbage.
>
>This patch introduces a new function slow_virt_to_phys(), which
>walks the kernel page tables on x86 and should do precisely
>the same logical thing as __pa(), but actually work on a wider
>range of memory.  It should work on the normal linear mapping,
>vmalloc(), kmap(), etc...
>
>Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com>
>Acked-by: Rik van Riel <riel@redhat.com>
>---
>
> linux-2.6.git-dave/arch/x86/include/asm/pgtable_types.h |    1 
> linux-2.6.git-dave/arch/x86/mm/pageattr.c               |   31 ++++++++++++++++
> 2 files changed, 32 insertions(+)
>
>diff -puN arch/x86/include/asm/pgtable_types.h~create-slow_virt_to_phys arch/x86/include/asm/pgtable_types.h
>--- linux-2.6.git/arch/x86/include/asm/pgtable_types.h~create-slow_virt_to_phys	2013-01-17 10:22:26.590434129 -0800
>+++ linux-2.6.git-dave/arch/x86/include/asm/pgtable_types.h	2013-01-17 10:22:26.598434199 -0800
>@@ -352,6 +352,7 @@ static inline void update_page_count(int
>  * as a pte too.
>  */
> extern pte_t *lookup_address(unsigned long address, unsigned int *level);
>+extern phys_addr_t slow_virt_to_phys(void *__address);
> 
> #endif	/* !__ASSEMBLY__ */
> 
>diff -puN arch/x86/mm/pageattr.c~create-slow_virt_to_phys arch/x86/mm/pageattr.c
>--- linux-2.6.git/arch/x86/mm/pageattr.c~create-slow_virt_to_phys	2013-01-17 10:22:26.594434163 -0800
>+++ linux-2.6.git-dave/arch/x86/mm/pageattr.c	2013-01-17 10:22:26.598434199 -0800
>@@ -364,6 +364,37 @@ pte_t *lookup_address(unsigned long addr
> EXPORT_SYMBOL_GPL(lookup_address);
> 
> /*
>+ * This is necessary because __pa() does not work on some
>+ * kinds of memory, like vmalloc() or the alloc_remap()
>+ * areas on 32-bit NUMA systems.  The percpu areas can
>+ * end up in this kind of memory, for instance.
>+ *
>+ * This could be optimized, but it is only intended to be
>+ * used at inititalization time, and keeping it
>+ * unoptimized should increase the testing coverage for
>+ * the more obscure platforms.
>+ */
>+phys_addr_t slow_virt_to_phys(void *__virt_addr)
>+{
>+	unsigned long virt_addr = (unsigned long)__virt_addr;
>+	phys_addr_t phys_addr;
>+	unsigned long offset;
>+	unsigned int level = -1;
>+	unsigned long psize = 0;
>+	unsigned long pmask = 0;
>+	pte_t *pte;
>+
>+	pte = lookup_address(virt_addr, &level);
>+	BUG_ON(!pte);
>+	psize = page_level_size(level);
>+	pmask = page_level_mask(level);
>+	offset = virt_addr & ~pmask;
>+	phys_addr = pte_pfn(*pte) << PAGE_SHIFT;
>+	return (phys_addr | offset);
>+}
>+EXPORT_SYMBOL_GPL(slow_virt_to_phys);
>+
>+/*
>  * Set the new pmd in all the pgds we know about:
>  */
> static void __set_pmd_pte(pte_t *kpte, unsigned long address, pte_t pte)
>_

-- 
Sent from my mobile phone. Please excuse brevity and lack of formatting.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCH 4/5] create slow_virt_to_phys()
  2013-01-21 17:52 [PATCH 0/5] " Dave Hansen
@ 2013-01-21 17:52   ` Dave Hansen
  0 siblings, 0 replies; 35+ messages in thread
From: Dave Hansen @ 2013-01-21 17:52 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, Gleb Natapov, H. Peter Anvin, x86, Marcelo Tosatti,
	Rik van Riel, Dave Hansen


This is necessary because __pa() does not work on some kinds of
memory, like vmalloc() or the alloc_remap() areas on 32-bit
NUMA systems.  We have some functions to do conversions _like_
this in the vmalloc() code (like vmalloc_to_page()), but they
do not work on sizes other than 4k pages.  We would potentially
need to be able to handle all the page sizes that we use for
the kernel linear mapping (4k, 2M, 1G).

In practice, on 32-bit NUMA systems, the percpu areas get stuck
in the alloc_remap() area.  Any __pa() call on them will break
and basically return garbage.

This patch introduces a new function slow_virt_to_phys(), which
walks the kernel page tables on x86 and should do precisely
the same logical thing as __pa(), but actually work on a wider
range of memory.  It should work on the normal linear mapping,
vmalloc(), kmap(), etc...

Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com>
Acked-by: Rik van Riel <riel@redhat.com>
---

 linux-2.6.git-dave/arch/x86/include/asm/pgtable_types.h |    1 
 linux-2.6.git-dave/arch/x86/mm/pageattr.c               |   31 ++++++++++++++++
 2 files changed, 32 insertions(+)

diff -puN arch/x86/include/asm/pgtable_types.h~create-slow_virt_to_phys arch/x86/include/asm/pgtable_types.h
--- linux-2.6.git/arch/x86/include/asm/pgtable_types.h~create-slow_virt_to_phys	2013-01-17 10:22:26.590434129 -0800
+++ linux-2.6.git-dave/arch/x86/include/asm/pgtable_types.h	2013-01-17 10:22:26.598434199 -0800
@@ -352,6 +352,7 @@ static inline void update_page_count(int
  * as a pte too.
  */
 extern pte_t *lookup_address(unsigned long address, unsigned int *level);
+extern phys_addr_t slow_virt_to_phys(void *__address);
 
 #endif	/* !__ASSEMBLY__ */
 
diff -puN arch/x86/mm/pageattr.c~create-slow_virt_to_phys arch/x86/mm/pageattr.c
--- linux-2.6.git/arch/x86/mm/pageattr.c~create-slow_virt_to_phys	2013-01-17 10:22:26.594434163 -0800
+++ linux-2.6.git-dave/arch/x86/mm/pageattr.c	2013-01-17 10:22:26.598434199 -0800
@@ -364,6 +364,37 @@ pte_t *lookup_address(unsigned long addr
 EXPORT_SYMBOL_GPL(lookup_address);
 
 /*
+ * This is necessary because __pa() does not work on some
+ * kinds of memory, like vmalloc() or the alloc_remap()
+ * areas on 32-bit NUMA systems.  The percpu areas can
+ * end up in this kind of memory, for instance.
+ *
+ * This could be optimized, but it is only intended to be
+ * used at initialization time, and keeping it
+ * unoptimized should increase the testing coverage for
+ * the more obscure platforms.
+ */
+phys_addr_t slow_virt_to_phys(void *__virt_addr)
+{
+	unsigned long virt_addr = (unsigned long)__virt_addr;
+	phys_addr_t phys_addr;
+	unsigned long offset;
+	unsigned int level = -1;
+	unsigned long psize = 0;
+	unsigned long pmask = 0;
+	pte_t *pte;
+
+	pte = lookup_address(virt_addr, &level);
+	BUG_ON(!pte);
+	psize = page_level_size(level);
+	pmask = page_level_mask(level);
+	offset = virt_addr & ~pmask;
+	phys_addr = pte_pfn(*pte) << PAGE_SHIFT;
+	return (phys_addr | offset);
+}
+EXPORT_SYMBOL_GPL(slow_virt_to_phys);
+
+/*
  * Set the new pmd in all the pgds we know about:
  */
 static void __set_pmd_pte(pte_t *kpte, unsigned long address, pte_t pte)
_
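
A hedged self-test sketch (not part of the patch; the function name and
__init placement are illustrative) showing the behavior the changelog
claims, i.e. agreement with __pa() on the linear map and correct
handling of 4k-mapped vmalloc() memory:

	#include <linux/init.h>
	#include <linux/slab.h>
	#include <linux/vmalloc.h>
	#include <asm/io.h>
	#include <asm/pgtable_types.h>

	static void __init check_slow_virt_to_phys(void)
	{
		void *lin = kmalloc(64, GFP_KERNEL);
		void *vm  = vmalloc(PAGE_SIZE);

		/* On the linear map the walker must agree with __pa(): */
		BUG_ON(slow_virt_to_phys(lin) != __pa(lin));

		/* vmalloc() memory is 4k-mapped; cross-check against the
		 * existing vmalloc_to_page() helper: */
		BUG_ON(slow_virt_to_phys(vm) !=
		       page_to_phys(vmalloc_to_page(vm)));

		kfree(lin);
		vfree(vm);
	}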


^ permalink raw reply	[flat|nested] 35+ messages in thread

end of thread, other threads:[~2013-02-26 12:45 UTC | newest]

Thread overview: 35+ messages
-- links below jump to the message on this page --
2013-01-22 21:24 [PATCH 0/5] [v3] fix illegal use of __pa() in KVM code Dave Hansen
2013-01-22 21:24 ` Dave Hansen
2013-01-22 21:24 ` [PATCH 1/5] make DEBUG_VIRTUAL work earlier in boot Dave Hansen
2013-01-22 21:24   ` Dave Hansen
2013-01-26  1:51   ` [tip:x86/mm] x86, mm: Make " tip-bot for Dave Hansen
2013-01-22 21:24 ` [PATCH 2/5] pagetable level size/shift/mask helpers Dave Hansen
2013-01-22 21:24   ` Dave Hansen
2013-01-26  1:52   ` [tip:x86/mm] x86, mm: Pagetable " tip-bot for Dave Hansen
2013-01-22 21:24 ` [PATCH 3/5] use new pagetable helpers in try_preserve_large_page() Dave Hansen
2013-01-22 21:24   ` Dave Hansen
2013-01-26  1:53   ` [tip:x86/mm] x86, mm: Use " tip-bot for Dave Hansen
2013-01-22 21:24 ` [PATCH 4/5] create slow_virt_to_phys() Dave Hansen
2013-01-22 21:24   ` Dave Hansen
2013-01-26  1:54   ` [tip:x86/mm] x86, mm: Create slow_virt_to_phys() tip-bot for Dave Hansen
2013-01-22 21:24 ` [PATCH 5/5] fix kvm's use of __pa() on percpu areas Dave Hansen
2013-01-22 21:24   ` Dave Hansen
2013-01-23  0:08   ` Marcelo Tosatti
2013-01-23  0:08     ` Marcelo Tosatti
2013-01-26  1:56   ` [tip:x86/mm] x86, kvm: Fix " tip-bot for Dave Hansen
2013-01-25 23:15 ` [PATCH 0/5] [v3] fix illegal use of __pa() in KVM code Dave Hansen
2013-01-25 23:15   ` Dave Hansen
2013-02-24 21:28 ` Peter Hurley
2013-02-24 21:28   ` Peter Hurley
2013-02-25  8:30   ` Gleb Natapov
2013-02-25  8:30     ` Gleb Natapov
2013-02-25 14:42   ` Dave Hansen
2013-02-25 14:42     ` Dave Hansen
2013-02-26 12:45     ` Peter Hurley
2013-02-26 12:45       ` Peter Hurley
  -- strict thread matches above, loose matches on Subject: below --
2013-01-21 17:52 [PATCH 0/5] " Dave Hansen
2013-01-21 17:52 ` [PATCH 4/5] create slow_virt_to_phys() Dave Hansen
2013-01-21 17:52   ` Dave Hansen
2013-01-21 18:08   ` H. Peter Anvin
2013-01-21 18:08     ` H. Peter Anvin
2013-01-21 18:18     ` Dave Hansen
2013-01-21 18:18       ` Dave Hansen
