* [PATCH 0/6] Boot-time switching between 4- and 5-level paging for 4.15, Part 1
From: Kirill A. Shutemov @ 2017-09-29 14:08 UTC
  To: Ingo Molnar, Linus Torvalds, x86, Thomas Gleixner, H. Peter Anvin
  Cc: Andrew Morton, Andy Lutomirski, Cyrill Gorcunov, Borislav Petkov,
	linux-mm, linux-kernel, Kirill A. Shutemov

Here is the first batch of patches that prepare the kernel for boot-time
switching between paging modes.

Please review and consider applying.

Andrey Ryabinin (1):
  x86/kasan: Use the same shadow offset for 4- and 5-level paging

Kirill A. Shutemov (5):
  mm/sparsemem: Allocate mem_section at runtime for SPARSEMEM_EXTREME
  mm/zsmalloc: Prepare to variable MAX_PHYSMEM_BITS
  x86/xen: Provide pre-built page tables only for XEN_PV and XEN_PVH
  x86/xen: Drop 5-level paging support code from XEN_PV code
  x86/boot/compressed/64: Detect and handle 5-level paging at boot-time

 Documentation/x86/x86_64/mm.txt             |   2 +-
 arch/x86/Kconfig                            |   1 -
 arch/x86/boot/compressed/head_64.S          |  26 ++++-
 arch/x86/include/asm/pgtable-3level_types.h |   1 +
 arch/x86/include/asm/pgtable_64_types.h     |   2 +
 arch/x86/kernel/Makefile                    |   3 +-
 arch/x86/kernel/head_64.S                   |  11 +-
 arch/x86/mm/kasan_init_64.c                 | 101 ++++++++++++++----
 arch/x86/xen/mmu_pv.c                       | 159 +++++++++++-----------------
 include/linux/mmzone.h                      |   6 +-
 mm/page_alloc.c                             |  10 ++
 mm/sparse.c                                 |  17 +--
 mm/zsmalloc.c                               |  13 +--
 13 files changed, 210 insertions(+), 142 deletions(-)

-- 
2.14.2

* [PATCH 1/6] mm/sparsemem: Allocate mem_section at runtime for SPARSEMEM_EXTREME
From: Kirill A. Shutemov @ 2017-09-29 14:08 UTC
  To: Ingo Molnar, Linus Torvalds, x86, Thomas Gleixner, H. Peter Anvin
  Cc: Andrew Morton, Andy Lutomirski, Cyrill Gorcunov, Borislav Petkov,
	linux-mm, linux-kernel, Kirill A. Shutemov

The size of the mem_section array depends on the size of the physical
address space.

In preparation for boot-time switching between paging modes on x86-64
we need to make the allocation of mem_section dynamic.

With CONFIG_NODES_SHIFT=10, the mem_section size is 32k for 4-level paging
and 2M for 5-level paging.
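
For reference, a back-of-envelope for where these numbers come from (a
sketch, assuming SECTION_SIZE_BITS == 27 and SECTIONS_PER_ROOT == 128 for
this configuration):

  /*
   * 4-level: MAX_PHYSMEM_BITS = 46
   *   NR_MEM_SECTIONS  = 1UL << (46 - 27) = 512K sections
   *   NR_SECTION_ROOTS = 512K / 128       = 4K roots
   *   root array of 8-byte pointers: 4K * 8   = 32k
   *
   * 5-level: MAX_PHYSMEM_BITS = 52
   *   NR_MEM_SECTIONS  = 1UL << (52 - 27) = 32M sections
   *   NR_SECTION_ROOTS = 32M / 128        = 256K roots
   *   root array of 8-byte pointers: 256K * 8 = 2M
   */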

The patch allocates the array on the first call to
sparse_memory_present_with_active_regions().

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 include/linux/mmzone.h |  6 +++++-
 mm/page_alloc.c        | 10 ++++++++++
 mm/sparse.c            | 17 +++++++++++------
 3 files changed, 26 insertions(+), 7 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 356a814e7c8e..a48b55fbb502 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -1144,13 +1144,17 @@ struct mem_section {
 #define SECTION_ROOT_MASK	(SECTIONS_PER_ROOT - 1)
 
 #ifdef CONFIG_SPARSEMEM_EXTREME
-extern struct mem_section *mem_section[NR_SECTION_ROOTS];
+extern struct mem_section **mem_section;
 #else
 extern struct mem_section mem_section[NR_SECTION_ROOTS][SECTIONS_PER_ROOT];
 #endif
 
 static inline struct mem_section *__nr_to_section(unsigned long nr)
 {
+#ifdef CONFIG_SPARSEMEM_EXTREME
+	if (!mem_section)
+		return NULL;
+#endif
 	if (!mem_section[SECTION_NR_TO_ROOT(nr)])
 		return NULL;
 	return &mem_section[SECTION_NR_TO_ROOT(nr)][nr & SECTION_ROOT_MASK];
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c841af88836a..8034651b916e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5645,6 +5645,16 @@ void __init sparse_memory_present_with_active_regions(int nid)
 	unsigned long start_pfn, end_pfn;
 	int i, this_nid;
 
+#ifdef CONFIG_SPARSEMEM_EXTREME
+	if (!mem_section) {
+		unsigned long size, align;
+
+		size = sizeof(struct mem_section) * NR_SECTION_ROOTS;
+		align = 1 << (INTERNODE_CACHE_SHIFT);
+		mem_section = memblock_virt_alloc(size, align);
+	}
+#endif
+
 	for_each_mem_pfn_range(i, nid, &start_pfn, &end_pfn, &this_nid)
 		memory_present(this_nid, start_pfn, end_pfn);
 }
diff --git a/mm/sparse.c b/mm/sparse.c
index 83b3bf6461af..b00a97398795 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -22,8 +22,7 @@
  * 1) mem_section	- memory sections, mem_map's for valid memory
  */
 #ifdef CONFIG_SPARSEMEM_EXTREME
-struct mem_section *mem_section[NR_SECTION_ROOTS]
-	____cacheline_internodealigned_in_smp;
+struct mem_section **mem_section;
 #else
 struct mem_section mem_section[NR_SECTION_ROOTS][SECTIONS_PER_ROOT]
 	____cacheline_internodealigned_in_smp;
@@ -100,7 +99,7 @@ static inline int sparse_index_init(unsigned long section_nr, int nid)
 int __section_nr(struct mem_section* ms)
 {
 	unsigned long root_nr;
-	struct mem_section* root;
+	struct mem_section *root = NULL;
 
 	for (root_nr = 0; root_nr < NR_SECTION_ROOTS; root_nr++) {
 		root = __nr_to_section(root_nr * SECTIONS_PER_ROOT);
@@ -111,7 +110,7 @@ int __section_nr(struct mem_section* ms)
 		     break;
 	}
 
-	VM_BUG_ON(root_nr == NR_SECTION_ROOTS);
+	VM_BUG_ON(!root);
 
 	return (root_nr * SECTIONS_PER_ROOT) + (ms - root);
 }
@@ -329,11 +328,17 @@ sparse_early_usemaps_alloc_pgdat_section(struct pglist_data *pgdat,
 static void __init check_usemap_section_nr(int nid, unsigned long *usemap)
 {
 	unsigned long usemap_snr, pgdat_snr;
-	static unsigned long old_usemap_snr = NR_MEM_SECTIONS;
-	static unsigned long old_pgdat_snr = NR_MEM_SECTIONS;
+	static unsigned long old_usemap_snr;
+	static unsigned long old_pgdat_snr;
 	struct pglist_data *pgdat = NODE_DATA(nid);
 	int usemap_nid;
 
+	/* First call */
+	if (!old_usemap_snr) {
+		old_usemap_snr = NR_MEM_SECTIONS;
+		old_pgdat_snr = NR_MEM_SECTIONS;
+	}
+
 	usemap_snr = pfn_to_section_nr(__pa(usemap) >> PAGE_SHIFT);
 	pgdat_snr = pfn_to_section_nr(__pa(pgdat) >> PAGE_SHIFT);
 	if (usemap_snr == pgdat_snr)
-- 
2.14.2

* [PATCH 2/6] mm/zsmalloc: Prepare to variable MAX_PHYSMEM_BITS
From: Kirill A. Shutemov @ 2017-09-29 14:08 UTC
  To: Ingo Molnar, Linus Torvalds, x86, Thomas Gleixner, H. Peter Anvin
  Cc: Andrew Morton, Andy Lutomirski, Cyrill Gorcunov, Borislav Petkov,
	linux-mm, linux-kernel, Kirill A. Shutemov, Minchan Kim,
	Nitin Gupta, Sergey Senozhatsky

With boot-time switching between paging modes we will have a variable
MAX_PHYSMEM_BITS.

Let's use the maximum value possible for the CONFIG_X86_5LEVEL=y
configuration to define the zsmalloc data structures.

The patch introduces MAX_POSSIBLE_PHYSMEM_BITS to cover this case.
It also suits well to handle the PAE special case.
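
For a sense of why the constant must be a compile-time maximum, here is a
sketch of how zsmalloc packs object handles (assuming PAGE_SHIFT == 12 on
x86-64; macro names as in mm/zsmalloc.c):

  /*
   * zsmalloc encodes an object location as <PFN, obj_idx> packed into
   * a single unsigned long handle:
   *
   *   _PFN_BITS      = MAX_POSSIBLE_PHYSMEM_BITS - PAGE_SHIFT
   *                  = 52 - 12 = 40 with CONFIG_X86_5LEVEL=y
   *   OBJ_INDEX_BITS = BITS_PER_LONG - _PFN_BITS - OBJ_TAG_BITS
   *                  = 64 - 40 - 1 = 23
   *
   * Deriving the PFN field from a boot-time MAX_PHYSMEM_BITS would
   * change the handle layout at runtime, so a compile-time maximum
   * is needed instead.
   */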

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
---
 arch/x86/include/asm/pgtable-3level_types.h |  1 +
 arch/x86/include/asm/pgtable_64_types.h     |  2 ++
 mm/zsmalloc.c                               | 13 +++++++------
 3 files changed, 10 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/pgtable-3level_types.h b/arch/x86/include/asm/pgtable-3level_types.h
index b8a4341faafa..3fe1d107a875 100644
--- a/arch/x86/include/asm/pgtable-3level_types.h
+++ b/arch/x86/include/asm/pgtable-3level_types.h
@@ -43,5 +43,6 @@ typedef union {
  */
 #define PTRS_PER_PTE	512
 
+#define MAX_POSSIBLE_PHYSMEM_BITS	36
 
 #endif /* _ASM_X86_PGTABLE_3LEVEL_DEFS_H */
diff --git a/arch/x86/include/asm/pgtable_64_types.h b/arch/x86/include/asm/pgtable_64_types.h
index 06470da156ba..39075df30b8a 100644
--- a/arch/x86/include/asm/pgtable_64_types.h
+++ b/arch/x86/include/asm/pgtable_64_types.h
@@ -39,6 +39,8 @@ typedef struct { pteval_t pte; } pte_t;
 #define P4D_SIZE	(_AC(1, UL) << P4D_SHIFT)
 #define P4D_MASK	(~(P4D_SIZE - 1))
 
+#define MAX_POSSIBLE_PHYSMEM_BITS	52
+
 #else /* CONFIG_X86_5LEVEL */
 
 /*
diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index 7c38e850a8fc..7bde01c55c90 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -82,18 +82,19 @@
  * This is made more complicated by various memory models and PAE.
  */
 
-#ifndef MAX_PHYSMEM_BITS
-#ifdef CONFIG_HIGHMEM64G
-#define MAX_PHYSMEM_BITS 36
-#else /* !CONFIG_HIGHMEM64G */
+#ifndef MAX_POSSIBLE_PHYSMEM_BITS
+#ifdef MAX_PHYSMEM_BITS
+#define MAX_POSSIBLE_PHYSMEM_BITS MAX_PHYSMEM_BITS
+#else
 /*
  * If this definition of MAX_PHYSMEM_BITS is used, OBJ_INDEX_BITS will just
  * be PAGE_SHIFT
  */
-#define MAX_PHYSMEM_BITS BITS_PER_LONG
+#define MAX_POSSIBLE_PHYSMEM_BITS BITS_PER_LONG
 #endif
 #endif
-#define _PFN_BITS		(MAX_PHYSMEM_BITS - PAGE_SHIFT)
+
+#define _PFN_BITS		(MAX_POSSIBLE_PHYSMEM_BITS - PAGE_SHIFT)
 
 /*
  * Memory for allocating for handle keeps object position by
-- 
2.14.2

* [PATCH 3/6] x86/kasan: Use the same shadow offset for 4- and 5-level paging
From: Kirill A. Shutemov @ 2017-09-29 14:08 UTC
  To: Ingo Molnar, Linus Torvalds, x86, Thomas Gleixner, H. Peter Anvin
  Cc: Andrew Morton, Andy Lutomirski, Cyrill Gorcunov, Borislav Petkov,
	linux-mm, linux-kernel, Andrey Ryabinin, Kirill A . Shutemov

From: Andrey Ryabinin <aryabinin@virtuozzo.com>

We are going to support boot-time switching between 4- and 5-level
paging. For KASAN this means we cannot have a different KASAN_SHADOW_OFFSET
for different paging modes: the constant is passed to gcc to generate
code and cannot be changed at runtime.

This patch changes the KASAN code to use 0xdffffc0000000000 as the shadow
offset for both 4- and 5-level paging.
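
As a back-of-envelope check of the chosen constant (a sketch; KASAN's
usual 8-to-1 byte mapping with a scale shift of 3 is assumed):

  /*
   * shadow(addr) = (addr >> 3) + KASAN_SHADOW_OFFSET
   *
   * For the very top of the address space:
   *   shadow(0xffffffffffffffff)
   *     = 0x1fffffffffffffff + 0xdffffc0000000000
   *     = 0xfffffbffffffffff
   *
   * The shadow region thus ends just below 0xfffffc0000000000 in both
   * paging modes, matching the mm.txt change below.
   */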

For 5-level paging this means that the shadow memory region is no longer
aligned to the PGD boundary and we have to handle the unaligned parts of
the region properly.

In addition, we have to exclude the paravirt code from KASAN
instrumentation, as we now use set_pgd() before KASAN is fully ready.

Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
[kirill.shutemov@linux.intel.com: cleanup, changelog message]
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 Documentation/x86/x86_64/mm.txt |   2 +-
 arch/x86/Kconfig                |   1 -
 arch/x86/kernel/Makefile        |   3 +-
 arch/x86/mm/kasan_init_64.c     | 101 +++++++++++++++++++++++++++++++---------
 4 files changed, 83 insertions(+), 24 deletions(-)

diff --git a/Documentation/x86/x86_64/mm.txt b/Documentation/x86/x86_64/mm.txt
index b0798e281aa6..3448e675b462 100644
--- a/Documentation/x86/x86_64/mm.txt
+++ b/Documentation/x86/x86_64/mm.txt
@@ -34,7 +34,7 @@ ff92000000000000 - ffd1ffffffffffff (=54 bits) vmalloc/ioremap space
 ffd2000000000000 - ffd3ffffffffffff (=49 bits) hole
 ffd4000000000000 - ffd5ffffffffffff (=49 bits) virtual memory map (512TB)
 ... unused hole ...
-ffd8000000000000 - fff7ffffffffffff (=53 bits) kasan shadow memory (8PB)
+ffdf000000000000 - fffffc0000000000 (=53 bits) kasan shadow memory (8PB)
 ... unused hole ...
 ffffff0000000000 - ffffff7fffffffff (=39 bits) %esp fixup stacks
 ... unused hole ...
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 64e99d3c5169..6a15297140ff 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -303,7 +303,6 @@ config ARCH_SUPPORTS_DEBUG_PAGEALLOC
 config KASAN_SHADOW_OFFSET
 	hex
 	depends on KASAN
-	default 0xdff8000000000000 if X86_5LEVEL
 	default 0xdffffc0000000000
 
 config HAVE_INTEL_TXT
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index fd0a7895b63f..a97a6b611531 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -24,7 +24,8 @@ endif
 KASAN_SANITIZE_head$(BITS).o				:= n
 KASAN_SANITIZE_dumpstack.o				:= n
 KASAN_SANITIZE_dumpstack_$(BITS).o			:= n
-KASAN_SANITIZE_stacktrace.o := n
+KASAN_SANITIZE_stacktrace.o				:= n
+KASAN_SANITIZE_paravirt.o				:= n
 
 OBJECT_FILES_NON_STANDARD_head_$(BITS).o		:= y
 OBJECT_FILES_NON_STANDARD_relocate_kernel_$(BITS).o	:= y
diff --git a/arch/x86/mm/kasan_init_64.c b/arch/x86/mm/kasan_init_64.c
index bc84b73684b7..fe5760db7b19 100644
--- a/arch/x86/mm/kasan_init_64.c
+++ b/arch/x86/mm/kasan_init_64.c
@@ -15,6 +15,8 @@
 
 extern struct range pfn_mapped[E820_MAX_ENTRIES];
 
+static p4d_t tmp_p4d_table[PTRS_PER_P4D] __initdata __aligned(PAGE_SIZE);
+
 static int __init map_range(struct range *range)
 {
 	unsigned long start;
@@ -30,8 +32,10 @@ static void __init clear_pgds(unsigned long start,
 			unsigned long end)
 {
 	pgd_t *pgd;
+	/* See comment in kasan_init() */
+	unsigned long pgd_end = end & PGDIR_MASK;
 
-	for (; start < end; start += PGDIR_SIZE) {
+	for (; start < pgd_end; start += PGDIR_SIZE) {
 		pgd = pgd_offset_k(start);
 		/*
 		 * With folded p4d, pgd_clear() is nop, use p4d_clear()
@@ -42,29 +46,61 @@ static void __init clear_pgds(unsigned long start,
 		else
 			pgd_clear(pgd);
 	}
+
+	pgd = pgd_offset_k(start);
+	for (; start < end; start += P4D_SIZE)
+		p4d_clear(p4d_offset(pgd, start));
+}
+
+static inline p4d_t *early_p4d_offset(pgd_t *pgd, unsigned long addr)
+{
+	unsigned long p4d;
+
+	if (!IS_ENABLED(CONFIG_X86_5LEVEL))
+		return (p4d_t *)pgd;
+
+	p4d = __pa_nodebug(pgd_val(*pgd)) & PTE_PFN_MASK;
+	p4d += __START_KERNEL_map - phys_base;
+	return (p4d_t *)p4d + p4d_index(addr);
+}
+
+static void __init kasan_early_p4d_populate(pgd_t *pgd,
+		unsigned long addr,
+		unsigned long end)
+{
+	pgd_t pgd_entry;
+	p4d_t *p4d, p4d_entry;
+	unsigned long next;
+
+	if (pgd_none(*pgd)) {
+		pgd_entry = __pgd(_KERNPG_TABLE | __pa_nodebug(kasan_zero_p4d));
+		set_pgd(pgd, pgd_entry);
+	}
+
+	p4d = early_p4d_offset(pgd, addr);
+	do {
+		next = p4d_addr_end(addr, end);
+
+		if (!p4d_none(*p4d))
+			continue;
+
+		p4d_entry = __p4d(_KERNPG_TABLE | __pa_nodebug(kasan_zero_pud));
+		set_p4d(p4d, p4d_entry);
+	} while (p4d++, addr = next, addr != end && p4d_none(*p4d));
 }
 
 static void __init kasan_map_early_shadow(pgd_t *pgd)
 {
-	int i;
-	unsigned long start = KASAN_SHADOW_START;
+	/* See comment in kasan_init() */
+	unsigned long addr = KASAN_SHADOW_START & PGDIR_MASK;
 	unsigned long end = KASAN_SHADOW_END;
+	unsigned long next;
 
-	for (i = pgd_index(start); start < end; i++) {
-		switch (CONFIG_PGTABLE_LEVELS) {
-		case 4:
-			pgd[i] = __pgd(__pa_nodebug(kasan_zero_pud) |
-					_KERNPG_TABLE);
-			break;
-		case 5:
-			pgd[i] = __pgd(__pa_nodebug(kasan_zero_p4d) |
-					_KERNPG_TABLE);
-			break;
-		default:
-			BUILD_BUG();
-		}
-		start += PGDIR_SIZE;
-	}
+	pgd += pgd_index(addr);
+	do {
+		next = pgd_addr_end(addr, end);
+		kasan_early_p4d_populate(pgd, addr, next);
+	} while (pgd++, addr = next, addr != end);
 }
 
 #ifdef CONFIG_KASAN_INLINE
@@ -101,7 +137,7 @@ void __init kasan_early_init(void)
 	for (i = 0; i < PTRS_PER_PUD; i++)
 		kasan_zero_pud[i] = __pud(pud_val);
 
-	for (i = 0; CONFIG_PGTABLE_LEVELS >= 5 && i < PTRS_PER_P4D; i++)
+	for (i = 0; IS_ENABLED(CONFIG_X86_5LEVEL) && i < PTRS_PER_P4D; i++)
 		kasan_zero_p4d[i] = __p4d(p4d_val);
 
 	kasan_map_early_shadow(early_top_pgt);
@@ -117,12 +153,35 @@ void __init kasan_init(void)
 #endif
 
 	memcpy(early_top_pgt, init_top_pgt, sizeof(early_top_pgt));
+
+	/*
+	 * We use the same shadow offset for 4- and 5-level paging to
+	 * facilitate boot-time switching between paging modes.
+	 * As result in 5-level paging mode KASAN_SHADOW_START and
+	 * KASAN_SHADOW_END are not aligned to PGD boundary.
+	 *
+	 * KASAN_SHADOW_START doesn't share PGD with anything else.
+	 * We claim whole PGD entry to make things easier.
+	 *
+	 * KASAN_SHADOW_END lands in the last PGD entry and it collides with
+	 * bunch of things like kernel code, modules, EFI mapping, etc.
+	 * We need to take extra steps to not overwrite them.
+	 */
+	if (IS_ENABLED(CONFIG_X86_5LEVEL)) {
+		void *ptr;
+
+		ptr = (void *)pgd_page_vaddr(*pgd_offset_k(KASAN_SHADOW_END));
+		memcpy(tmp_p4d_table, (void *)ptr, sizeof(tmp_p4d_table));
+		set_pgd(&early_top_pgt[pgd_index(KASAN_SHADOW_END)],
+				__pgd(__pa(tmp_p4d_table) | _KERNPG_TABLE));
+	}
+
 	load_cr3(early_top_pgt);
 	__flush_tlb_all();
 
-	clear_pgds(KASAN_SHADOW_START, KASAN_SHADOW_END);
+	clear_pgds(KASAN_SHADOW_START & PGDIR_MASK, KASAN_SHADOW_END);
 
-	kasan_populate_zero_shadow((void *)KASAN_SHADOW_START,
+	kasan_populate_zero_shadow((void *)(KASAN_SHADOW_START & PGDIR_MASK),
 			kasan_mem_to_shadow((void *)PAGE_OFFSET));
 
 	for (i = 0; i < E820_MAX_ENTRIES; i++) {
-- 
2.14.2

* [PATCH 4/6] x86/xen: Provide pre-built page tables only for XEN_PV and XEN_PVH
From: Kirill A. Shutemov @ 2017-09-29 14:08 UTC
  To: Ingo Molnar, Linus Torvalds, x86, Thomas Gleixner, H. Peter Anvin
  Cc: Andrew Morton, Andy Lutomirski, Cyrill Gorcunov, Borislav Petkov,
	linux-mm, linux-kernel, Kirill A. Shutemov

It looks like we only need the pre-built page tables for the XEN_PV and
XEN_PVH cases. Let's not provide them for other configurations.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
---
 arch/x86/kernel/head_64.S | 11 ++++++-----
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
index 513cbb012ecc..2be7d1e7fcf1 100644
--- a/arch/x86/kernel/head_64.S
+++ b/arch/x86/kernel/head_64.S
@@ -37,11 +37,12 @@
  *
  */
 
-#define p4d_index(x)	(((x) >> P4D_SHIFT) & (PTRS_PER_P4D-1))
 #define pud_index(x)	(((x) >> PUD_SHIFT) & (PTRS_PER_PUD-1))
 
+#if defined(CONFIG_XEN_PV) || defined(CONFIG_XEN_PVH)
 PGD_PAGE_OFFSET = pgd_index(__PAGE_OFFSET_BASE)
 PGD_START_KERNEL = pgd_index(__START_KERNEL_map)
+#endif
 L3_START_KERNEL = pud_index(__START_KERNEL_map)
 
 	.text
@@ -361,10 +362,7 @@ NEXT_PAGE(early_dynamic_pgts)
 
 	.data
 
-#ifndef CONFIG_XEN
-NEXT_PAGE(init_top_pgt)
-	.fill	512,8,0
-#else
+#if defined(CONFIG_XEN_PV) || defined(CONFIG_XEN_PVH)
 NEXT_PAGE(init_top_pgt)
 	.quad   level3_ident_pgt - __START_KERNEL_map + _KERNPG_TABLE_NOENC
 	.org    init_top_pgt + PGD_PAGE_OFFSET*8, 0
@@ -381,6 +379,9 @@ NEXT_PAGE(level2_ident_pgt)
 	 * Don't set NX because code runs from these pages.
 	 */
 	PMDS(0, __PAGE_KERNEL_IDENT_LARGE_EXEC, PTRS_PER_PMD)
+#else
+NEXT_PAGE(init_top_pgt)
+	.fill	512,8,0
 #endif
 
 #ifdef CONFIG_X86_5LEVEL
-- 
2.14.2

* [PATCH 5/6] x86/xen: Drop 5-level paging support code from XEN_PV code
From: Kirill A. Shutemov @ 2017-09-29 14:08 UTC
  To: Ingo Molnar, Linus Torvalds, x86, Thomas Gleixner, H. Peter Anvin
  Cc: Andrew Morton, Andy Lutomirski, Cyrill Gorcunov, Borislav Petkov,
	linux-mm, linux-kernel, Kirill A. Shutemov

It was decided that 5-level paging is not going to be supported in XEN_PV.

Let's drop the dead code from the XEN_PV side.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
Tested-by: Juergen Gross <jgross@suse.com>
---
 arch/x86/xen/mmu_pv.c | 159 +++++++++++++++++++-------------------------------
 1 file changed, 60 insertions(+), 99 deletions(-)

diff --git a/arch/x86/xen/mmu_pv.c b/arch/x86/xen/mmu_pv.c
index 509f560bd0c6..5811815cc6ef 100644
--- a/arch/x86/xen/mmu_pv.c
+++ b/arch/x86/xen/mmu_pv.c
@@ -449,7 +449,7 @@ __visible pmd_t xen_make_pmd(pmdval_t pmd)
 }
 PV_CALLEE_SAVE_REGS_THUNK(xen_make_pmd);
 
-#if CONFIG_PGTABLE_LEVELS == 4
+#ifdef CONFIG_X86_64
 __visible pudval_t xen_pud_val(pud_t pud)
 {
 	return pte_mfn_to_pfn(pud.pud);
@@ -538,7 +538,7 @@ static void xen_set_p4d(p4d_t *ptr, p4d_t val)
 
 	xen_mc_issue(PARAVIRT_LAZY_MMU);
 }
-#endif	/* CONFIG_PGTABLE_LEVELS == 4 */
+#endif	/* CONFIG_X86_64 */
 
 static int xen_pmd_walk(struct mm_struct *mm, pmd_t *pmd,
 		int (*func)(struct mm_struct *mm, struct page *, enum pt_level),
@@ -580,21 +580,17 @@ static int xen_p4d_walk(struct mm_struct *mm, p4d_t *p4d,
 		int (*func)(struct mm_struct *mm, struct page *, enum pt_level),
 		bool last, unsigned long limit)
 {
-	int i, nr, flush = 0;
+	int flush = 0;
+	pud_t *pud;
 
-	nr = last ? p4d_index(limit) + 1 : PTRS_PER_P4D;
-	for (i = 0; i < nr; i++) {
-		pud_t *pud;
 
-		if (p4d_none(p4d[i]))
-			continue;
+	if (p4d_none(*p4d))
+		return flush;
 
-		pud = pud_offset(&p4d[i], 0);
-		if (PTRS_PER_PUD > 1)
-			flush |= (*func)(mm, virt_to_page(pud), PT_PUD);
-		flush |= xen_pud_walk(mm, pud, func,
-				last && i == nr - 1, limit);
-	}
+	pud = pud_offset(p4d, 0);
+	if (PTRS_PER_PUD > 1)
+		flush |= (*func)(mm, virt_to_page(pud), PT_PUD);
+	flush |= xen_pud_walk(mm, pud, func, last, limit);
 	return flush;
 }
 
@@ -644,8 +640,6 @@ static int __xen_pgd_walk(struct mm_struct *mm, pgd_t *pgd,
 			continue;
 
 		p4d = p4d_offset(&pgd[i], 0);
-		if (PTRS_PER_P4D > 1)
-			flush |= (*func)(mm, virt_to_page(p4d), PT_P4D);
 		flush |= xen_p4d_walk(mm, p4d, func, i == nr - 1, limit);
 	}
 
@@ -1176,22 +1170,14 @@ static void __init xen_cleanmfnmap(unsigned long vaddr)
 {
 	pgd_t *pgd;
 	p4d_t *p4d;
-	unsigned int i;
 	bool unpin;
 
 	unpin = (vaddr == 2 * PGDIR_SIZE);
 	vaddr &= PMD_MASK;
 	pgd = pgd_offset_k(vaddr);
 	p4d = p4d_offset(pgd, 0);
-	for (i = 0; i < PTRS_PER_P4D; i++) {
-		if (p4d_none(p4d[i]))
-			continue;
-		xen_cleanmfnmap_p4d(p4d + i, unpin);
-	}
-	if (IS_ENABLED(CONFIG_X86_5LEVEL)) {
-		set_pgd(pgd, __pgd(0));
-		xen_cleanmfnmap_free_pgtbl(p4d, unpin);
-	}
+	if (!p4d_none(*p4d))
+		xen_cleanmfnmap_p4d(p4d, unpin);
 }
 
 static void __init xen_pagetable_p2m_free(void)
@@ -1697,7 +1683,7 @@ static void xen_release_pmd(unsigned long pfn)
 	xen_release_ptpage(pfn, PT_PMD);
 }
 
-#if CONFIG_PGTABLE_LEVELS >= 4
+#ifdef CONFIG_X86_64
 static void xen_alloc_pud(struct mm_struct *mm, unsigned long pfn)
 {
 	xen_alloc_ptpage(mm, pfn, PT_PUD);
@@ -2034,13 +2020,12 @@ static phys_addr_t __init xen_early_virt_to_phys(unsigned long vaddr)
  */
 void __init xen_relocate_p2m(void)
 {
-	phys_addr_t size, new_area, pt_phys, pmd_phys, pud_phys, p4d_phys;
+	phys_addr_t size, new_area, pt_phys, pmd_phys, pud_phys;
 	unsigned long p2m_pfn, p2m_pfn_end, n_frames, pfn, pfn_end;
-	int n_pte, n_pt, n_pmd, n_pud, n_p4d, idx_pte, idx_pt, idx_pmd, idx_pud, idx_p4d;
+	int n_pte, n_pt, n_pmd, n_pud, idx_pte, idx_pt, idx_pmd, idx_pud;
 	pte_t *pt;
 	pmd_t *pmd;
 	pud_t *pud;
-	p4d_t *p4d = NULL;
 	pgd_t *pgd;
 	unsigned long *new_p2m;
 	int save_pud;
@@ -2050,11 +2035,7 @@ void __init xen_relocate_p2m(void)
 	n_pt = roundup(size, PMD_SIZE) >> PMD_SHIFT;
 	n_pmd = roundup(size, PUD_SIZE) >> PUD_SHIFT;
 	n_pud = roundup(size, P4D_SIZE) >> P4D_SHIFT;
-	if (PTRS_PER_P4D > 1)
-		n_p4d = roundup(size, PGDIR_SIZE) >> PGDIR_SHIFT;
-	else
-		n_p4d = 0;
-	n_frames = n_pte + n_pt + n_pmd + n_pud + n_p4d;
+	n_frames = n_pte + n_pt + n_pmd + n_pud;
 
 	new_area = xen_find_free_area(PFN_PHYS(n_frames));
 	if (!new_area) {
@@ -2070,76 +2051,56 @@ void __init xen_relocate_p2m(void)
 	 * To avoid any possible virtual address collision, just use
 	 * 2 * PUD_SIZE for the new area.
 	 */
-	p4d_phys = new_area;
-	pud_phys = p4d_phys + PFN_PHYS(n_p4d);
+	pud_phys = new_area;
 	pmd_phys = pud_phys + PFN_PHYS(n_pud);
 	pt_phys = pmd_phys + PFN_PHYS(n_pmd);
 	p2m_pfn = PFN_DOWN(pt_phys) + n_pt;
 
 	pgd = __va(read_cr3_pa());
 	new_p2m = (unsigned long *)(2 * PGDIR_SIZE);
-	idx_p4d = 0;
 	save_pud = n_pud;
-	do {
-		if (n_p4d > 0) {
-			p4d = early_memremap(p4d_phys, PAGE_SIZE);
-			clear_page(p4d);
-			n_pud = min(save_pud, PTRS_PER_P4D);
-		}
-		for (idx_pud = 0; idx_pud < n_pud; idx_pud++) {
-			pud = early_memremap(pud_phys, PAGE_SIZE);
-			clear_page(pud);
-			for (idx_pmd = 0; idx_pmd < min(n_pmd, PTRS_PER_PUD);
-				 idx_pmd++) {
-				pmd = early_memremap(pmd_phys, PAGE_SIZE);
-				clear_page(pmd);
-				for (idx_pt = 0; idx_pt < min(n_pt, PTRS_PER_PMD);
-					 idx_pt++) {
-					pt = early_memremap(pt_phys, PAGE_SIZE);
-					clear_page(pt);
-					for (idx_pte = 0;
-						 idx_pte < min(n_pte, PTRS_PER_PTE);
-						 idx_pte++) {
-						set_pte(pt + idx_pte,
-								pfn_pte(p2m_pfn, PAGE_KERNEL));
-						p2m_pfn++;
-					}
-					n_pte -= PTRS_PER_PTE;
-					early_memunmap(pt, PAGE_SIZE);
-					make_lowmem_page_readonly(__va(pt_phys));
-					pin_pagetable_pfn(MMUEXT_PIN_L1_TABLE,
-							PFN_DOWN(pt_phys));
-					set_pmd(pmd + idx_pt,
-							__pmd(_PAGE_TABLE | pt_phys));
-					pt_phys += PAGE_SIZE;
+	for (idx_pud = 0; idx_pud < n_pud; idx_pud++) {
+		pud = early_memremap(pud_phys, PAGE_SIZE);
+		clear_page(pud);
+		for (idx_pmd = 0; idx_pmd < min(n_pmd, PTRS_PER_PUD);
+				idx_pmd++) {
+			pmd = early_memremap(pmd_phys, PAGE_SIZE);
+			clear_page(pmd);
+			for (idx_pt = 0; idx_pt < min(n_pt, PTRS_PER_PMD);
+					idx_pt++) {
+				pt = early_memremap(pt_phys, PAGE_SIZE);
+				clear_page(pt);
+				for (idx_pte = 0;
+						idx_pte < min(n_pte, PTRS_PER_PTE);
+						idx_pte++) {
+					set_pte(pt + idx_pte,
+							pfn_pte(p2m_pfn, PAGE_KERNEL));
+					p2m_pfn++;
 				}
-				n_pt -= PTRS_PER_PMD;
-				early_memunmap(pmd, PAGE_SIZE);
-				make_lowmem_page_readonly(__va(pmd_phys));
-				pin_pagetable_pfn(MMUEXT_PIN_L2_TABLE,
-						PFN_DOWN(pmd_phys));
-				set_pud(pud + idx_pmd, __pud(_PAGE_TABLE | pmd_phys));
-				pmd_phys += PAGE_SIZE;
+				n_pte -= PTRS_PER_PTE;
+				early_memunmap(pt, PAGE_SIZE);
+				make_lowmem_page_readonly(__va(pt_phys));
+				pin_pagetable_pfn(MMUEXT_PIN_L1_TABLE,
+						PFN_DOWN(pt_phys));
+				set_pmd(pmd + idx_pt,
+						__pmd(_PAGE_TABLE | pt_phys));
+				pt_phys += PAGE_SIZE;
 			}
-			n_pmd -= PTRS_PER_PUD;
-			early_memunmap(pud, PAGE_SIZE);
-			make_lowmem_page_readonly(__va(pud_phys));
-			pin_pagetable_pfn(MMUEXT_PIN_L3_TABLE, PFN_DOWN(pud_phys));
-			if (n_p4d > 0)
-				set_p4d(p4d + idx_pud, __p4d(_PAGE_TABLE | pud_phys));
-			else
-				set_pgd(pgd + 2 + idx_pud, __pgd(_PAGE_TABLE | pud_phys));
-			pud_phys += PAGE_SIZE;
-		}
-		if (n_p4d > 0) {
-			save_pud -= PTRS_PER_P4D;
-			early_memunmap(p4d, PAGE_SIZE);
-			make_lowmem_page_readonly(__va(p4d_phys));
-			pin_pagetable_pfn(MMUEXT_PIN_L4_TABLE, PFN_DOWN(p4d_phys));
-			set_pgd(pgd + 2 + idx_p4d, __pgd(_PAGE_TABLE | p4d_phys));
-			p4d_phys += PAGE_SIZE;
+			n_pt -= PTRS_PER_PMD;
+			early_memunmap(pmd, PAGE_SIZE);
+			make_lowmem_page_readonly(__va(pmd_phys));
+			pin_pagetable_pfn(MMUEXT_PIN_L2_TABLE,
+					PFN_DOWN(pmd_phys));
+			set_pud(pud + idx_pmd, __pud(_PAGE_TABLE | pmd_phys));
+			pmd_phys += PAGE_SIZE;
 		}
-	} while (++idx_p4d < n_p4d);
+		n_pmd -= PTRS_PER_PUD;
+		early_memunmap(pud, PAGE_SIZE);
+		make_lowmem_page_readonly(__va(pud_phys));
+		pin_pagetable_pfn(MMUEXT_PIN_L3_TABLE, PFN_DOWN(pud_phys));
+		set_pgd(pgd + 2 + idx_pud, __pgd(_PAGE_TABLE | pud_phys));
+		pud_phys += PAGE_SIZE;
+	}
 
 	/* Now copy the old p2m info to the new area. */
 	memcpy(new_p2m, xen_p2m_addr, size);
@@ -2366,7 +2327,7 @@ static void __init xen_post_allocator_init(void)
 	pv_mmu_ops.set_pte = xen_set_pte;
 	pv_mmu_ops.set_pmd = xen_set_pmd;
 	pv_mmu_ops.set_pud = xen_set_pud;
-#if CONFIG_PGTABLE_LEVELS >= 4
+#ifdef CONFIG_X86_64
 	pv_mmu_ops.set_p4d = xen_set_p4d;
 #endif
 
@@ -2376,7 +2337,7 @@ static void __init xen_post_allocator_init(void)
 	pv_mmu_ops.alloc_pmd = xen_alloc_pmd;
 	pv_mmu_ops.release_pte = xen_release_pte;
 	pv_mmu_ops.release_pmd = xen_release_pmd;
-#if CONFIG_PGTABLE_LEVELS >= 4
+#ifdef CONFIG_X86_64
 	pv_mmu_ops.alloc_pud = xen_alloc_pud;
 	pv_mmu_ops.release_pud = xen_release_pud;
 #endif
@@ -2440,14 +2401,14 @@ static const struct pv_mmu_ops xen_mmu_ops __initconst = {
 	.make_pmd = PV_CALLEE_SAVE(xen_make_pmd),
 	.pmd_val = PV_CALLEE_SAVE(xen_pmd_val),
 
-#if CONFIG_PGTABLE_LEVELS >= 4
+#ifdef CONFIG_X86_64
 	.pud_val = PV_CALLEE_SAVE(xen_pud_val),
 	.make_pud = PV_CALLEE_SAVE(xen_make_pud),
 	.set_p4d = xen_set_p4d_hyper,
 
 	.alloc_pud = xen_alloc_pmd_init,
 	.release_pud = xen_release_pmd_init,
-#endif	/* CONFIG_PGTABLE_LEVELS == 4 */
+#endif	/* CONFIG_X86_64 */
 
 	.activate_mm = xen_activate_mm,
 	.dup_mmap = xen_dup_mmap,
-- 
2.14.2

^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [PATCH 5/6] x86/xen: Drop 5-level paging support code from XEN_PV code
@ 2017-09-29 14:08   ` Kirill A. Shutemov
  0 siblings, 0 replies; 76+ messages in thread
From: Kirill A. Shutemov @ 2017-09-29 14:08 UTC (permalink / raw)
  To: Ingo Molnar, Linus Torvalds, x86, Thomas Gleixner, H. Peter Anvin
  Cc: Andrew Morton, Andy Lutomirski, Cyrill Gorcunov, Borislav Petkov,
	linux-mm, linux-kernel, Kirill A. Shutemov

It was decided 5-level paging is not going to be supported in XEN_PV.

Let's drop the dead code from XEN_PV code.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
Tested-by: Juergen Gross <jgross@suse.com>
---
 arch/x86/xen/mmu_pv.c | 159 +++++++++++++++++++-------------------------------
 1 file changed, 60 insertions(+), 99 deletions(-)

diff --git a/arch/x86/xen/mmu_pv.c b/arch/x86/xen/mmu_pv.c
index 509f560bd0c6..5811815cc6ef 100644
--- a/arch/x86/xen/mmu_pv.c
+++ b/arch/x86/xen/mmu_pv.c
@@ -449,7 +449,7 @@ __visible pmd_t xen_make_pmd(pmdval_t pmd)
 }
 PV_CALLEE_SAVE_REGS_THUNK(xen_make_pmd);
 
-#if CONFIG_PGTABLE_LEVELS == 4
+#ifdef CONFIG_X86_64
 __visible pudval_t xen_pud_val(pud_t pud)
 {
 	return pte_mfn_to_pfn(pud.pud);
@@ -538,7 +538,7 @@ static void xen_set_p4d(p4d_t *ptr, p4d_t val)
 
 	xen_mc_issue(PARAVIRT_LAZY_MMU);
 }
-#endif	/* CONFIG_PGTABLE_LEVELS == 4 */
+#endif	/* CONFIG_X86_64 */
 
 static int xen_pmd_walk(struct mm_struct *mm, pmd_t *pmd,
 		int (*func)(struct mm_struct *mm, struct page *, enum pt_level),
@@ -580,21 +580,17 @@ static int xen_p4d_walk(struct mm_struct *mm, p4d_t *p4d,
 		int (*func)(struct mm_struct *mm, struct page *, enum pt_level),
 		bool last, unsigned long limit)
 {
-	int i, nr, flush = 0;
+	int flush = 0;
+	pud_t *pud;
 
-	nr = last ? p4d_index(limit) + 1 : PTRS_PER_P4D;
-	for (i = 0; i < nr; i++) {
-		pud_t *pud;
 
-		if (p4d_none(p4d[i]))
-			continue;
+	if (p4d_none(*p4d))
+		return flush;
 
-		pud = pud_offset(&p4d[i], 0);
-		if (PTRS_PER_PUD > 1)
-			flush |= (*func)(mm, virt_to_page(pud), PT_PUD);
-		flush |= xen_pud_walk(mm, pud, func,
-				last && i == nr - 1, limit);
-	}
+	pud = pud_offset(p4d, 0);
+	if (PTRS_PER_PUD > 1)
+		flush |= (*func)(mm, virt_to_page(pud), PT_PUD);
+	flush |= xen_pud_walk(mm, pud, func, last, limit);
 	return flush;
 }
 
@@ -644,8 +640,6 @@ static int __xen_pgd_walk(struct mm_struct *mm, pgd_t *pgd,
 			continue;
 
 		p4d = p4d_offset(&pgd[i], 0);
-		if (PTRS_PER_P4D > 1)
-			flush |= (*func)(mm, virt_to_page(p4d), PT_P4D);
 		flush |= xen_p4d_walk(mm, p4d, func, i == nr - 1, limit);
 	}
 
@@ -1176,22 +1170,14 @@ static void __init xen_cleanmfnmap(unsigned long vaddr)
 {
 	pgd_t *pgd;
 	p4d_t *p4d;
-	unsigned int i;
 	bool unpin;
 
 	unpin = (vaddr == 2 * PGDIR_SIZE);
 	vaddr &= PMD_MASK;
 	pgd = pgd_offset_k(vaddr);
 	p4d = p4d_offset(pgd, 0);
-	for (i = 0; i < PTRS_PER_P4D; i++) {
-		if (p4d_none(p4d[i]))
-			continue;
-		xen_cleanmfnmap_p4d(p4d + i, unpin);
-	}
-	if (IS_ENABLED(CONFIG_X86_5LEVEL)) {
-		set_pgd(pgd, __pgd(0));
-		xen_cleanmfnmap_free_pgtbl(p4d, unpin);
-	}
+	if (!p4d_none(*p4d))
+		xen_cleanmfnmap_p4d(p4d, unpin);
 }
 
 static void __init xen_pagetable_p2m_free(void)
@@ -1697,7 +1683,7 @@ static void xen_release_pmd(unsigned long pfn)
 	xen_release_ptpage(pfn, PT_PMD);
 }
 
-#if CONFIG_PGTABLE_LEVELS >= 4
+#ifdef CONFIG_X86_64
 static void xen_alloc_pud(struct mm_struct *mm, unsigned long pfn)
 {
 	xen_alloc_ptpage(mm, pfn, PT_PUD);
@@ -2034,13 +2020,12 @@ static phys_addr_t __init xen_early_virt_to_phys(unsigned long vaddr)
  */
 void __init xen_relocate_p2m(void)
 {
-	phys_addr_t size, new_area, pt_phys, pmd_phys, pud_phys, p4d_phys;
+	phys_addr_t size, new_area, pt_phys, pmd_phys, pud_phys;
 	unsigned long p2m_pfn, p2m_pfn_end, n_frames, pfn, pfn_end;
-	int n_pte, n_pt, n_pmd, n_pud, n_p4d, idx_pte, idx_pt, idx_pmd, idx_pud, idx_p4d;
+	int n_pte, n_pt, n_pmd, n_pud, idx_pte, idx_pt, idx_pmd, idx_pud;
 	pte_t *pt;
 	pmd_t *pmd;
 	pud_t *pud;
-	p4d_t *p4d = NULL;
 	pgd_t *pgd;
 	unsigned long *new_p2m;
 	int save_pud;
@@ -2050,11 +2035,7 @@ void __init xen_relocate_p2m(void)
 	n_pt = roundup(size, PMD_SIZE) >> PMD_SHIFT;
 	n_pmd = roundup(size, PUD_SIZE) >> PUD_SHIFT;
 	n_pud = roundup(size, P4D_SIZE) >> P4D_SHIFT;
-	if (PTRS_PER_P4D > 1)
-		n_p4d = roundup(size, PGDIR_SIZE) >> PGDIR_SHIFT;
-	else
-		n_p4d = 0;
-	n_frames = n_pte + n_pt + n_pmd + n_pud + n_p4d;
+	n_frames = n_pte + n_pt + n_pmd + n_pud;
 
 	new_area = xen_find_free_area(PFN_PHYS(n_frames));
 	if (!new_area) {
@@ -2070,76 +2051,56 @@ void __init xen_relocate_p2m(void)
 	 * To avoid any possible virtual address collision, just use
 	 * 2 * PUD_SIZE for the new area.
 	 */
-	p4d_phys = new_area;
-	pud_phys = p4d_phys + PFN_PHYS(n_p4d);
+	pud_phys = new_area;
 	pmd_phys = pud_phys + PFN_PHYS(n_pud);
 	pt_phys = pmd_phys + PFN_PHYS(n_pmd);
 	p2m_pfn = PFN_DOWN(pt_phys) + n_pt;
 
 	pgd = __va(read_cr3_pa());
 	new_p2m = (unsigned long *)(2 * PGDIR_SIZE);
-	idx_p4d = 0;
 	save_pud = n_pud;
-	do {
-		if (n_p4d > 0) {
-			p4d = early_memremap(p4d_phys, PAGE_SIZE);
-			clear_page(p4d);
-			n_pud = min(save_pud, PTRS_PER_P4D);
-		}
-		for (idx_pud = 0; idx_pud < n_pud; idx_pud++) {
-			pud = early_memremap(pud_phys, PAGE_SIZE);
-			clear_page(pud);
-			for (idx_pmd = 0; idx_pmd < min(n_pmd, PTRS_PER_PUD);
-				 idx_pmd++) {
-				pmd = early_memremap(pmd_phys, PAGE_SIZE);
-				clear_page(pmd);
-				for (idx_pt = 0; idx_pt < min(n_pt, PTRS_PER_PMD);
-					 idx_pt++) {
-					pt = early_memremap(pt_phys, PAGE_SIZE);
-					clear_page(pt);
-					for (idx_pte = 0;
-						 idx_pte < min(n_pte, PTRS_PER_PTE);
-						 idx_pte++) {
-						set_pte(pt + idx_pte,
-								pfn_pte(p2m_pfn, PAGE_KERNEL));
-						p2m_pfn++;
-					}
-					n_pte -= PTRS_PER_PTE;
-					early_memunmap(pt, PAGE_SIZE);
-					make_lowmem_page_readonly(__va(pt_phys));
-					pin_pagetable_pfn(MMUEXT_PIN_L1_TABLE,
-							PFN_DOWN(pt_phys));
-					set_pmd(pmd + idx_pt,
-							__pmd(_PAGE_TABLE | pt_phys));
-					pt_phys += PAGE_SIZE;
+	for (idx_pud = 0; idx_pud < n_pud; idx_pud++) {
+		pud = early_memremap(pud_phys, PAGE_SIZE);
+		clear_page(pud);
+		for (idx_pmd = 0; idx_pmd < min(n_pmd, PTRS_PER_PUD);
+				idx_pmd++) {
+			pmd = early_memremap(pmd_phys, PAGE_SIZE);
+			clear_page(pmd);
+			for (idx_pt = 0; idx_pt < min(n_pt, PTRS_PER_PMD);
+					idx_pt++) {
+				pt = early_memremap(pt_phys, PAGE_SIZE);
+				clear_page(pt);
+				for (idx_pte = 0;
+						idx_pte < min(n_pte, PTRS_PER_PTE);
+						idx_pte++) {
+					set_pte(pt + idx_pte,
+							pfn_pte(p2m_pfn, PAGE_KERNEL));
+					p2m_pfn++;
 				}
-				n_pt -= PTRS_PER_PMD;
-				early_memunmap(pmd, PAGE_SIZE);
-				make_lowmem_page_readonly(__va(pmd_phys));
-				pin_pagetable_pfn(MMUEXT_PIN_L2_TABLE,
-						PFN_DOWN(pmd_phys));
-				set_pud(pud + idx_pmd, __pud(_PAGE_TABLE | pmd_phys));
-				pmd_phys += PAGE_SIZE;
+				n_pte -= PTRS_PER_PTE;
+				early_memunmap(pt, PAGE_SIZE);
+				make_lowmem_page_readonly(__va(pt_phys));
+				pin_pagetable_pfn(MMUEXT_PIN_L1_TABLE,
+						PFN_DOWN(pt_phys));
+				set_pmd(pmd + idx_pt,
+						__pmd(_PAGE_TABLE | pt_phys));
+				pt_phys += PAGE_SIZE;
 			}
-			n_pmd -= PTRS_PER_PUD;
-			early_memunmap(pud, PAGE_SIZE);
-			make_lowmem_page_readonly(__va(pud_phys));
-			pin_pagetable_pfn(MMUEXT_PIN_L3_TABLE, PFN_DOWN(pud_phys));
-			if (n_p4d > 0)
-				set_p4d(p4d + idx_pud, __p4d(_PAGE_TABLE | pud_phys));
-			else
-				set_pgd(pgd + 2 + idx_pud, __pgd(_PAGE_TABLE | pud_phys));
-			pud_phys += PAGE_SIZE;
-		}
-		if (n_p4d > 0) {
-			save_pud -= PTRS_PER_P4D;
-			early_memunmap(p4d, PAGE_SIZE);
-			make_lowmem_page_readonly(__va(p4d_phys));
-			pin_pagetable_pfn(MMUEXT_PIN_L4_TABLE, PFN_DOWN(p4d_phys));
-			set_pgd(pgd + 2 + idx_p4d, __pgd(_PAGE_TABLE | p4d_phys));
-			p4d_phys += PAGE_SIZE;
+			n_pt -= PTRS_PER_PMD;
+			early_memunmap(pmd, PAGE_SIZE);
+			make_lowmem_page_readonly(__va(pmd_phys));
+			pin_pagetable_pfn(MMUEXT_PIN_L2_TABLE,
+					PFN_DOWN(pmd_phys));
+			set_pud(pud + idx_pmd, __pud(_PAGE_TABLE | pmd_phys));
+			pmd_phys += PAGE_SIZE;
 		}
-	} while (++idx_p4d < n_p4d);
+		n_pmd -= PTRS_PER_PUD;
+		early_memunmap(pud, PAGE_SIZE);
+		make_lowmem_page_readonly(__va(pud_phys));
+		pin_pagetable_pfn(MMUEXT_PIN_L3_TABLE, PFN_DOWN(pud_phys));
+		set_pgd(pgd + 2 + idx_pud, __pgd(_PAGE_TABLE | pud_phys));
+		pud_phys += PAGE_SIZE;
+	}
 
 	/* Now copy the old p2m info to the new area. */
 	memcpy(new_p2m, xen_p2m_addr, size);
@@ -2366,7 +2327,7 @@ static void __init xen_post_allocator_init(void)
 	pv_mmu_ops.set_pte = xen_set_pte;
 	pv_mmu_ops.set_pmd = xen_set_pmd;
 	pv_mmu_ops.set_pud = xen_set_pud;
-#if CONFIG_PGTABLE_LEVELS >= 4
+#ifdef CONFIG_X86_64
 	pv_mmu_ops.set_p4d = xen_set_p4d;
 #endif
 
@@ -2376,7 +2337,7 @@ static void __init xen_post_allocator_init(void)
 	pv_mmu_ops.alloc_pmd = xen_alloc_pmd;
 	pv_mmu_ops.release_pte = xen_release_pte;
 	pv_mmu_ops.release_pmd = xen_release_pmd;
-#if CONFIG_PGTABLE_LEVELS >= 4
+#ifdef CONFIG_X86_64
 	pv_mmu_ops.alloc_pud = xen_alloc_pud;
 	pv_mmu_ops.release_pud = xen_release_pud;
 #endif
@@ -2440,14 +2401,14 @@ static const struct pv_mmu_ops xen_mmu_ops __initconst = {
 	.make_pmd = PV_CALLEE_SAVE(xen_make_pmd),
 	.pmd_val = PV_CALLEE_SAVE(xen_pmd_val),
 
-#if CONFIG_PGTABLE_LEVELS >= 4
+#ifdef CONFIG_X86_64
 	.pud_val = PV_CALLEE_SAVE(xen_pud_val),
 	.make_pud = PV_CALLEE_SAVE(xen_make_pud),
 	.set_p4d = xen_set_p4d_hyper,
 
 	.alloc_pud = xen_alloc_pmd_init,
 	.release_pud = xen_release_pmd_init,
-#endif	/* CONFIG_PGTABLE_LEVELS == 4 */
+#endif	/* CONFIG_X86_64 */
 
 	.activate_mm = xen_activate_mm,
 	.dup_mmap = xen_dup_mmap,
-- 
2.14.2

^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [PATCH 6/6] x86/boot/compressed/64: Detect and handle 5-level paging at boot-time
  2017-09-29 14:08 ` Kirill A. Shutemov
@ 2017-09-29 14:08   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 76+ messages in thread
From: Kirill A. Shutemov @ 2017-09-29 14:08 UTC (permalink / raw)
  To: Ingo Molnar, Linus Torvalds, x86, Thomas Gleixner, H. Peter Anvin
  Cc: Andrew Morton, Andy Lutomirski, Cyrill Gorcunov, Borislav Petkov,
	linux-mm, linux-kernel, Kirill A. Shutemov

This patch prepares the decompression code for boot-time switching
between 4- and 5-level paging.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/boot/compressed/head_64.S | 26 +++++++++++++++++++++++++-
 1 file changed, 25 insertions(+), 1 deletion(-)

diff --git a/arch/x86/boot/compressed/head_64.S b/arch/x86/boot/compressed/head_64.S
index b4a5d284391c..cefe4958fda9 100644
--- a/arch/x86/boot/compressed/head_64.S
+++ b/arch/x86/boot/compressed/head_64.S
@@ -288,7 +288,29 @@ ENTRY(startup_64)
 	leaq	boot_stack_end(%rbx), %rsp
 
 #ifdef CONFIG_X86_5LEVEL
-	/* Check if 5-level paging has already enabled */
+	/* Preserve RBX across CPUID */
+	movq	%rbx, %r8
+
+	/* Check if leaf 7 is supported */
+	xorl	%eax, %eax
+	cpuid
+	cmpl	$7, %eax
+	jb	lvl5
+
+	/*
+	 * Check if LA57 is supported.
+	 * The feature is enumerated with CPUID.(EAX=07H, ECX=0):ECX[bit 16]
+	 */
+	movl	$7, %eax
+	xorl	%ecx, %ecx
+	cpuid
+	andl	$(1 << 16), %ecx
+	jz	lvl5
+
+	/* Restore RBX */
+	movq	%r8, %rbx
+
+	/* Check if 5-level paging has already been enabled */
 	movq	%cr4, %rax
 	testl	$X86_CR4_LA57, %eax
 	jnz	lvl5
@@ -327,6 +349,8 @@ ENTRY(startup_64)
 	pushq	%rax
 	lretq
 lvl5:
+	/* Restore RBX */
+	movq	%r8, %rbx
 #endif
 
 	/* Zero EFLAGS */
-- 
2.14.2
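
A minimal C rendition of the CPUID sequence in the hunk above may help
readers who do not follow the assembly. This is an illustrative sketch
only, using the compiler's <cpuid.h> helpers (which the decompressor
itself cannot use); it is not part of the patch:

	#include <cpuid.h>	/* GCC/Clang CPUID helpers */
	#include <stdbool.h>

	/* CPUID.(EAX=07H, ECX=0):ECX[bit 16] enumerates LA57,
	 * i.e. 5-level paging support.
	 */
	static bool cpu_supports_la57(void)
	{
		unsigned int eax, ebx, ecx, edx;

		/* Leaf 0: EAX returns the highest supported standard leaf. */
		if (!__get_cpuid(0, &eax, &ebx, &ecx, &edx) || eax < 7)
			return false;

		/* Leaf 7, subleaf 0: structured extended feature flags. */
		__cpuid_count(7, 0, eax, ebx, ecx, edx);
		return ecx & (1U << 16);
	}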

^ permalink raw reply related	[flat|nested] 76+ messages in thread

* Re: [PATCH 0/6] Boot-time switching between 4- and 5-level paging for 4.15, Part 1
  2017-09-29 14:08 ` Kirill A. Shutemov
@ 2017-10-03  8:27   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 76+ messages in thread
From: Kirill A. Shutemov @ 2017-10-03  8:27 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Kirill A. Shutemov, Linus Torvalds, x86, Thomas Gleixner,
	H. Peter Anvin, Andrew Morton, Andy Lutomirski, Cyrill Gorcunov,
	Borislav Petkov, linux-mm, linux-kernel

On Fri, Sep 29, 2017 at 05:08:15PM +0300, Kirill A. Shutemov wrote:
> The first bunch of patches that prepare kernel to boot-time switching
> between paging modes.
> 
> Please review and consider applying.

Ping?

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 2/6] mm/zsmalloc: Prepare to variable MAX_PHYSMEM_BITS
  2017-09-29 14:08   ` Kirill A. Shutemov
@ 2017-10-14  0:00     ` Nitin Gupta
  -1 siblings, 0 replies; 76+ messages in thread
From: Nitin Gupta @ 2017-10-14  0:00 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Ingo Molnar, Linus Torvalds, x86, Thomas Gleixner,
	H. Peter Anvin, Andrew Morton, Andy Lutomirski, Cyrill Gorcunov,
	Borislav Petkov, linux-mm, linux-kernel, Minchan Kim,
	Sergey Senozhatsky

On Fri, Sep 29, 2017 at 7:08 AM, Kirill A. Shutemov
<kirill.shutemov@linux.intel.com> wrote:
> With boot-time switching between paging modes we will have a variable
> MAX_PHYSMEM_BITS.
>
> Let's use the maximum value possible for the CONFIG_X86_5LEVEL=y
> configuration to define zsmalloc data structures.
>
> The patch introduces MAX_POSSIBLE_PHYSMEM_BITS to cover such a case.
> It also suits well for handling the PAE special case.
>
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Cc: Minchan Kim <minchan@kernel.org>
> Cc: Nitin Gupta <ngupta@vflare.org>
> Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
> ---
>  arch/x86/include/asm/pgtable-3level_types.h |  1 +
>  arch/x86/include/asm/pgtable_64_types.h     |  2 ++
>  mm/zsmalloc.c                               | 13 +++++++------
>  3 files changed, 10 insertions(+), 6 deletions(-)
>
> diff --git a/arch/x86/include/asm/pgtable-3level_types.h b/arch/x86/include/asm/pgtable-3level_types.h
> index b8a4341faafa..3fe1d107a875 100644
> --- a/arch/x86/include/asm/pgtable-3level_types.h
> +++ b/arch/x86/include/asm/pgtable-3level_types.h
> @@ -43,5 +43,6 @@ typedef union {
>   */
>  #define PTRS_PER_PTE   512
>
> +#define MAX_POSSIBLE_PHYSMEM_BITS      36
>
>  #endif /* _ASM_X86_PGTABLE_3LEVEL_DEFS_H */
> diff --git a/arch/x86/include/asm/pgtable_64_types.h b/arch/x86/include/asm/pgtable_64_types.h
> index 06470da156ba..39075df30b8a 100644
> --- a/arch/x86/include/asm/pgtable_64_types.h
> +++ b/arch/x86/include/asm/pgtable_64_types.h
> @@ -39,6 +39,8 @@ typedef struct { pteval_t pte; } pte_t;
>  #define P4D_SIZE       (_AC(1, UL) << P4D_SHIFT)
>  #define P4D_MASK       (~(P4D_SIZE - 1))
>
> +#define MAX_POSSIBLE_PHYSMEM_BITS      52
> +
>  #else /* CONFIG_X86_5LEVEL */
>
>  /*
> diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
> index 7c38e850a8fc..7bde01c55c90 100644
> --- a/mm/zsmalloc.c
> +++ b/mm/zsmalloc.c
> @@ -82,18 +82,19 @@
>   * This is made more complicated by various memory models and PAE.
>   */
>
> -#ifndef MAX_PHYSMEM_BITS
> -#ifdef CONFIG_HIGHMEM64G
> -#define MAX_PHYSMEM_BITS 36
> -#else /* !CONFIG_HIGHMEM64G */
> +#ifndef MAX_POSSIBLE_PHYSMEM_BITS
> +#ifdef MAX_PHYSMEM_BITS
> +#define MAX_POSSIBLE_PHYSMEM_BITS MAX_PHYSMEM_BITS
> +#else
>  /*
>   * If this definition of MAX_PHYSMEM_BITS is used, OBJ_INDEX_BITS will just
>   * be PAGE_SHIFT
>   */
> -#define MAX_PHYSMEM_BITS BITS_PER_LONG
> +#define MAX_POSSIBLE_PHYSMEM_BITS BITS_PER_LONG
>  #endif
>  #endif
> -#define _PFN_BITS              (MAX_PHYSMEM_BITS - PAGE_SHIFT)
> +
> +#define _PFN_BITS              (MAX_POSSIBLE_PHYSMEM_BITS - PAGE_SHIFT)
>


I think we can avoid using this new constant in zsmalloc.

The reason for trying to save on MAX_PHYSMEM_BITS is just to gain more
bits for OBJ_INDEX_BITS which would reduce ZS_MIN_ALLOC_SIZE. However,
for all practical values of ZS_MAX_PAGES_PER_ZSPAGE, this min size
would remain 32 bytes.

So, we can unconditionally use MAX_PHYSMEM_BITS = BITS_PER_LONG and
thus OBJ_INDEX_BITS = PAGE_SHIFT.
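
To make the trade-off concrete, below is a simplified sketch of how
zsmalloc packs a <PFN, object index> pair into a single unsigned long;
it paraphrases mm/zsmalloc.c of that era rather than quoting it:

	/* The high _PFN_BITS hold the page frame number; the low bits
	 * hold the object index plus a tag bit.  Fewer PFN bits mean
	 * more index bits and hence a smaller ZS_MIN_ALLOC_SIZE.
	 */
	static unsigned long location_to_obj(unsigned long pfn,
					     unsigned long obj_idx)
	{
		unsigned long obj;

		obj = pfn << OBJ_INDEX_BITS;
		obj |= obj_idx & OBJ_INDEX_MASK;
		obj <<= OBJ_TAG_BITS;
		return obj;
	}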

- Nitin

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 2/6] mm/zsmalloc: Prepare to variable MAX_PHYSMEM_BITS
  2017-10-14  0:00     ` Nitin Gupta
@ 2017-10-16 14:44       ` Kirill A. Shutemov
  -1 siblings, 0 replies; 76+ messages in thread
From: Kirill A. Shutemov @ 2017-10-16 14:44 UTC (permalink / raw)
  To: Nitin Gupta
  Cc: Kirill A. Shutemov, Ingo Molnar, Linus Torvalds, x86,
	Thomas Gleixner, H. Peter Anvin, Andrew Morton, Andy Lutomirski,
	Cyrill Gorcunov, Borislav Petkov, linux-mm, linux-kernel,
	Minchan Kim, Sergey Senozhatsky

On Fri, Oct 13, 2017 at 05:00:12PM -0700, Nitin Gupta wrote:
> On Fri, Sep 29, 2017 at 7:08 AM, Kirill A. Shutemov
> <kirill.shutemov@linux.intel.com> wrote:
> > With boot-time switching between paging modes we will have a variable
> > MAX_PHYSMEM_BITS.
> >
> > Let's use the maximum value possible for the CONFIG_X86_5LEVEL=y
> > configuration to define zsmalloc data structures.
> >
> > The patch introduces MAX_POSSIBLE_PHYSMEM_BITS to cover such a case.
> > It also suits well for handling the PAE special case.
> >
> > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > Cc: Minchan Kim <minchan@kernel.org>
> > Cc: Nitin Gupta <ngupta@vflare.org>
> > Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
> > ---
> >  arch/x86/include/asm/pgtable-3level_types.h |  1 +
> >  arch/x86/include/asm/pgtable_64_types.h     |  2 ++
> >  mm/zsmalloc.c                               | 13 +++++++------
> >  3 files changed, 10 insertions(+), 6 deletions(-)
> >
> > diff --git a/arch/x86/include/asm/pgtable-3level_types.h b/arch/x86/include/asm/pgtable-3level_types.h
> > index b8a4341faafa..3fe1d107a875 100644
> > --- a/arch/x86/include/asm/pgtable-3level_types.h
> > +++ b/arch/x86/include/asm/pgtable-3level_types.h
> > @@ -43,5 +43,6 @@ typedef union {
> >   */
> >  #define PTRS_PER_PTE   512
> >
> > +#define MAX_POSSIBLE_PHYSMEM_BITS      36
> >
> >  #endif /* _ASM_X86_PGTABLE_3LEVEL_DEFS_H */
> > diff --git a/arch/x86/include/asm/pgtable_64_types.h b/arch/x86/include/asm/pgtable_64_types.h
> > index 06470da156ba..39075df30b8a 100644
> > --- a/arch/x86/include/asm/pgtable_64_types.h
> > +++ b/arch/x86/include/asm/pgtable_64_types.h
> > @@ -39,6 +39,8 @@ typedef struct { pteval_t pte; } pte_t;
> >  #define P4D_SIZE       (_AC(1, UL) << P4D_SHIFT)
> >  #define P4D_MASK       (~(P4D_SIZE - 1))
> >
> > +#define MAX_POSSIBLE_PHYSMEM_BITS      52
> > +
> >  #else /* CONFIG_X86_5LEVEL */
> >
> >  /*
> > diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
> > index 7c38e850a8fc..7bde01c55c90 100644
> > --- a/mm/zsmalloc.c
> > +++ b/mm/zsmalloc.c
> > @@ -82,18 +82,19 @@
> >   * This is made more complicated by various memory models and PAE.
> >   */
> >
> > -#ifndef MAX_PHYSMEM_BITS
> > -#ifdef CONFIG_HIGHMEM64G
> > -#define MAX_PHYSMEM_BITS 36
> > -#else /* !CONFIG_HIGHMEM64G */
> > +#ifndef MAX_POSSIBLE_PHYSMEM_BITS
> > +#ifdef MAX_PHYSMEM_BITS
> > +#define MAX_POSSIBLE_PHYSMEM_BITS MAX_PHYSMEM_BITS
> > +#else
> >  /*
> >   * If this definition of MAX_PHYSMEM_BITS is used, OBJ_INDEX_BITS will just
> >   * be PAGE_SHIFT
> >   */
> > -#define MAX_PHYSMEM_BITS BITS_PER_LONG
> > +#define MAX_POSSIBLE_PHYSMEM_BITS BITS_PER_LONG
> >  #endif
> >  #endif
> > -#define _PFN_BITS              (MAX_PHYSMEM_BITS - PAGE_SHIFT)
> > +
> > +#define _PFN_BITS              (MAX_POSSIBLE_PHYSMEM_BITS - PAGE_SHIFT)
> >
> 
> 
> I think we can avoid using this new constant in zsmalloc.
> 
> The reason for trying to save on MAX_PHYSMEM_BITS is just to gain more
> bits for OBJ_INDEX_BITS which would reduce ZS_MIN_ALLOC_SIZE. However,
> for all practical values of ZS_MAX_PAGES_PER_ZSPAGE, this min size
> would remain 32 bytes.
> 
> So, we can unconditionally use MAX_PHYSMEM_BITS = BITS_PER_LONG and
> thus OBJ_INDEX_BITS = PAGE_SHIFT.

As you understand the topic better than me, could you prepare the patch?

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 0/6] Boot-time switching between 4- and 5-level paging for 4.15, Part 1
  2017-10-03  8:27   ` Kirill A. Shutemov
@ 2017-10-17 15:42     ` Kirill A. Shutemov
  -1 siblings, 0 replies; 76+ messages in thread
From: Kirill A. Shutemov @ 2017-10-17 15:42 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Kirill A. Shutemov, Linus Torvalds, x86, Thomas Gleixner,
	H. Peter Anvin, Andrew Morton, Andy Lutomirski, Cyrill Gorcunov,
	Borislav Petkov, linux-mm, linux-kernel

On Tue, Oct 03, 2017 at 11:27:54AM +0300, Kirill A. Shutemov wrote:
> On Fri, Sep 29, 2017 at 05:08:15PM +0300, Kirill A. Shutemov wrote:
> > The first bunch of patches that prepare kernel to boot-time switching
> > between paging modes.
> > 
> > Please review and consider applying.
> 
> Ping?

Ingo, is there anything I can do to make the review easier for you?

I hoped to get the boot-time switching code into v4.15...

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 2/6] mm/zsmalloc: Prepare to variable MAX_PHYSMEM_BITS
  2017-10-16 14:44       ` Kirill A. Shutemov
@ 2017-10-18 23:39         ` Nitin Gupta
  -1 siblings, 0 replies; 76+ messages in thread
From: Nitin Gupta @ 2017-10-18 23:39 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Kirill A. Shutemov, Ingo Molnar, Linus Torvalds, x86,
	Thomas Gleixner, H. Peter Anvin, Andrew Morton, Andy Lutomirski,
	Cyrill Gorcunov, Borislav Petkov, linux-mm, linux-kernel,
	Minchan Kim, Sergey Senozhatsky

On Mon, Oct 16, 2017 at 7:44 AM, Kirill A. Shutemov
<kirill@shutemov.name> wrote:
> On Fri, Oct 13, 2017 at 05:00:12PM -0700, Nitin Gupta wrote:
>> On Fri, Sep 29, 2017 at 7:08 AM, Kirill A. Shutemov
>> <kirill.shutemov@linux.intel.com> wrote:
>> > With boot-time switching between paging modes we will have a variable
>> > MAX_PHYSMEM_BITS.
>> >
>> > Let's use the maximum value possible for the CONFIG_X86_5LEVEL=y
>> > configuration to define zsmalloc data structures.
>> >
>> > The patch introduces MAX_POSSIBLE_PHYSMEM_BITS to cover such a case.
>> > It also suits well for handling the PAE special case.
>> >
>> > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
>> > Cc: Minchan Kim <minchan@kernel.org>
>> > Cc: Nitin Gupta <ngupta@vflare.org>
>> > Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
>> > ---
>> >  arch/x86/include/asm/pgtable-3level_types.h |  1 +
>> >  arch/x86/include/asm/pgtable_64_types.h     |  2 ++
>> >  mm/zsmalloc.c                               | 13 +++++++------
>> >  3 files changed, 10 insertions(+), 6 deletions(-)
>> >
>> > diff --git a/arch/x86/include/asm/pgtable-3level_types.h b/arch/x86/include/asm/pgtable-3level_types.h
>> > index b8a4341faafa..3fe1d107a875 100644
>> > --- a/arch/x86/include/asm/pgtable-3level_types.h
>> > +++ b/arch/x86/include/asm/pgtable-3level_types.h
>> > @@ -43,5 +43,6 @@ typedef union {
>> >   */
>> >  #define PTRS_PER_PTE   512
>> >
>> > +#define MAX_POSSIBLE_PHYSMEM_BITS      36
>> >
>> >  #endif /* _ASM_X86_PGTABLE_3LEVEL_DEFS_H */
>> > diff --git a/arch/x86/include/asm/pgtable_64_types.h b/arch/x86/include/asm/pgtable_64_types.h
>> > index 06470da156ba..39075df30b8a 100644
>> > --- a/arch/x86/include/asm/pgtable_64_types.h
>> > +++ b/arch/x86/include/asm/pgtable_64_types.h
>> > @@ -39,6 +39,8 @@ typedef struct { pteval_t pte; } pte_t;
>> >  #define P4D_SIZE       (_AC(1, UL) << P4D_SHIFT)
>> >  #define P4D_MASK       (~(P4D_SIZE - 1))
>> >
>> > +#define MAX_POSSIBLE_PHYSMEM_BITS      52
>> > +
>> >  #else /* CONFIG_X86_5LEVEL */
>> >
>> >  /*
>> > diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
>> > index 7c38e850a8fc..7bde01c55c90 100644
>> > --- a/mm/zsmalloc.c
>> > +++ b/mm/zsmalloc.c
>> > @@ -82,18 +82,19 @@
>> >   * This is made more complicated by various memory models and PAE.
>> >   */
>> >
>> > -#ifndef MAX_PHYSMEM_BITS
>> > -#ifdef CONFIG_HIGHMEM64G
>> > -#define MAX_PHYSMEM_BITS 36
>> > -#else /* !CONFIG_HIGHMEM64G */
>> > +#ifndef MAX_POSSIBLE_PHYSMEM_BITS
>> > +#ifdef MAX_PHYSMEM_BITS
>> > +#define MAX_POSSIBLE_PHYSMEM_BITS MAX_PHYSMEM_BITS
>> > +#else
>> >  /*
>> >   * If this definition of MAX_PHYSMEM_BITS is used, OBJ_INDEX_BITS will just
>> >   * be PAGE_SHIFT
>> >   */
>> > -#define MAX_PHYSMEM_BITS BITS_PER_LONG
>> > +#define MAX_POSSIBLE_PHYSMEM_BITS BITS_PER_LONG
>> >  #endif
>> >  #endif
>> > -#define _PFN_BITS              (MAX_PHYSMEM_BITS - PAGE_SHIFT)
>> > +
>> > +#define _PFN_BITS              (MAX_POSSIBLE_PHYSMEM_BITS - PAGE_SHIFT)
>> >
>>
>>
>> I think we can avoid using this new constant in zsmalloc.
>>
>> The reason for trying to save on MAX_PHYSMEM_BITS is just to gain more
>> bits for OBJ_INDEX_BITS which would reduce ZS_MIN_ALLOC_SIZE. However,
>> for all practical values of ZS_MAX_PAGES_PER_ZSPAGE, this min size
>> would remain 32 bytes.
>>
>> So, we can unconditionally use MAX_PHYSMEM_BITS = BITS_PER_LONG and
>> thus OBJ_INDEX_BITS = PAGE_SHIFT.
>
> As you understand the topic better than me, could you prepare the patch?
>


Actually no changes are necessary.

As long as physical address bits <= BITS_PER_LONG, setting
_PFN_BITS to the most conservative value of BITS_PER_LONG is
fine. AFAIK, this condition does not hold on x86 PAE, where PA
bits (36) > BITS_PER_LONG (32), so only that case needs special
handling to make sure PFN bits are not lost when encoding an
allocated object's location in an unsigned long.
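
A quick worked example of the PAE case (assuming 4kB pages, i.e.
PAGE_SHIFT = 12):

	PFN bits actually needed:                  36 - 12 = 24
	_PFN_BITS derived from BITS_PER_LONG:      32 - 12 = 20  -> PFNs truncated
	_PFN_BITS from MAX_POSSIBLE_PHYSMEM_BITS:  36 - 12 = 24  -> PFNs fit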

Thanks,
Nitin

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 0/6] Boot-time switching between 4- and 5-level paging for 4.15, Part 1
  2017-10-17 15:42     ` Kirill A. Shutemov
@ 2017-10-20  8:18       ` Ingo Molnar
  -1 siblings, 0 replies; 76+ messages in thread
From: Ingo Molnar @ 2017-10-20  8:18 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Ingo Molnar, Kirill A. Shutemov, Linus Torvalds, x86,
	Thomas Gleixner, H. Peter Anvin, Andrew Morton, Andy Lutomirski,
	Cyrill Gorcunov, Borislav Petkov, linux-mm, linux-kernel


* Kirill A. Shutemov <kirill@shutemov.name> wrote:

> On Tue, Oct 03, 2017 at 11:27:54AM +0300, Kirill A. Shutemov wrote:
> > On Fri, Sep 29, 2017 at 05:08:15PM +0300, Kirill A. Shutemov wrote:
> > > The first bunch of patches that prepare kernel to boot-time switching
> > > between paging modes.
> > > 
> > > Please review and consider applying.
> > 
> > Ping?
> 
> Ingo, is there anything I can do to make the review easier for you?

Yeah, what is the conclusion on the sub-discussion of patch #2:

  [PATCH 2/6] mm/zsmalloc: Prepare to variable MAX_PHYSMEM_BITS

... do we want to skip it entirely and use the other 5 patches?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 0/6] Boot-time switching between 4- and 5-level paging for 4.15, Part 1
  2017-10-20  8:18       ` Ingo Molnar
@ 2017-10-20  9:41         ` Kirill A. Shutemov
  -1 siblings, 0 replies; 76+ messages in thread
From: Kirill A. Shutemov @ 2017-10-20  9:41 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Kirill A. Shutemov, Ingo Molnar, Linus Torvalds, x86,
	Thomas Gleixner, H. Peter Anvin, Andrew Morton, Andy Lutomirski,
	Cyrill Gorcunov, Borislav Petkov, linux-mm, linux-kernel

On Fri, Oct 20, 2017 at 08:18:53AM +0000, Ingo Molnar wrote:
> 
> * Kirill A. Shutemov <kirill@shutemov.name> wrote:
> 
> > On Tue, Oct 03, 2017 at 11:27:54AM +0300, Kirill A. Shutemov wrote:
> > > On Fri, Sep 29, 2017 at 05:08:15PM +0300, Kirill A. Shutemov wrote:
> > > > The first bunch of patches that prepare kernel to boot-time switching
> > > > between paging modes.
> > > > 
> > > > Please review and consider applying.
> > > 
> > > Ping?
> > 
> > Ingo, is there anything I can do to make the review easier for you?
> 
> Yeah, what is the conclusion on the sub-discussion of patch #2:
> 
>   [PATCH 2/6] mm/zsmalloc: Prepare to variable MAX_PHYSMEM_BITS
> 
> ... do we want to skip it entirely and use the other 5 patches?

Yes, please. MAX_PHYSMEM_BITS is not variable yet in this part of the
series.

And I will post some version of the patch in the next part, if it is
required.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 0/6] Boot-time switching between 4- and 5-level paging for 4.15, Part 1
  2017-10-20  8:18       ` Ingo Molnar
@ 2017-10-20  9:49         ` Minchan Kim
  -1 siblings, 0 replies; 76+ messages in thread
From: Minchan Kim @ 2017-10-20  9:49 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Kirill A. Shutemov, Ingo Molnar, Kirill A. Shutemov,
	Linus Torvalds, x86, Thomas Gleixner, H. Peter Anvin,
	Andrew Morton, Andy Lutomirski, Cyrill Gorcunov, Borislav Petkov,
	linux-mm, linux-kernel

Hi Ingo,

On Fri, Oct 20, 2017 at 10:18:53AM +0200, Ingo Molnar wrote:
> 
> * Kirill A. Shutemov <kirill@shutemov.name> wrote:
> 
> > On Tue, Oct 03, 2017 at 11:27:54AM +0300, Kirill A. Shutemov wrote:
> > > On Fri, Sep 29, 2017 at 05:08:15PM +0300, Kirill A. Shutemov wrote:
> > > > The first bunch of patches that prepare kernel to boot-time switching
> > > > between paging modes.
> > > > 
> > > > Please review and consider applying.
> > > 
> > > Ping?
> > 
> > Ingo, is there anything I can do to make the review easier for you?
> 
> Yeah, what is the conclusion on the sub-discussion of patch #2:
> 
>   [PATCH 2/6] mm/zsmalloc: Prepare to variable MAX_PHYSMEM_BITS
> 
> ... do we want to skip it entirely and use the other 5 patches?

Sorry for the much too late reply, Kirill.
Yes, you can skip it.

As Nitin said in that patch's thread, zsmalloc has assumed
_PFN_BITS is (BITS_PER_LONG - PAGE_SHIFT), so it already covers
X86_5LEVEL well, I think.

In summary, there is no need to change it.
I hope this helps to get the patch series merged.

Thanks.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 0/6] Boot-time switching between 4- and 5-level paging for 4.15, Part 1
  2017-10-20  9:49         ` Minchan Kim
@ 2017-10-20 12:18           ` Kirill A. Shutemov
  -1 siblings, 0 replies; 76+ messages in thread
From: Kirill A. Shutemov @ 2017-10-20 12:18 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Ingo Molnar, Ingo Molnar, Kirill A. Shutemov, Linus Torvalds,
	x86, Thomas Gleixner, H. Peter Anvin, Andrew Morton,
	Andy Lutomirski, Cyrill Gorcunov, Borislav Petkov, linux-mm,
	linux-kernel

On Fri, Oct 20, 2017 at 02:49:13AM -0700, Minchan Kim wrote:
> Hi Ingo,
> 
> On Fri, Oct 20, 2017 at 10:18:53AM +0200, Ingo Molnar wrote:
> > 
> > * Kirill A. Shutemov <kirill@shutemov.name> wrote:
> > 
> > > On Tue, Oct 03, 2017 at 11:27:54AM +0300, Kirill A. Shutemov wrote:
> > > > On Fri, Sep 29, 2017 at 05:08:15PM +0300, Kirill A. Shutemov wrote:
> > > > > The first bunch of patches that prepare kernel to boot-time switching
> > > > > between paging modes.
> > > > > 
> > > > > Please review and consider applying.
> > > > 
> > > > Ping?
> > > 
> > > Ingo, is there anything I can do to make the review easier for you?
> > 
> > Yeah, what is the conclusion on the sub-discussion of patch #2:
> > 
> >   [PATCH 2/6] mm/zsmalloc: Prepare to variable MAX_PHYSMEM_BITS
> > 
> > ... do we want to skip it entirely and use the other 5 patches?
> 
> Sorry for the much too late reply, Kirill.
> Yes, you can skip it.
>
> As Nitin said in that patch's thread, zsmalloc has assumed
> _PFN_BITS is (BITS_PER_LONG - PAGE_SHIFT), so it already covers
> X86_5LEVEL well, I think.
>
> In summary, there is no need to change it.
> I hope this helps to get the patch series merged.

Actually, no, we need something.

The problem is that later in the series[1] we make MAX_PHYSMEM_BITS
dynamic. It's not a simple constant anymore.

But zsmalloc uses it to define _PFN_BITS, which, through a few macros,
defines ZS_SIZE_CLASSES. ZS_SIZE_CLASSES is used to specify the size of
a field in 'struct zs_pool', and the build fails if it's not constant.
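
For reference, the macro chain looks roughly like this (paraphrased from
mm/zsmalloc.c around this series; treat the exact names as approximate):

	#define _PFN_BITS	(MAX_PHYSMEM_BITS - PAGE_SHIFT)
	#define OBJ_INDEX_BITS	(BITS_PER_LONG - _PFN_BITS - OBJ_TAG_BITS)
	#define ZS_MIN_ALLOC_SIZE \
		MAX(32, (ZS_MAX_PAGES_PER_ZSPAGE << PAGE_SHIFT >> OBJ_INDEX_BITS))
	#define ZS_SIZE_CLASSES	(DIV_ROUND_UP(ZS_MAX_ALLOC_SIZE - \
					ZS_MIN_ALLOC_SIZE, ZS_SIZE_CLASS_DELTA) + 1)

	struct zs_pool {
		...
		/* the array bound must be a compile-time constant */
		struct size_class *size_class[ZS_SIZE_CLASSES];
		...
	};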

My patch addresses this, but there is more than one solution to the
problem.

Which way do you prefer to get it fixed?

[1] https://git.kernel.org/pub/scm/linux/kernel/git/kas/linux.git/commit/?h=la57/boot-switching/v8&id=57f669244fab9081a4343b59373ff43170ef328f

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 76+ messages in thread

* [tip:x86/mm] mm/sparsemem: Allocate mem_section at runtime for CONFIG_SPARSEMEM_EXTREME=y
  2017-09-29 14:08   ` Kirill A. Shutemov
  (?)
@ 2017-10-20 12:27   ` tip-bot for Kirill A. Shutemov
  2017-11-02 12:31     ` Sudeep Holla
  -1 siblings, 1 reply; 76+ messages in thread
From: tip-bot for Kirill A. Shutemov @ 2017-10-20 12:27 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: mingo, tglx, kirill.shutemov, peterz, hpa, akpm, gorcunov,
	linux-kernel, bp, luto, torvalds

Commit-ID:  83e3c48729d9ebb7af5a31a504f3fd6aff0348c4
Gitweb:     https://git.kernel.org/tip/83e3c48729d9ebb7af5a31a504f3fd6aff0348c4
Author:     Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
AuthorDate: Fri, 29 Sep 2017 17:08:16 +0300
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Fri, 20 Oct 2017 13:07:09 +0200

mm/sparsemem: Allocate mem_section at runtime for CONFIG_SPARSEMEM_EXTREME=y

The size of the mem_section[] array depends on the size of the physical address space.

In preparation for boot-time switching between paging modes on x86-64
we need to make the allocation of mem_section[] dynamic, because otherwise
we waste a lot of RAM: with CONFIG_NODES_SHIFT=10, mem_section[] size is 32kB
for 4-level paging and 2MB for 5-level paging mode.

The patch allocates the array on the first call to sparse_memory_present_with_active_regions().
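
The 32kB and 2MB figures can be sanity-checked from the SPARSEMEM
geometry. The arithmetic below assumes the x86-64 defaults of
SECTION_SIZE_BITS == 27 and sizeof(struct mem_section) == 32:

	SECTIONS_PER_ROOT = PAGE_SIZE / sizeof(struct mem_section)
	                  = 4096 / 32 = 128
	4-level: 2^(46 - 27) sections / 128 =   4096 roots * 8 bytes = 32kB
	5-level: 2^(52 - 27) sections / 128 = 262144 roots * 8 bytes =  2MB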

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@suse.de>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20170929140821.37654-2-kirill.shutemov@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 include/linux/mmzone.h |  6 +++++-
 mm/page_alloc.c        | 10 ++++++++++
 mm/sparse.c            | 17 +++++++++++------
 3 files changed, 26 insertions(+), 7 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index c8f8941..e796edf 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -1150,13 +1150,17 @@ struct mem_section {
 #define SECTION_ROOT_MASK	(SECTIONS_PER_ROOT - 1)
 
 #ifdef CONFIG_SPARSEMEM_EXTREME
-extern struct mem_section *mem_section[NR_SECTION_ROOTS];
+extern struct mem_section **mem_section;
 #else
 extern struct mem_section mem_section[NR_SECTION_ROOTS][SECTIONS_PER_ROOT];
 #endif
 
 static inline struct mem_section *__nr_to_section(unsigned long nr)
 {
+#ifdef CONFIG_SPARSEMEM_EXTREME
+	if (!mem_section)
+		return NULL;
+#endif
 	if (!mem_section[SECTION_NR_TO_ROOT(nr)])
 		return NULL;
 	return &mem_section[SECTION_NR_TO_ROOT(nr)][nr & SECTION_ROOT_MASK];
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 77e4d3c..8dfd13f 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5646,6 +5646,16 @@ void __init sparse_memory_present_with_active_regions(int nid)
 	unsigned long start_pfn, end_pfn;
 	int i, this_nid;
 
+#ifdef CONFIG_SPARSEMEM_EXTREME
+	if (!mem_section) {
+		unsigned long size, align;
+
+		size = sizeof(struct mem_section) * NR_SECTION_ROOTS;
+		align = 1 << (INTERNODE_CACHE_SHIFT);
+		mem_section = memblock_virt_alloc(size, align);
+	}
+#endif
+
 	for_each_mem_pfn_range(i, nid, &start_pfn, &end_pfn, &this_nid)
 		memory_present(this_nid, start_pfn, end_pfn);
 }
diff --git a/mm/sparse.c b/mm/sparse.c
index 83b3bf6..b00a973 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -22,8 +22,7 @@
  * 1) mem_section	- memory sections, mem_map's for valid memory
  */
 #ifdef CONFIG_SPARSEMEM_EXTREME
-struct mem_section *mem_section[NR_SECTION_ROOTS]
-	____cacheline_internodealigned_in_smp;
+struct mem_section **mem_section;
 #else
 struct mem_section mem_section[NR_SECTION_ROOTS][SECTIONS_PER_ROOT]
 	____cacheline_internodealigned_in_smp;
@@ -100,7 +99,7 @@ static inline int sparse_index_init(unsigned long section_nr, int nid)
 int __section_nr(struct mem_section* ms)
 {
 	unsigned long root_nr;
-	struct mem_section* root;
+	struct mem_section *root = NULL;
 
 	for (root_nr = 0; root_nr < NR_SECTION_ROOTS; root_nr++) {
 		root = __nr_to_section(root_nr * SECTIONS_PER_ROOT);
@@ -111,7 +110,7 @@ int __section_nr(struct mem_section* ms)
 		     break;
 	}
 
-	VM_BUG_ON(root_nr == NR_SECTION_ROOTS);
+	VM_BUG_ON(!root);
 
 	return (root_nr * SECTIONS_PER_ROOT) + (ms - root);
 }
@@ -329,11 +328,17 @@ again:
 static void __init check_usemap_section_nr(int nid, unsigned long *usemap)
 {
 	unsigned long usemap_snr, pgdat_snr;
-	static unsigned long old_usemap_snr = NR_MEM_SECTIONS;
-	static unsigned long old_pgdat_snr = NR_MEM_SECTIONS;
+	static unsigned long old_usemap_snr;
+	static unsigned long old_pgdat_snr;
 	struct pglist_data *pgdat = NODE_DATA(nid);
 	int usemap_nid;
 
+	/* First call */
+	if (!old_usemap_snr) {
+		old_usemap_snr = NR_MEM_SECTIONS;
+		old_pgdat_snr = NR_MEM_SECTIONS;
+	}
+
 	usemap_snr = pfn_to_section_nr(__pa(usemap) >> PAGE_SHIFT);
 	pgdat_snr = pfn_to_section_nr(__pa(pgdat) >> PAGE_SHIFT);
 	if (usemap_snr == pgdat_snr)

^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [tip:x86/mm] x86/kasan: Use the same shadow offset for 4- and 5-level paging
  2017-09-29 14:08   ` Kirill A. Shutemov
  (?)
@ 2017-10-20 12:28   ` tip-bot for Andrey Ryabinin
  -1 siblings, 0 replies; 76+ messages in thread
From: tip-bot for Andrey Ryabinin @ 2017-10-20 12:28 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: aryabinin, luto, mingo, tglx, hpa, gorcunov, akpm,
	kirill.shutemov, linux-kernel, peterz, torvalds, bp

Commit-ID:  12a8cc7fcf54a8575f094be1e99032ec38aa045c
Gitweb:     https://git.kernel.org/tip/12a8cc7fcf54a8575f094be1e99032ec38aa045c
Author:     Andrey Ryabinin <aryabinin@virtuozzo.com>
AuthorDate: Fri, 29 Sep 2017 17:08:18 +0300
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Fri, 20 Oct 2017 13:07:09 +0200

x86/kasan: Use the same shadow offset for 4- and 5-level paging

We are going to support boot-time switching between 4- and 5-level
paging. For KASAN it means we cannot have different KASAN_SHADOW_OFFSET
for different paging modes: the constant is passed to gcc to generate
code and cannot be changed at runtime.

This patch changes the KASAN code to use 0xdffffc0000000000 as the shadow
offset for both 4- and 5-level paging.

For 5-level paging it means that the shadow memory region is not aligned
to the PGD boundary anymore and we have to handle unaligned parts of the
region properly.
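
With generic KASAN the shadow address is computed as
shadow(addr) = (addr >> 3) + KASAN_SHADOW_OFFSET, so with the offset
fixed at 0xdffffc0000000000 the top of the address space maps to

	(0xffffffffffffffff >> 3) + 0xdffffc0000000000 = 0xfffffbffffffffff

which is why the shadow region ends just below fffffc0000000000 in both
paging modes (see the mm.txt hunk below).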

In addition, we have to exclude paravirt code from KASAN instrumentation
as we now use set_pgd() before KASAN is fully ready.

[kirill.shutemov@linux.intel.com: cleanup, changelog message]
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@suse.de>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20170929140821.37654-4-kirill.shutemov@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 Documentation/x86/x86_64/mm.txt |   2 +-
 arch/x86/Kconfig                |   1 -
 arch/x86/kernel/Makefile        |   3 +-
 arch/x86/mm/kasan_init_64.c     | 101 +++++++++++++++++++++++++++++++---------
 4 files changed, 83 insertions(+), 24 deletions(-)

diff --git a/Documentation/x86/x86_64/mm.txt b/Documentation/x86/x86_64/mm.txt
index b0798e2..3448e67 100644
--- a/Documentation/x86/x86_64/mm.txt
+++ b/Documentation/x86/x86_64/mm.txt
@@ -34,7 +34,7 @@ ff92000000000000 - ffd1ffffffffffff (=54 bits) vmalloc/ioremap space
 ffd2000000000000 - ffd3ffffffffffff (=49 bits) hole
 ffd4000000000000 - ffd5ffffffffffff (=49 bits) virtual memory map (512TB)
 ... unused hole ...
-ffd8000000000000 - fff7ffffffffffff (=53 bits) kasan shadow memory (8PB)
+ffdf000000000000 - fffffc0000000000 (=53 bits) kasan shadow memory (8PB)
 ... unused hole ...
 ffffff0000000000 - ffffff7fffffffff (=39 bits) %esp fixup stacks
 ... unused hole ...
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 971feac..32779be 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -302,7 +302,6 @@ config ARCH_SUPPORTS_DEBUG_PAGEALLOC
 config KASAN_SHADOW_OFFSET
 	hex
 	depends on KASAN
-	default 0xdff8000000000000 if X86_5LEVEL
 	default 0xdffffc0000000000
 
 config HAVE_INTEL_TXT
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index fd0a789..a97a6b6 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -24,7 +24,8 @@ endif
 KASAN_SANITIZE_head$(BITS).o				:= n
 KASAN_SANITIZE_dumpstack.o				:= n
 KASAN_SANITIZE_dumpstack_$(BITS).o			:= n
-KASAN_SANITIZE_stacktrace.o := n
+KASAN_SANITIZE_stacktrace.o				:= n
+KASAN_SANITIZE_paravirt.o				:= n
 
 OBJECT_FILES_NON_STANDARD_head_$(BITS).o		:= y
 OBJECT_FILES_NON_STANDARD_relocate_kernel_$(BITS).o	:= y
diff --git a/arch/x86/mm/kasan_init_64.c b/arch/x86/mm/kasan_init_64.c
index bc84b73..fe5760d 100644
--- a/arch/x86/mm/kasan_init_64.c
+++ b/arch/x86/mm/kasan_init_64.c
@@ -15,6 +15,8 @@
 
 extern struct range pfn_mapped[E820_MAX_ENTRIES];
 
+static p4d_t tmp_p4d_table[PTRS_PER_P4D] __initdata __aligned(PAGE_SIZE);
+
 static int __init map_range(struct range *range)
 {
 	unsigned long start;
@@ -30,8 +32,10 @@ static void __init clear_pgds(unsigned long start,
 			unsigned long end)
 {
 	pgd_t *pgd;
+	/* See comment in kasan_init() */
+	unsigned long pgd_end = end & PGDIR_MASK;
 
-	for (; start < end; start += PGDIR_SIZE) {
+	for (; start < pgd_end; start += PGDIR_SIZE) {
 		pgd = pgd_offset_k(start);
 		/*
 		 * With folded p4d, pgd_clear() is nop, use p4d_clear()
@@ -42,29 +46,61 @@ static void __init clear_pgds(unsigned long start,
 		else
 			pgd_clear(pgd);
 	}
+
+	pgd = pgd_offset_k(start);
+	for (; start < end; start += P4D_SIZE)
+		p4d_clear(p4d_offset(pgd, start));
+}
+
+static inline p4d_t *early_p4d_offset(pgd_t *pgd, unsigned long addr)
+{
+	unsigned long p4d;
+
+	if (!IS_ENABLED(CONFIG_X86_5LEVEL))
+		return (p4d_t *)pgd;
+
+	p4d = __pa_nodebug(pgd_val(*pgd)) & PTE_PFN_MASK;
+	p4d += __START_KERNEL_map - phys_base;
+	return (p4d_t *)p4d + p4d_index(addr);
+}
+
+static void __init kasan_early_p4d_populate(pgd_t *pgd,
+		unsigned long addr,
+		unsigned long end)
+{
+	pgd_t pgd_entry;
+	p4d_t *p4d, p4d_entry;
+	unsigned long next;
+
+	if (pgd_none(*pgd)) {
+		pgd_entry = __pgd(_KERNPG_TABLE | __pa_nodebug(kasan_zero_p4d));
+		set_pgd(pgd, pgd_entry);
+	}
+
+	p4d = early_p4d_offset(pgd, addr);
+	do {
+		next = p4d_addr_end(addr, end);
+
+		if (!p4d_none(*p4d))
+			continue;
+
+		p4d_entry = __p4d(_KERNPG_TABLE | __pa_nodebug(kasan_zero_pud));
+		set_p4d(p4d, p4d_entry);
+	} while (p4d++, addr = next, addr != end && p4d_none(*p4d));
 }
 
 static void __init kasan_map_early_shadow(pgd_t *pgd)
 {
-	int i;
-	unsigned long start = KASAN_SHADOW_START;
+	/* See comment in kasan_init() */
+	unsigned long addr = KASAN_SHADOW_START & PGDIR_MASK;
 	unsigned long end = KASAN_SHADOW_END;
+	unsigned long next;
 
-	for (i = pgd_index(start); start < end; i++) {
-		switch (CONFIG_PGTABLE_LEVELS) {
-		case 4:
-			pgd[i] = __pgd(__pa_nodebug(kasan_zero_pud) |
-					_KERNPG_TABLE);
-			break;
-		case 5:
-			pgd[i] = __pgd(__pa_nodebug(kasan_zero_p4d) |
-					_KERNPG_TABLE);
-			break;
-		default:
-			BUILD_BUG();
-		}
-		start += PGDIR_SIZE;
-	}
+	pgd += pgd_index(addr);
+	do {
+		next = pgd_addr_end(addr, end);
+		kasan_early_p4d_populate(pgd, addr, next);
+	} while (pgd++, addr = next, addr != end);
 }
 
 #ifdef CONFIG_KASAN_INLINE
@@ -101,7 +137,7 @@ void __init kasan_early_init(void)
 	for (i = 0; i < PTRS_PER_PUD; i++)
 		kasan_zero_pud[i] = __pud(pud_val);
 
-	for (i = 0; CONFIG_PGTABLE_LEVELS >= 5 && i < PTRS_PER_P4D; i++)
+	for (i = 0; IS_ENABLED(CONFIG_X86_5LEVEL) && i < PTRS_PER_P4D; i++)
 		kasan_zero_p4d[i] = __p4d(p4d_val);
 
 	kasan_map_early_shadow(early_top_pgt);
@@ -117,12 +153,35 @@ void __init kasan_init(void)
 #endif
 
 	memcpy(early_top_pgt, init_top_pgt, sizeof(early_top_pgt));
+
+	/*
+	 * We use the same shadow offset for 4- and 5-level paging to
+	 * facilitate boot-time switching between paging modes.
+	 * As a result, in 5-level paging mode KASAN_SHADOW_START and
+	 * KASAN_SHADOW_END are not aligned to PGD boundary.
+	 *
+	 * KASAN_SHADOW_START doesn't share PGD with anything else.
+	 * We claim whole PGD entry to make things easier.
+	 *
+	 * KASAN_SHADOW_END lands in the last PGD entry and it collides with
+	 * a bunch of things like kernel code, modules, EFI mapping, etc.
+	 * We need to take extra steps to not overwrite them.
+	 */
+	if (IS_ENABLED(CONFIG_X86_5LEVEL)) {
+		void *ptr;
+
+		ptr = (void *)pgd_page_vaddr(*pgd_offset_k(KASAN_SHADOW_END));
+		memcpy(tmp_p4d_table, (void *)ptr, sizeof(tmp_p4d_table));
+		set_pgd(&early_top_pgt[pgd_index(KASAN_SHADOW_END)],
+				__pgd(__pa(tmp_p4d_table) | _KERNPG_TABLE));
+	}
+
 	load_cr3(early_top_pgt);
 	__flush_tlb_all();
 
-	clear_pgds(KASAN_SHADOW_START, KASAN_SHADOW_END);
+	clear_pgds(KASAN_SHADOW_START & PGDIR_MASK, KASAN_SHADOW_END);
 
-	kasan_populate_zero_shadow((void *)KASAN_SHADOW_START,
+	kasan_populate_zero_shadow((void *)(KASAN_SHADOW_START & PGDIR_MASK),
 			kasan_mem_to_shadow((void *)PAGE_OFFSET));
 
 	for (i = 0; i < E820_MAX_ENTRIES; i++) {

^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [tip:x86/mm] x86/xen: Provide pre-built page tables only for CONFIG_XEN_PV=y and CONFIG_XEN_PVH=y
  2017-09-29 14:08   ` Kirill A. Shutemov
  (?)
@ 2017-10-20 12:28   ` tip-bot for Kirill A. Shutemov
  -1 siblings, 0 replies; 76+ messages in thread
From: tip-bot for Kirill A. Shutemov @ 2017-10-20 12:28 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: gorcunov, torvalds, tglx, mingo, peterz, bp, jgross,
	linux-kernel, kirill.shutemov, luto, hpa, akpm

Commit-ID:  4375c29985f155d7eb2346615d84e62d1b673682
Gitweb:     https://git.kernel.org/tip/4375c29985f155d7eb2346615d84e62d1b673682
Author:     Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
AuthorDate: Fri, 29 Sep 2017 17:08:19 +0300
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Fri, 20 Oct 2017 13:07:10 +0200

x86/xen: Provide pre-built page tables only for CONFIG_XEN_PV=y and CONFIG_XEN_PVH=y

Looks like we only need pre-built page tables in the CONFIG_XEN_PV=y and
CONFIG_XEN_PVH=y cases.

Let's not provide them for other configurations.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@suse.de>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20170929140821.37654-5-kirill.shutemov@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/kernel/head_64.S | 11 ++++++-----
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
index 513cbb0..2be7d1e 100644
--- a/arch/x86/kernel/head_64.S
+++ b/arch/x86/kernel/head_64.S
@@ -37,11 +37,12 @@
  *
  */
 
-#define p4d_index(x)	(((x) >> P4D_SHIFT) & (PTRS_PER_P4D-1))
 #define pud_index(x)	(((x) >> PUD_SHIFT) & (PTRS_PER_PUD-1))
 
+#if defined(CONFIG_XEN_PV) || defined(CONFIG_XEN_PVH)
 PGD_PAGE_OFFSET = pgd_index(__PAGE_OFFSET_BASE)
 PGD_START_KERNEL = pgd_index(__START_KERNEL_map)
+#endif
 L3_START_KERNEL = pud_index(__START_KERNEL_map)
 
 	.text
@@ -361,10 +362,7 @@ NEXT_PAGE(early_dynamic_pgts)
 
 	.data
 
-#ifndef CONFIG_XEN
-NEXT_PAGE(init_top_pgt)
-	.fill	512,8,0
-#else
+#if defined(CONFIG_XEN_PV) || defined(CONFIG_XEN_PVH)
 NEXT_PAGE(init_top_pgt)
 	.quad   level3_ident_pgt - __START_KERNEL_map + _KERNPG_TABLE_NOENC
 	.org    init_top_pgt + PGD_PAGE_OFFSET*8, 0
@@ -381,6 +379,9 @@ NEXT_PAGE(level2_ident_pgt)
 	 * Don't set NX because code runs from these pages.
 	 */
 	PMDS(0, __PAGE_KERNEL_IDENT_LARGE_EXEC, PTRS_PER_PMD)
+#else
+NEXT_PAGE(init_top_pgt)
+	.fill	512,8,0
 #endif
 
 #ifdef CONFIG_X86_5LEVEL

^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [tip:x86/mm] x86/xen: Drop 5-level paging support code from the XEN_PV code
  2017-09-29 14:08   ` Kirill A. Shutemov
  (?)
@ 2017-10-20 12:29   ` tip-bot for Kirill A. Shutemov
  -1 siblings, 0 replies; 76+ messages in thread
From: tip-bot for Kirill A. Shutemov @ 2017-10-20 12:29 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: hpa, gorcunov, luto, kirill.shutemov, peterz, bp, jgross,
	linux-kernel, akpm, torvalds, mingo, tglx

Commit-ID:  773dd2fca581b0a80e5a33332cc8ee67e5a79cba
Gitweb:     https://git.kernel.org/tip/773dd2fca581b0a80e5a33332cc8ee67e5a79cba
Author:     Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
AuthorDate: Fri, 29 Sep 2017 17:08:20 +0300
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Fri, 20 Oct 2017 13:07:10 +0200

x86/xen: Drop 5-level paging support code from the XEN_PV code

It was decided that 5-level paging is not going to be supported in XEN_PV.

Let's drop the dead code from the XEN_PV code.

Tested-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@suse.de>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20170929140821.37654-6-kirill.shutemov@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/xen/mmu_pv.c | 159 +++++++++++++++++++-------------------------------
 1 file changed, 60 insertions(+), 99 deletions(-)

diff --git a/arch/x86/xen/mmu_pv.c b/arch/x86/xen/mmu_pv.c
index 71495f1..2ccdaba 100644
--- a/arch/x86/xen/mmu_pv.c
+++ b/arch/x86/xen/mmu_pv.c
@@ -449,7 +449,7 @@ __visible pmd_t xen_make_pmd(pmdval_t pmd)
 }
 PV_CALLEE_SAVE_REGS_THUNK(xen_make_pmd);
 
-#if CONFIG_PGTABLE_LEVELS == 4
+#ifdef CONFIG_X86_64
 __visible pudval_t xen_pud_val(pud_t pud)
 {
 	return pte_mfn_to_pfn(pud.pud);
@@ -538,7 +538,7 @@ static void xen_set_p4d(p4d_t *ptr, p4d_t val)
 
 	xen_mc_issue(PARAVIRT_LAZY_MMU);
 }
-#endif	/* CONFIG_PGTABLE_LEVELS == 4 */
+#endif	/* CONFIG_X86_64 */
 
 static int xen_pmd_walk(struct mm_struct *mm, pmd_t *pmd,
 		int (*func)(struct mm_struct *mm, struct page *, enum pt_level),
@@ -580,21 +580,17 @@ static int xen_p4d_walk(struct mm_struct *mm, p4d_t *p4d,
 		int (*func)(struct mm_struct *mm, struct page *, enum pt_level),
 		bool last, unsigned long limit)
 {
-	int i, nr, flush = 0;
+	int flush = 0;
+	pud_t *pud;
 
-	nr = last ? p4d_index(limit) + 1 : PTRS_PER_P4D;
-	for (i = 0; i < nr; i++) {
-		pud_t *pud;
 
-		if (p4d_none(p4d[i]))
-			continue;
+	if (p4d_none(*p4d))
+		return flush;
 
-		pud = pud_offset(&p4d[i], 0);
-		if (PTRS_PER_PUD > 1)
-			flush |= (*func)(mm, virt_to_page(pud), PT_PUD);
-		flush |= xen_pud_walk(mm, pud, func,
-				last && i == nr - 1, limit);
-	}
+	pud = pud_offset(p4d, 0);
+	if (PTRS_PER_PUD > 1)
+		flush |= (*func)(mm, virt_to_page(pud), PT_PUD);
+	flush |= xen_pud_walk(mm, pud, func, last, limit);
 	return flush;
 }
 
@@ -644,8 +640,6 @@ static int __xen_pgd_walk(struct mm_struct *mm, pgd_t *pgd,
 			continue;
 
 		p4d = p4d_offset(&pgd[i], 0);
-		if (PTRS_PER_P4D > 1)
-			flush |= (*func)(mm, virt_to_page(p4d), PT_P4D);
 		flush |= xen_p4d_walk(mm, p4d, func, i == nr - 1, limit);
 	}
 
@@ -1176,22 +1170,14 @@ static void __init xen_cleanmfnmap(unsigned long vaddr)
 {
 	pgd_t *pgd;
 	p4d_t *p4d;
-	unsigned int i;
 	bool unpin;
 
 	unpin = (vaddr == 2 * PGDIR_SIZE);
 	vaddr &= PMD_MASK;
 	pgd = pgd_offset_k(vaddr);
 	p4d = p4d_offset(pgd, 0);
-	for (i = 0; i < PTRS_PER_P4D; i++) {
-		if (p4d_none(p4d[i]))
-			continue;
-		xen_cleanmfnmap_p4d(p4d + i, unpin);
-	}
-	if (IS_ENABLED(CONFIG_X86_5LEVEL)) {
-		set_pgd(pgd, __pgd(0));
-		xen_cleanmfnmap_free_pgtbl(p4d, unpin);
-	}
+	if (!p4d_none(*p4d))
+		xen_cleanmfnmap_p4d(p4d, unpin);
 }
 
 static void __init xen_pagetable_p2m_free(void)
@@ -1692,7 +1678,7 @@ static void xen_release_pmd(unsigned long pfn)
 	xen_release_ptpage(pfn, PT_PMD);
 }
 
-#if CONFIG_PGTABLE_LEVELS >= 4
+#ifdef CONFIG_X86_64
 static void xen_alloc_pud(struct mm_struct *mm, unsigned long pfn)
 {
 	xen_alloc_ptpage(mm, pfn, PT_PUD);
@@ -2029,13 +2015,12 @@ static phys_addr_t __init xen_early_virt_to_phys(unsigned long vaddr)
  */
 void __init xen_relocate_p2m(void)
 {
-	phys_addr_t size, new_area, pt_phys, pmd_phys, pud_phys, p4d_phys;
+	phys_addr_t size, new_area, pt_phys, pmd_phys, pud_phys;
 	unsigned long p2m_pfn, p2m_pfn_end, n_frames, pfn, pfn_end;
-	int n_pte, n_pt, n_pmd, n_pud, n_p4d, idx_pte, idx_pt, idx_pmd, idx_pud, idx_p4d;
+	int n_pte, n_pt, n_pmd, n_pud, idx_pte, idx_pt, idx_pmd, idx_pud;
 	pte_t *pt;
 	pmd_t *pmd;
 	pud_t *pud;
-	p4d_t *p4d = NULL;
 	pgd_t *pgd;
 	unsigned long *new_p2m;
 	int save_pud;
@@ -2045,11 +2030,7 @@ void __init xen_relocate_p2m(void)
 	n_pt = roundup(size, PMD_SIZE) >> PMD_SHIFT;
 	n_pmd = roundup(size, PUD_SIZE) >> PUD_SHIFT;
 	n_pud = roundup(size, P4D_SIZE) >> P4D_SHIFT;
-	if (PTRS_PER_P4D > 1)
-		n_p4d = roundup(size, PGDIR_SIZE) >> PGDIR_SHIFT;
-	else
-		n_p4d = 0;
-	n_frames = n_pte + n_pt + n_pmd + n_pud + n_p4d;
+	n_frames = n_pte + n_pt + n_pmd + n_pud;
 
 	new_area = xen_find_free_area(PFN_PHYS(n_frames));
 	if (!new_area) {
@@ -2065,76 +2046,56 @@ void __init xen_relocate_p2m(void)
 	 * To avoid any possible virtual address collision, just use
 	 * 2 * PUD_SIZE for the new area.
 	 */
-	p4d_phys = new_area;
-	pud_phys = p4d_phys + PFN_PHYS(n_p4d);
+	pud_phys = new_area;
 	pmd_phys = pud_phys + PFN_PHYS(n_pud);
 	pt_phys = pmd_phys + PFN_PHYS(n_pmd);
 	p2m_pfn = PFN_DOWN(pt_phys) + n_pt;
 
 	pgd = __va(read_cr3_pa());
 	new_p2m = (unsigned long *)(2 * PGDIR_SIZE);
-	idx_p4d = 0;
 	save_pud = n_pud;
-	do {
-		if (n_p4d > 0) {
-			p4d = early_memremap(p4d_phys, PAGE_SIZE);
-			clear_page(p4d);
-			n_pud = min(save_pud, PTRS_PER_P4D);
-		}
-		for (idx_pud = 0; idx_pud < n_pud; idx_pud++) {
-			pud = early_memremap(pud_phys, PAGE_SIZE);
-			clear_page(pud);
-			for (idx_pmd = 0; idx_pmd < min(n_pmd, PTRS_PER_PUD);
-				 idx_pmd++) {
-				pmd = early_memremap(pmd_phys, PAGE_SIZE);
-				clear_page(pmd);
-				for (idx_pt = 0; idx_pt < min(n_pt, PTRS_PER_PMD);
-					 idx_pt++) {
-					pt = early_memremap(pt_phys, PAGE_SIZE);
-					clear_page(pt);
-					for (idx_pte = 0;
-						 idx_pte < min(n_pte, PTRS_PER_PTE);
-						 idx_pte++) {
-						set_pte(pt + idx_pte,
-								pfn_pte(p2m_pfn, PAGE_KERNEL));
-						p2m_pfn++;
-					}
-					n_pte -= PTRS_PER_PTE;
-					early_memunmap(pt, PAGE_SIZE);
-					make_lowmem_page_readonly(__va(pt_phys));
-					pin_pagetable_pfn(MMUEXT_PIN_L1_TABLE,
-							PFN_DOWN(pt_phys));
-					set_pmd(pmd + idx_pt,
-							__pmd(_PAGE_TABLE | pt_phys));
-					pt_phys += PAGE_SIZE;
+	for (idx_pud = 0; idx_pud < n_pud; idx_pud++) {
+		pud = early_memremap(pud_phys, PAGE_SIZE);
+		clear_page(pud);
+		for (idx_pmd = 0; idx_pmd < min(n_pmd, PTRS_PER_PUD);
+				idx_pmd++) {
+			pmd = early_memremap(pmd_phys, PAGE_SIZE);
+			clear_page(pmd);
+			for (idx_pt = 0; idx_pt < min(n_pt, PTRS_PER_PMD);
+					idx_pt++) {
+				pt = early_memremap(pt_phys, PAGE_SIZE);
+				clear_page(pt);
+				for (idx_pte = 0;
+						idx_pte < min(n_pte, PTRS_PER_PTE);
+						idx_pte++) {
+					set_pte(pt + idx_pte,
+							pfn_pte(p2m_pfn, PAGE_KERNEL));
+					p2m_pfn++;
 				}
-				n_pt -= PTRS_PER_PMD;
-				early_memunmap(pmd, PAGE_SIZE);
-				make_lowmem_page_readonly(__va(pmd_phys));
-				pin_pagetable_pfn(MMUEXT_PIN_L2_TABLE,
-						PFN_DOWN(pmd_phys));
-				set_pud(pud + idx_pmd, __pud(_PAGE_TABLE | pmd_phys));
-				pmd_phys += PAGE_SIZE;
+				n_pte -= PTRS_PER_PTE;
+				early_memunmap(pt, PAGE_SIZE);
+				make_lowmem_page_readonly(__va(pt_phys));
+				pin_pagetable_pfn(MMUEXT_PIN_L1_TABLE,
+						PFN_DOWN(pt_phys));
+				set_pmd(pmd + idx_pt,
+						__pmd(_PAGE_TABLE | pt_phys));
+				pt_phys += PAGE_SIZE;
 			}
-			n_pmd -= PTRS_PER_PUD;
-			early_memunmap(pud, PAGE_SIZE);
-			make_lowmem_page_readonly(__va(pud_phys));
-			pin_pagetable_pfn(MMUEXT_PIN_L3_TABLE, PFN_DOWN(pud_phys));
-			if (n_p4d > 0)
-				set_p4d(p4d + idx_pud, __p4d(_PAGE_TABLE | pud_phys));
-			else
-				set_pgd(pgd + 2 + idx_pud, __pgd(_PAGE_TABLE | pud_phys));
-			pud_phys += PAGE_SIZE;
-		}
-		if (n_p4d > 0) {
-			save_pud -= PTRS_PER_P4D;
-			early_memunmap(p4d, PAGE_SIZE);
-			make_lowmem_page_readonly(__va(p4d_phys));
-			pin_pagetable_pfn(MMUEXT_PIN_L4_TABLE, PFN_DOWN(p4d_phys));
-			set_pgd(pgd + 2 + idx_p4d, __pgd(_PAGE_TABLE | p4d_phys));
-			p4d_phys += PAGE_SIZE;
+			n_pt -= PTRS_PER_PMD;
+			early_memunmap(pmd, PAGE_SIZE);
+			make_lowmem_page_readonly(__va(pmd_phys));
+			pin_pagetable_pfn(MMUEXT_PIN_L2_TABLE,
+					PFN_DOWN(pmd_phys));
+			set_pud(pud + idx_pmd, __pud(_PAGE_TABLE | pmd_phys));
+			pmd_phys += PAGE_SIZE;
 		}
-	} while (++idx_p4d < n_p4d);
+		n_pmd -= PTRS_PER_PUD;
+		early_memunmap(pud, PAGE_SIZE);
+		make_lowmem_page_readonly(__va(pud_phys));
+		pin_pagetable_pfn(MMUEXT_PIN_L3_TABLE, PFN_DOWN(pud_phys));
+		set_pgd(pgd + 2 + idx_pud, __pgd(_PAGE_TABLE | pud_phys));
+		pud_phys += PAGE_SIZE;
+	}
 
 	/* Now copy the old p2m info to the new area. */
 	memcpy(new_p2m, xen_p2m_addr, size);
@@ -2361,7 +2322,7 @@ static void __init xen_post_allocator_init(void)
 	pv_mmu_ops.set_pte = xen_set_pte;
 	pv_mmu_ops.set_pmd = xen_set_pmd;
 	pv_mmu_ops.set_pud = xen_set_pud;
-#if CONFIG_PGTABLE_LEVELS >= 4
+#ifdef CONFIG_X86_64
 	pv_mmu_ops.set_p4d = xen_set_p4d;
 #endif
 
@@ -2371,7 +2332,7 @@ static void __init xen_post_allocator_init(void)
 	pv_mmu_ops.alloc_pmd = xen_alloc_pmd;
 	pv_mmu_ops.release_pte = xen_release_pte;
 	pv_mmu_ops.release_pmd = xen_release_pmd;
-#if CONFIG_PGTABLE_LEVELS >= 4
+#ifdef CONFIG_X86_64
 	pv_mmu_ops.alloc_pud = xen_alloc_pud;
 	pv_mmu_ops.release_pud = xen_release_pud;
 #endif
@@ -2435,14 +2396,14 @@ static const struct pv_mmu_ops xen_mmu_ops __initconst = {
 	.make_pmd = PV_CALLEE_SAVE(xen_make_pmd),
 	.pmd_val = PV_CALLEE_SAVE(xen_pmd_val),
 
-#if CONFIG_PGTABLE_LEVELS >= 4
+#ifdef CONFIG_X86_64
 	.pud_val = PV_CALLEE_SAVE(xen_pud_val),
 	.make_pud = PV_CALLEE_SAVE(xen_make_pud),
 	.set_p4d = xen_set_p4d_hyper,
 
 	.alloc_pud = xen_alloc_pmd_init,
 	.release_pud = xen_release_pmd_init,
-#endif	/* CONFIG_PGTABLE_LEVELS == 4 */
+#endif	/* CONFIG_X86_64 */
 
 	.activate_mm = xen_activate_mm,
 	.dup_mmap = xen_dup_mmap,

^ permalink raw reply related	[flat|nested] 76+ messages in thread

* Re: [PATCH 0/6] Boot-time switching between 4- and 5-level paging for 4.15, Part 1
  2017-10-20  9:41         ` Kirill A. Shutemov
@ 2017-10-20 15:23           ` Ingo Molnar
  -1 siblings, 0 replies; 76+ messages in thread
From: Ingo Molnar @ 2017-10-20 15:23 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Kirill A. Shutemov, Ingo Molnar, Linus Torvalds, x86,
	Thomas Gleixner, H. Peter Anvin, Andrew Morton, Andy Lutomirski,
	Cyrill Gorcunov, Borislav Petkov, linux-mm, linux-kernel


* Kirill A. Shutemov <kirill.shutemov@linux.intel.com> wrote:

> On Fri, Oct 20, 2017 at 08:18:53AM +0000, Ingo Molnar wrote:
> > 
> > * Kirill A. Shutemov <kirill@shutemov.name> wrote:
> > 
> > > On Tue, Oct 03, 2017 at 11:27:54AM +0300, Kirill A. Shutemov wrote:
> > > > On Fri, Sep 29, 2017 at 05:08:15PM +0300, Kirill A. Shutemov wrote:
> > > > > The first bunch of patches that prepare kernel to boot-time switching
> > > > > between paging modes.
> > > > > 
> > > > > Please review and consider applying.
> > > > 
> > > > Ping?
> > > 
> > > Ingo, is there anything I can do to get review easier for you?
> > 
> > Yeah, what is the conclusion on the sub-discussion of patch #2:
> > 
> >   [PATCH 2/6] mm/zsmalloc: Prepare to variable MAX_PHYSMEM_BITS
> > 
> > ... do we want to skip it entirely and use the other 5 patches?
> 
> Yes, please. MAX_PHYSMEM_BITS is not variable yet in this part of the series.
> 
> And I will post some version of the patch in the next part, if it is
> required.

Could we add TRULY_MAX_PHYSMEM_BITS (with a better name), to be used in places 
where memory footprint is not a big concern?

Or, could we keep MAX_PHYSMEM_BITS constant, and introduce a _different_ constant 
that is dynamic, and which could be used in the cases where the 5-level paging 
config causes too much memory footprint in the common 4-level paging case?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 0/6] Boot-time switching between 4- and 5-level paging for 4.15, Part 1
  2017-10-20 15:23           ` Ingo Molnar
@ 2017-10-20 16:23             ` Kirill A. Shutemov
  -1 siblings, 0 replies; 76+ messages in thread
From: Kirill A. Shutemov @ 2017-10-20 16:23 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Kirill A. Shutemov, Ingo Molnar, Linus Torvalds, x86,
	Thomas Gleixner, H. Peter Anvin, Andrew Morton, Andy Lutomirski,
	Cyrill Gorcunov, Borislav Petkov, linux-mm, linux-kernel

On Fri, Oct 20, 2017 at 05:23:46PM +0200, Ingo Molnar wrote:
> 
> * Kirill A. Shutemov <kirill.shutemov@linux.intel.com> wrote:
> 
> > On Fri, Oct 20, 2017 at 08:18:53AM +0000, Ingo Molnar wrote:
> > > 
> > > * Kirill A. Shutemov <kirill@shutemov.name> wrote:
> > > 
> > > > On Tue, Oct 03, 2017 at 11:27:54AM +0300, Kirill A. Shutemov wrote:
> > > > > On Fri, Sep 29, 2017 at 05:08:15PM +0300, Kirill A. Shutemov wrote:
> > > > > > The first bunch of patches that prepare kernel to boot-time switching
> > > > > > between paging modes.
> > > > > > 
> > > > > > Please review and consider applying.
> > > > > 
> > > > > Ping?
> > > > 
> > > > Ingo, is there anything I can do to get review easier for you?
> > > 
> > > Yeah, what is the conclusion on the sub-discussion of patch #2:
> > > 
> > >   [PATCH 2/6] mm/zsmalloc: Prepare to variable MAX_PHYSMEM_BITS
> > > 
> > > ... do we want to skip it entirely and use the other 5 patches?
> > 
> > Yes, please. MAX_PHYSMEM_BITS is not variable yet in this part of the series.
> > 
> > And I will post some version of the patch in the next part, if it is
> > required.
> 
> Could we add TRULY_MAX_PHYSMEM_BITS (with a better name), to be used in places 
> where memory footprint is not a big concern?

That's what I did in the patch. See MAX_POSSIBLE_PHYSMEM_BITS.
Not sure how good the name is.
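
For illustration, the idea is that code which only needs a safe
compile-time upper bound can prefer it when it is defined. A hypothetical
sketch, not necessarily the exact hunk from the patch:

	#ifdef MAX_POSSIBLE_PHYSMEM_BITS
	#define _PFN_BITS	(MAX_POSSIBLE_PHYSMEM_BITS - PAGE_SHIFT)
	#else
	#define _PFN_BITS	(MAX_PHYSMEM_BITS - PAGE_SHIFT)
	#endif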

> Or, could we keep MAX_PHYSMEM_BITS constant, and introduce a _different_ constant 
> that is dynamic, and which could be used in the cases where the 5-level paging 
> config causes too much memory footprint in the common 4-level paging case?

This is a more labor-intensive approach with unclear benefit.

Dynamic MAX_PHYSMEM_BITS doesn't cause any issues in the vast majority of
cases.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 0/6] Boot-time switching between 4- and 5-level paging for 4.15, Part 1
  2017-10-20 16:23             ` Kirill A. Shutemov
@ 2017-10-23 11:56               ` Ingo Molnar
  -1 siblings, 0 replies; 76+ messages in thread
From: Ingo Molnar @ 2017-10-23 11:56 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Kirill A. Shutemov, Ingo Molnar, Linus Torvalds, x86,
	Thomas Gleixner, H. Peter Anvin, Andrew Morton, Andy Lutomirski,
	Cyrill Gorcunov, Borislav Petkov, linux-mm, linux-kernel


* Kirill A. Shutemov <kirill@shutemov.name> wrote:

> > Or, could we keep MAX_PHYSMEM_BITS constant, and introduce a _different_ constant 
> > that is dynamic, and which could be used in the cases where the 5-level paging 
> > config causes too much memory footprint in the common 4-level paging case?
> 
> This is a more labor-intensive approach with unclear benefit.
> 
> Dynamic MAX_PHYSMEM_BITS doesn't cause any issues in the vast majority of
> cases.

Almost nothing uses it - and even in those few cases it caused problems.

Making a variable that 'looks' like a constant macro dynamic in a rare Kconfig 
scenario is asking for trouble.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 0/6] Boot-time switching between 4- and 5-level paging for 4.15, Part 1
  2017-10-23 11:56               ` Ingo Molnar
@ 2017-10-23 12:21                 ` Kirill A. Shutemov
  -1 siblings, 0 replies; 76+ messages in thread
From: Kirill A. Shutemov @ 2017-10-23 12:21 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Kirill A. Shutemov, Ingo Molnar, Linus Torvalds, x86,
	Thomas Gleixner, H. Peter Anvin, Andrew Morton, Andy Lutomirski,
	Cyrill Gorcunov, Borislav Petkov, linux-mm, linux-kernel

On Mon, Oct 23, 2017 at 01:56:58PM +0200, Ingo Molnar wrote:
> 
> * Kirill A. Shutemov <kirill@shutemov.name> wrote:
> 
> > > Or, could we keep MAX_PHYSMEM_BITS constant, and introduce a _different_ constant 
> > > that is dynamic, and which could be used in the cases where the 5-level paging 
> > > config causes too much memory footprint in the common 4-level paging case?
> > 
> > This is a more labor-intensive approach with unclear benefit.
> > 
> > Dynamic MAX_PHYSMEM_BITS doesn't cause any issues in the vast majority of
> > cases.
> 
> Almost nothing uses it - and even in those few cases it caused problems.

It's used in many places indirectly. See MAXMEM.
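
For example, on x86-64 MAXMEM is derived directly from it, roughly:

	#define MAXMEM		(1UL << MAX_PHYSMEM_BITS)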

> Making a variable that 'looks' like a constant macro dynamic in a rare Kconfig 
> scenario is asking for trouble.

We expect boot-time page mode switching to be enabled in kernels of next-
generation enterprise distros. It shouldn't be that rare.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 0/6] Boot-time switching between 4- and 5-level paging for 4.15, Part 1
  2017-10-23 12:21                 ` Kirill A. Shutemov
@ 2017-10-23 12:40                   ` Ingo Molnar
  -1 siblings, 0 replies; 76+ messages in thread
From: Ingo Molnar @ 2017-10-23 12:40 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Kirill A. Shutemov, Ingo Molnar, Linus Torvalds, x86,
	Thomas Gleixner, H. Peter Anvin, Andrew Morton, Andy Lutomirski,
	Cyrill Gorcunov, Borislav Petkov, linux-mm, linux-kernel


* Kirill A. Shutemov <kirill@shutemov.name> wrote:

> > Making a variable that 'looks' like a constant macro dynamic in a rare Kconfig 
> > scenario is asking for trouble.
> 
> We expect boot-time page mode switching to be enabled in kernels of next-
> generation enterprise distros. It shouldn't be that rare.

My point remains even with not-so-rare Kconfig dependency.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 0/6] Boot-time switching between 4- and 5-level paging for 4.15, Part 1
  2017-10-23 12:40                   ` Ingo Molnar
@ 2017-10-23 12:48                     ` Kirill A. Shutemov
  -1 siblings, 0 replies; 76+ messages in thread
From: Kirill A. Shutemov @ 2017-10-23 12:48 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Kirill A. Shutemov, Ingo Molnar, Linus Torvalds, x86,
	Thomas Gleixner, H. Peter Anvin, Andrew Morton, Andy Lutomirski,
	Cyrill Gorcunov, Borislav Petkov, linux-mm, linux-kernel

On Mon, Oct 23, 2017 at 02:40:14PM +0200, Ingo Molnar wrote:
> 
> * Kirill A. Shutemov <kirill@shutemov.name> wrote:
> 
> > > Making a variable that 'looks' like a constant macro dynamic in a rare Kconfig 
> > > scenario is asking for trouble.
> > 
> > We expect boot-time page mode switching to be enabled in kernels of next-
> > generation enterprise distros. It shouldn't be that rare.
> 
> My point remains even with not-so-rare Kconfig dependency.

I don't follow how introducing a new variable that depends on a Kconfig option
would help with the situation.

We would end up with the inverse situation: people would use MAX_PHYSMEM_BITS
where the new variable needs to be used, and we would be in the same situation.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 0/6] Boot-time switching between 4- and 5-level paging for 4.15, Part 1
  2017-10-23 12:48                     ` Kirill A. Shutemov
@ 2017-10-24  9:40                       ` Ingo Molnar
  -1 siblings, 0 replies; 76+ messages in thread
From: Ingo Molnar @ 2017-10-24  9:40 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Kirill A. Shutemov, Ingo Molnar, Linus Torvalds, x86,
	Thomas Gleixner, H. Peter Anvin, Andrew Morton, Andy Lutomirski,
	Cyrill Gorcunov, Borislav Petkov, linux-mm, linux-kernel


* Kirill A. Shutemov <kirill@shutemov.name> wrote:

> On Mon, Oct 23, 2017 at 02:40:14PM +0200, Ingo Molnar wrote:
> > 
> > * Kirill A. Shutemov <kirill@shutemov.name> wrote:
> > 
> > > > Making a variable that 'looks' like a constant macro dynamic in a rare Kconfig 
> > > > scenario is asking for trouble.
> > > 
> > > > We expect boot-time page mode switching to be enabled in kernels of next-
> > > > generation enterprise distros. It shouldn't be that rare.
> > 
> > My point remains even with not-so-rare Kconfig dependency.
> 
> I don't follow how introducing a new variable that depends on a Kconfig option
> would help with the situation.

A new, properly named variable or function (max_physmem_bits or 
max_physmem_bits()) that is not all uppercase would make it abundantly clear that 
it is not a constant but a runtime value.
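
Something like this, purely as an illustration:

	/* Compile-time upper bound, safe for sizing static structures: */
	#define MAX_POSSIBLE_PHYSMEM_BITS	52

	/* Runtime value, set once during boot (46 or 52 on x86-64): */
	extern unsigned int max_physmem_bits;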

> We would end up with the inverse situation: people would use MAX_PHYSMEM_BITS
> where the new variable needs to be used, and we would be in the same situation.

It should result in sub-optimal resource allocations in the worst case, right?

We could also rename it to MAX_POSSIBLE_PHYSMEM_BITS to make it clear that the 
real number of bits can be lower.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 0/6] Boot-time switching between 4- and 5-level paging for 4.15, Part 1
  2017-10-17 15:42     ` Kirill A. Shutemov
@ 2017-10-24 11:32       ` hpa
  -1 siblings, 0 replies; 76+ messages in thread
From: hpa @ 2017-10-24 11:32 UTC (permalink / raw)
  To: Kirill A. Shutemov, Ingo Molnar
  Cc: Kirill A. Shutemov, Linus Torvalds, x86, Thomas Gleixner,
	Andrew Morton, Andy Lutomirski, Cyrill Gorcunov, Borislav Petkov,
	linux-mm, linux-kernel

On October 17, 2017 5:42:41 PM GMT+02:00, "Kirill A. Shutemov" <kirill@shutemov.name> wrote:
>On Tue, Oct 03, 2017 at 11:27:54AM +0300, Kirill A. Shutemov wrote:
>> On Fri, Sep 29, 2017 at 05:08:15PM +0300, Kirill A. Shutemov wrote:
>> > The first bunch of patches that prepare kernel to boot-time
>switching
>> > between paging modes.
>> > 
>> > Please review and consider applying.
>> 
>> Ping?
>
>Ingo, is there anything I can do to get review easier for you?
>
>I hoped to get boot-time switching code into v4.15...

One issue that has come up with this is what happens if the kernel is loaded above 4 GB and we need to switch page table mode.  In that case we need enough memory below the 4 GB point to hold a root page table (since we can't write the upper half of cr3 outside of 64-bit mode) and a handful of instructions.

We have no real way to know for sure what memory is safe without parsing all the memory maps and mapping out all the data structures that the bootloader has left for the kernel.  I'm thinking that the best way to deal with this is to add an entry in setup_data to provide a pointer, with the kernel header specifying the necessary size and alignment.
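
Something along these lines, as a purely hypothetical sketch (the struct
and field names are made up for illustration, not a concrete proposal):

	struct setup_data_low_mem {
		struct setup_data header;	/* would need a new SETUP_* type */
		__u64 addr;	/* physical address below the 4 GB point */
		__u64 size;	/* size/alignment as required by the kernel header */
	};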
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 0/6] Boot-time switching between 4- and 5-level paging for 4.15, Part 1
  2017-10-24  9:40                       ` Ingo Molnar
@ 2017-10-24 11:38                         ` Kirill A. Shutemov
  -1 siblings, 0 replies; 76+ messages in thread
From: Kirill A. Shutemov @ 2017-10-24 11:38 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Kirill A. Shutemov, Ingo Molnar, Linus Torvalds, x86,
	Thomas Gleixner, H. Peter Anvin, Andrew Morton, Andy Lutomirski,
	Cyrill Gorcunov, Borislav Petkov, linux-mm, linux-kernel

On Tue, Oct 24, 2017 at 11:40:40AM +0200, Ingo Molnar wrote:
> 
> * Kirill A. Shutemov <kirill@shutemov.name> wrote:
> 
> > On Mon, Oct 23, 2017 at 02:40:14PM +0200, Ingo Molnar wrote:
> > > 
> > > * Kirill A. Shutemov <kirill@shutemov.name> wrote:
> > > 
> > > > > Making a variable that 'looks' like a constant macro dynamic in a rare Kconfig 
> > > > > scenario is asking for trouble.
> > > > 
> > > > We expect boot-time page mode switching to be enabled in kernels of next-
> > > > generation enterprise distros. It shouldn't be that rare.
> > > 
> > > My point remains even with not-so-rare Kconfig dependency.
> > 
> > I don't follow how introducing a new variable that depends on a Kconfig option
> > would help with the situation.
> 
> A new, properly named variable or function (max_physmem_bits or 
> max_physmem_bits()) that is not all uppercase would make it abundantly clear that 
> it is not a constant but a runtime value.

Would we need to rename every uppercase macro that would depend on
max_physmem_bits()? Like MAXMEM.

> We would end up with the inverse situation: people would use MAX_PHYSMEM_BITS
> where the new variable needs to be used, and we would be in the same situation.
> 
> It should result in sub-optimal resource allocations in the worst case, right?

I don't think it's the worst case.

For instance, virt_addr_valid() depends indirectly on it:

  virt_addr_valid()
    __virt_addr_valid()
      phys_addr_valid()
        boot_cpu_data.x86_phys_bits (initialized with MAX_PHYSMEM_BITS)
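
The last step is roughly this check on x86-64:

	static inline int phys_addr_valid(resource_size_t addr)
	{
		return !(addr >> boot_cpu_data.x86_phys_bits);
	}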

virt_addr_valid() is used in things like the implementation of /dev/kmem.

To me it's far more risky than occasional build breakage for
CONFIG_X86_5LEVEL=y.

> We could also rename it to MAX_POSSIBLE_PHYSMEM_BITS to make it clear that the 
> real number of bits can be lower.

If you still insist, I'll rework the code as you describe, but I disagree
that it's the best way to go.

We also need to make other uppercase macros dynamic, like PGDIR_SHIFT or
PTRS_PER_P4D. Reworking them in the same way would be *far* more complex, as
they (and their derivatives) are used heavily in generic code (see the sketch
below).
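
A sketch of what making them dynamic could look like, assuming some
runtime flag like pgtable_l5_enabled set during boot (illustrative only,
not the final form):

	/* pgtable_l5_enabled: hypothetical boot-time flag */
	#define PGDIR_SHIFT	(pgtable_l5_enabled ? 48 : 39)
	#define PTRS_PER_P4D	(pgtable_l5_enabled ? 512 : 1)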

To me it's a lot of code for little to no benefit.

P.S. Could you please take a look at the x86/boot/compressed/64 changes?

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 0/6] Boot-time switching between 4- and 5-level paging for 4.15, Part 1
  2017-10-24 11:32       ` hpa
@ 2017-10-24 11:43         ` Kirill A. Shutemov
  -1 siblings, 0 replies; 76+ messages in thread
From: Kirill A. Shutemov @ 2017-10-24 11:43 UTC (permalink / raw)
  To: hpa
  Cc: Ingo Molnar, Kirill A. Shutemov, Linus Torvalds, x86,
	Thomas Gleixner, Andrew Morton, Andy Lutomirski, Cyrill Gorcunov,
	Borislav Petkov, linux-mm, linux-kernel

On Tue, Oct 24, 2017 at 01:32:51PM +0200, hpa@zytor.com wrote:
> On October 17, 2017 5:42:41 PM GMT+02:00, "Kirill A. Shutemov" <kirill@shutemov.name> wrote:
> >On Tue, Oct 03, 2017 at 11:27:54AM +0300, Kirill A. Shutemov wrote:
> >> On Fri, Sep 29, 2017 at 05:08:15PM +0300, Kirill A. Shutemov wrote:
> >> > The first bunch of patches that prepare kernel to boot-time
> >switching
> >> > between paging modes.
> >> > 
> >> > Please review and consider applying.
> >> 
> >> Ping?
> >
> >Ingo, is there anything I can do to get review easier for you?
> >
> >I hoped to get boot-time switching code into v4.15...
> 
> One issue that has come up with this is what happens if the kernel is
> loaded above 4 GB and we need to switch page table mode.  In that case
> we need enough memory below the 4 GB point to hold a root page table
> (since we can't write the upper half of cr3 outside of 64-bit mode) and
> a handful of instructions.
> 
> We have no real way to know for sure what memory is safe without parsing
> all the memory maps and mapping out all the data structures that the
> bootloader has left for the kernel.  I'm thinking that the best way to
> deal with this is to add an entry in setup_data to provide a pointer,
> with the kernel header specifying the necessary size and alignment.

I would appreciate your feedback on my take on this:

http://lkml.kernel.org/r/20171020195934.32108-1-kirill.shutemov@linux.intel.com

I don't change the boot protocol, but try to guess a safe spot in a way
similar to what we do for the realmode trampoline.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 0/6] Boot-time switching between 4- and 5-level paging for 4.15, Part 1
  2017-10-24 11:38                         ` Kirill A. Shutemov
@ 2017-10-24 12:47                           ` Ingo Molnar
  -1 siblings, 0 replies; 76+ messages in thread
From: Ingo Molnar @ 2017-10-24 12:47 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Kirill A. Shutemov, Ingo Molnar, Linus Torvalds, x86,
	Thomas Gleixner, H. Peter Anvin, Andrew Morton, Andy Lutomirski,
	Cyrill Gorcunov, Borislav Petkov, linux-mm, linux-kernel


* Kirill A. Shutemov <kirill@shutemov.name> wrote:

> On Tue, Oct 24, 2017 at 11:40:40AM +0200, Ingo Molnar wrote:
> > 
> > * Kirill A. Shutemov <kirill@shutemov.name> wrote:
> > 
> > > On Mon, Oct 23, 2017 at 02:40:14PM +0200, Ingo Molnar wrote:
> > > > 
> > > > * Kirill A. Shutemov <kirill@shutemov.name> wrote:
> > > > 
> > > > > > Making a variable that 'looks' like a constant macro dynamic in a rare Kconfig 
> > > > > > scenario is asking for trouble.
> > > > > 
> > > > > We expect boot-time page mode switching to be enabled in kernels of next-
> > > > > generation enterprise distros. It shouldn't be that rare.
> > > > 
> > > > My point remains even with not-so-rare Kconfig dependency.
> > > 
> > > I don't follow how introducing a new variable that depends on a Kconfig option
> > > would help with the situation.
> > 
> > A new, properly named variable or function (max_physmem_bits or 
> > max_physmem_bits()) that is not all uppercase would make it abundantly clear that 
> > it is not a constant but a runtime value.
> 
> Would we need to rename every uppercase macro that would depend on
> max_physmem_bits()? Like MAXMEM.

MAXMEM isn't used in too many places either - what's the total impact of it?

> > > We would end up with the inverse situation: people would use MAX_PHYSMEM_BITS
> > > where the new variable needs to be used, and we would be in the same situation.
> > 
> > It should result in sub-optimal resource allocations in the worst case, right?
> 
> I don't think it's the worst case.
> 
> For instance, virt_addr_valid() depends indirectly on it:
> 
>   virt_addr_valid()
>     __virt_addr_valid()
>       phys_addr_valid()
>         boot_cpu_data.x86_phys_bits (initialized with MAX_PHYSMEM_BITS)
> 
> virt_addr_valid() is used in things like the implementation of /dev/kmem.
> 
> To me it's far more risky than occasional build breakage for
> CONFIG_X86_5LEVEL=y.

So why do we have two variables here, one boot_cpu_data.x86_phys_bits and the 
other MAX_PHYSMEM_BITS - both set once during boot?

I'm trying to find a clean solution for this all - hiding a boot-time dependency
in a constant-looking value doesn't feel clean.
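
To make the naming point concrete, a sketch of the two styles (the runtime
variable follows the naming suggested above; none of this is actual patch
code, and the 52/46 values assume the x86-64 5-/4-level limits):

  /* Constant-looking, but secretly dynamic: */
  #define MAX_PHYSMEM_BITS        (pgtable_l5_enabled ? 52 : 46)

  /* Obviously a runtime value, set once during early boot: */
  extern unsigned int max_physmem_bits;

  void __init set_max_physmem_bits(void)
  {
          max_physmem_bits = pgtable_l5_enabled ? 52 : 46;
  }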

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 0/6] Boot-time switching between 4- and 5-level paging for 4.15, Part 1
  2017-10-24 12:47                           ` Ingo Molnar
@ 2017-10-24 13:12                             ` Kirill A. Shutemov
  -1 siblings, 0 replies; 76+ messages in thread
From: Kirill A. Shutemov @ 2017-10-24 13:12 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Kirill A. Shutemov, Ingo Molnar, Linus Torvalds, x86,
	Thomas Gleixner, H. Peter Anvin, Andrew Morton, Andy Lutomirski,
	Cyrill Gorcunov, Borislav Petkov, linux-mm, linux-kernel

On Tue, Oct 24, 2017 at 02:47:41PM +0200, Ingo Molnar wrote:
> > > > > > > Making a variable that 'looks' like a constant macro dynamic in a rare Kconfig 
> > > > > > > scenario is asking for trouble.
> > > > > > 
> > > > > > We expect boot-time page mode switching to be enabled in kernels of
> > > > > > next-generation enterprise distros. It shouldn't be that rare.
> > > > > 
> > > > > My point remains even with not-so-rare Kconfig dependency.
> > > > 
> > > > I don't follow how introducing a new variable that depends on a Kconfig
> > > > option would help with the situation.
> > > 
> > > A new, properly named variable or function (max_physmem_bits or 
> > > max_physmem_bits()) that is not all uppercase would make it abundantly clear that 
> > > it is not a constant but a runtime value.
> > 
> > Would we need to rename every uppercase macro that would depend on
> > max_physmem_bits()? Like MAXMEM.
> 
> MAXMEM isn't used in too many places either - what's the total impact of it?

The impact is not that small. The tree of macros dependent on
MAX_PHYSMEM_BITS:

MAX_PHYSMEM_BITS
  MAXMEM
    KEXEC_SOURCE_MEMORY_LIMIT
    KEXEC_DESTINATION_MEMORY_LIMIT
    KEXEC_CONTROL_MEMORY_LIMIT
  SECTIONS_SHIFT
    ZONEID_SHIFT
      ZONEID_PGSHIFT
      ZONEID_MASK

The total number of users of them is not large. It's doable. But I expect
it to be somewhat ugly, since we're partly in generic code and it would
require some kind of compatibility layer for other architectures.

Do you want me to rename them all?

> > > > We would end up with the inverse situation: people would use MAX_PHYSMEM_BITS
> > > > where the new variable needs to be used, and we would be in the same situation.
> > > 
> > > It should result in sub-optimal resource allocations worst-case, right?
> > 
> > I don't think it's the worst case.
> > 
> > For instance, virt_addr_valid() depends indirectly on it:
> > 
> >   virt_addr_valid()
> >     __virt_addr_valid()
> >       phys_addr_valid()
> >         boot_cpu_data.x86_phys_bits (initialized with MAX_PHYSMEM_BITS)
> > 
> > virt_addr_valid() is used in things like the implementation of /dev/kmem.
> > 
> > To me it's far more risky than occasional build breakage for
> > CONFIG_X86_5LEVEL=y.
> 
> So why do we have two variables here, one boot_cpu_data.x86_phys_bits and the 
> other MAX_PHYSMEM_BITS - both set once during boot?
> 
> I'm trying to find a clean solution for this all - hiding a boot time dependency 
> into a constant-looking value doesn't feel clean.

We already have plenty of them: PAGE_OFFSET, IA32_PAGE_OFFSET,
VMALLOC_START, VMEMMAP_START, TASK_SIZE, STACK_TOP, FIXADDR_TOP...

I don't understand why you treat this one as special.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 0/6] Boot-time switching between 4- and 5-level paging for 4.15, Part 1
  2017-10-24 13:12                             ` Kirill A. Shutemov
@ 2017-10-26  7:37                               ` Ingo Molnar
  -1 siblings, 0 replies; 76+ messages in thread
From: Ingo Molnar @ 2017-10-26  7:37 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Kirill A. Shutemov, Ingo Molnar, Linus Torvalds, x86,
	Thomas Gleixner, H. Peter Anvin, Andrew Morton, Andy Lutomirski,
	Cyrill Gorcunov, Borislav Petkov, linux-mm, linux-kernel


* Kirill A. Shutemov <kirill@shutemov.name> wrote:

> On Tue, Oct 24, 2017 at 02:47:41PM +0200, Ingo Molnar wrote:
> > > > > > > > Making a variable that 'looks' like a constant macro dynamic in a rare Kconfig 
> > > > > > > > scenario is asking for trouble.
> > > > > > > 
> > > > > > > We expect boot-time page mode switching to be enabled in kernels of
> > > > > > > next-generation enterprise distros. It shouldn't be that rare.
> > > > > > 
> > > > > > My point remains even with not-so-rare Kconfig dependency.
> > > > > 
> > > > > I don't follow how introducing a new variable that depends on a Kconfig
> > > > > option would help with the situation.
> > > > 
> > > > A new, properly named variable or function (max_physmem_bits or 
> > > > max_physmem_bits()) that is not all uppercase would make it abundantly clear that 
> > > > it is not a constant but a runtime value.
> > > 
> > > Would we need to rename every uppercase macro that would depend on
> > > max_physmem_bits()? Like MAXMEM.
> > 
> > MAXMEM isn't used in too many places either - what's the total impact of it?
> 
> The impact is not that small. The tree of macros dependent on
> MAX_PHYSMEM_BITS:
> 
> MAX_PHYSMEM_BITS
>   MAXMEM
>     KEXEC_SOURCE_MEMORY_LIMIT
>     KEXEC_DESTINATION_MEMORY_LIMIT
>     KEXEC_CONTROL_MEMORY_LIMIT
>   SECTIONS_SHIFT
>     ZONEID_SHIFT
>       ZONEID_PGSHIFT
>       ZONEID_MASK
> 
> The total number of users of them is not large. It's doable. But I expect
> it to be somewhat ugly, since we're partly in generic code and it would
> require some kind of compatibility layer for other architectures.
> 
> Do you want me to rename them all?

Yeah, I think these former constants should be organized better.

Here's their usage frequency:

 triton:~/tip> for N in MAX_PHYSMEM_BITS MAXMEM KEXEC_SOURCE_MEMORY_LIMIT \
 KEXEC_DESTINATION_MEMORY_LIMIT KEXEC_CONTROL_MEMORY_LIMIT SECTIONS_SHIFT \
 ZONEID_SHIFT ZONEID_PGSHIFT ZONEID_MASK; do printf "  %-40s: " $N; git grep -w $N  | grep -vE 'define| \* ' | wc -l; done

   MAX_PHYSMEM_BITS                        : 10
   MAXMEM                                  : 5
   KEXEC_SOURCE_MEMORY_LIMIT               : 2
   KEXEC_DESTINATION_MEMORY_LIMIT          : 2
   KEXEC_CONTROL_MEMORY_LIMIT              : 2
   SECTIONS_SHIFT                          : 2
   ZONEID_SHIFT                            : 1
   ZONEID_PGSHIFT                          : 1
   ZONEID_MASK                             : 1

So it's not too bad to clean up, I think.

How about something like this:

	machine.physmem.max_bytes		/* ex MAXMEM */
	machine.physmem.max_bits		/* bit count of the highest in-use physical address */
	machine.physmem.zones.id_shift		/* ZONEID_SHIFT */
	machine.physmem.zones.pg_shift		/* ZONEID_PGSHIFT */
	machine.physmem.zones.id_mask		/* ZONEID_MASK */

	machine.kexec.physmem_bytes_src		/* KEXEC_SOURCE_MEMORY_LIMIT */
	machine.kexec.physmem_bytes_dst		/* KEXEC_DESTINATION_MEMORY_LIMIT */

( With perhaps 'physmem' being an alias to '&machine->physmem', so that 
  physmem->max_bytes and physmem->max_bits would be a natural thing to write. )

I'd suggest doing this in a fine-grained fashion, one step at a time, introducing
'struct machine' and 'struct physmem' and extending it gradually with new fields.
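
Spelled out, the proposed layout might look something like this (a sketch
only -- the field types are guesses, and no such struct exists today):

  struct physmem {
          unsigned long   max_bytes;       /* ex MAXMEM */
          unsigned int    max_bits;        /* highest in-use phys address bit */
          struct {
                  unsigned int    id_shift; /* ex ZONEID_SHIFT */
                  unsigned int    pg_shift; /* ex ZONEID_PGSHIFT */
                  unsigned long   id_mask;  /* ex ZONEID_MASK */
          } zones;
  };

  struct machine {
          struct physmem physmem;
          struct {
                  unsigned long physmem_bytes_src; /* ex KEXEC_SOURCE_MEMORY_LIMIT */
                  unsigned long physmem_bytes_dst; /* ex KEXEC_DESTINATION_MEMORY_LIMIT */
          } kexec;
  };

  extern struct machine machine;
  #define physmem (&machine.physmem)  /* so physmem->max_bits reads naturally */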

To re-discuss the virt_addr_valid() concern you raised:

> > For instance, virt_addr_valid() depends indirectly on it:
> > 
> >   virt_addr_valid()
> >     __virt_addr_valid()
> >       phys_addr_valid()
> >         boot_cpu_data.x86_phys_bits (initialized with MAX_PHYSMEM_BITS)
> > 
> > virt_addr_valid() is used in things like the implementation of /dev/kmem.
> > 
> > To me it's far more risky than occasional build breakage for
> > CONFIG_X86_5LEVEL=y.
> 
> So why do we have two variables here, one boot_cpu_data.x86_phys_bits and the
> other MAX_PHYSMEM_BITS - both set once during boot?

So it's still unclear to me why virt_addr_valid() would be a problem: this 
function could probably (in a separate patch) use physmem->max_bits, which would 
make it more secure than using even a dynamic MAX_PHYSMEM_BITS: it would detect 
any physical addresses that are beyond the recognized maximum range.

I.e. all this would result in further improvements.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 0/6] Boot-time switching between 4- and 5-level paging for 4.15, Part 1
  2017-10-26  7:37                               ` Ingo Molnar
@ 2017-10-26 14:40                                 ` Kirill A. Shutemov
  -1 siblings, 0 replies; 76+ messages in thread
From: Kirill A. Shutemov @ 2017-10-26 14:40 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Kirill A. Shutemov, Ingo Molnar, Linus Torvalds, x86,
	Thomas Gleixner, H. Peter Anvin, Andrew Morton, Andy Lutomirski,
	Cyrill Gorcunov, Borislav Petkov, linux-mm, linux-kernel

On Thu, Oct 26, 2017 at 09:37:52AM +0200, Ingo Molnar wrote:
> 
> * Kirill A. Shutemov <kirill@shutemov.name> wrote:
> 
> > On Tue, Oct 24, 2017 at 02:47:41PM +0200, Ingo Molnar wrote:
> > > > > > > > > Making a variable that 'looks' like a constant macro dynamic in a rare Kconfig 
> > > > > > > > > scenario is asking for trouble.
> > > > > > > > 
> > > > > > > > We expect boot-time page mode switching to be enabled in kernels of
> > > > > > > > next-generation enterprise distros. It shouldn't be that rare.
> > > > > > > 
> > > > > > > My point remains even with not-so-rare Kconfig dependency.
> > > > > > 
> > > > > > I don't follow how introducing a new variable that depends on a Kconfig
> > > > > > option would help with the situation.
> > > > > 
> > > > > A new, properly named variable or function (max_physmem_bits or 
> > > > > max_physmem_bits()) that is not all uppercase would make it abundantly clear that 
> > > > > it is not a constant but a runtime value.
> > > > 
> > > > Would we need to rename every uppercase macro that would depend on
> > > > max_physmem_bits()? Like MAXMEM.
> > > 
> > > MAXMEM isn't used in too many places either - what's the total impact of it?
> > 
> > The impact is not that small. The tree of macros dependent on
> > MAX_PHYSMEM_BITS:
> > 
> > MAX_PHYSMEM_BITS
> >   MAXMEM
> >     KEXEC_SOURCE_MEMORY_LIMIT
> >     KEXEC_DESTINATION_MEMORY_LIMIT
> >     KEXEC_CONTROL_MEMORY_LIMIT
> >   SECTIONS_SHIFT
> >     ZONEID_SHIFT
> >       ZONEID_PGSHIFT
> >       ZONEID_MASK
> > 
> > The total number of users of them is not large. It's doable. But I expect
> > it to be somewhat ugly, since we're partly in generic code and it would
> > require some kind of compatibility layer for other architectures.
> > 
> > Do you want me to rename them all?
> 
> Yeah, I think these former constants should be organized better.
> 
> Here's their usage frequency:
> 
>  triton:~/tip> for N in MAX_PHYSMEM_BITS MAXMEM KEXEC_SOURCE_MEMORY_LIMIT \
>  KEXEC_DESTINATION_MEMORY_LIMIT KEXEC_CONTROL_MEMORY_LIMIT SECTIONS_SHIFT \
>  ZONEID_SHIFT ZONEID_PGSHIFT ZONEID_MASK; do printf "  %-40s: " $N; git grep -w $N  | grep -vE 'define| \* ' | wc -l; done
> 
>    MAX_PHYSMEM_BITS                        : 10
>    MAXMEM                                  : 5
>    KEXEC_SOURCE_MEMORY_LIMIT               : 2
>    KEXEC_DESTINATION_MEMORY_LIMIT          : 2
>    KEXEC_CONTROL_MEMORY_LIMIT              : 2
>    SECTIONS_SHIFT                          : 2
>    ZONEID_SHIFT                            : 1
>    ZONEID_PGSHIFT                          : 1
>    ZONEID_MASK                             : 1
> 
> So it's not too bad to clean up, I think.
> 
> How about something like this:
> 
> 	machine.physmem.max_bytes		/* ex MAXMEM */
> 	machine.physmem.max_bits		/* bit count of the highest in-use physical address */
> 	machine.physmem.zones.id_shift		/* ZONEID_SHIFT */
> 	machine.physmem.zones.pg_shift		/* ZONEID_PGSHIFT */
> 	machine.physmem.zones.id_mask		/* ZONEID_MASK */
> 
> 	machine.kexec.physmem_bytes_src		/* KEXEC_SOURCE_MEMORY_LIMIT */
> 	machine.kexec.physmem_bytes_dst		/* KEXEC_DESTINATION_MEMORY_LIMIT */
> 
> ( With perhaps 'physmem' being an alias to '&machine->physmem', so that 
>   physmem->max_bytes and physmem->max_bits would be a natural thing to write. )
> 
> I'd suggest doing this in a fine-grained fashion, one step at a time, introducing
> 'struct machine' and 'struct physmem' and extending it gradually with new fields.

I don't think this design is reasonable.

  - It introduces memory references where we haven't had them before.

    At this point all the variables would fit in a cache line, which is not
    that bad. But I don't see what would stop the list from growing in the
    future.

  - We lose the ability to optimize out the check with static branches
    (cpu_feature_enabled() instead of a pgtable_l5_enabled variable); see
    the sketch below.

    It's probably not that big of an issue here, but if we are going to
    use the same approach for other dynamic macros in the patchset, it
    might be.

  - AFAICS, it requires changes to all architectures to provide such
    structures, as we are now partly in generic code.

    Or to introduce some kind of compatibility layer, but it would make
    the kernel as a whole uglier rather than cleaner. Especially given that
    nobody beyond x86 needs this.

To me it's a pretty poor trade-off for a cleanup.
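
To illustrate the static-branch point above -- a sketch assuming 5-level
paging is keyed off X86_FEATURE_LA57 as elsewhere in the patchset; the
struct-based alternative and its field are hypothetical:

  /* Patched at boot into a plain jmp/nop -- no memory load: */
  static inline bool pgtable_l5_enabled(void)
  {
          return cpu_feature_enabled(X86_FEATURE_LA57);
  }
  #define PGDIR_SHIFT     (pgtable_l5_enabled() ? 48 : 39)

  /* Versus the struct-based variant, which always loads from memory:
   *
   *      #define PGDIR_SHIFT     (machine.physmem.pgdir_shift)
   */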

> To re-discuss the virt_addr_valid() concern you raised:
> 
> > > For instance, virt_addr_valid() depends indirectly on it:
> > > 
> > >   virt_addr_valid()
> > >     __virt_addr_valid()
> > >       phys_addr_valid()
> > >         boot_cpu_data.x86_phys_bits (initialized with MAX_PHYSMEM_BITS)
> > > 
> > > virt_addr_valid() is used in things like the implementation of /dev/kmem.
> > > 
> > > To me it's far more risky than occasional build breakage for
> > > CONFIG_X86_5LEVEL=y.
> > 
> > So why do we have two variables here, one boot_cpu_data.x86_phys_bits and the
> > other MAX_PHYSMEM_BITS - both set once during boot?
> 
> So it's still unclear to me why virt_addr_valid() would be a problem: this 
> function could probably (in a separate patch) use physmem->max_bits, which would 
> make it more secure than using even a dynamic MAX_PHYSMEM_BITS: it would detect 
> any physical addresses that are beyond the recognized maximum range.

Here we discussed what would happen if we leave MAX_PHYSMEM_BITS as a
constant -- the maximum possible physmem bits, regardless of paging mode --
and introduce a new variable that would reflect the actual maximum.

And this was an example of a case that may misbehave (not only bloat a
data structure) if we forget to change MAX_PHYSMEM_BITS to the
new variable.
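
Concretely, the failure mode (a sketch; the 52/46 values assume the x86-64
5-/4-level limits, and the check only loosely mirrors phys_addr_valid()):

  #define MAX_PHYSMEM_BITS        52      /* kept at the possible maximum */

  bool phys_addr_valid(unsigned long addr)
  {
          /* should test against the new runtime maximum instead */
          return !(addr >> MAX_PHYSMEM_BITS);
  }

  /* On a 4-level machine only 46 bits are usable, so addresses with
     bits 46..51 set would still pass as "valid" -- misbehaviour, not
     just a bloated data structure. */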

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 0/6] Boot-time switching between 4- and 5-level paging for 4.15, Part 1
  2017-10-26 14:40                                 ` Kirill A. Shutemov
@ 2017-10-31  9:47                                   ` Ingo Molnar
  -1 siblings, 0 replies; 76+ messages in thread
From: Ingo Molnar @ 2017-10-31  9:47 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Kirill A. Shutemov, Ingo Molnar, Linus Torvalds, x86,
	Thomas Gleixner, H. Peter Anvin, Andrew Morton, Andy Lutomirski,
	Cyrill Gorcunov, Borislav Petkov, linux-mm, linux-kernel


* Kirill A. Shutemov <kirill@shutemov.name> wrote:

> On Thu, Oct 26, 2017 at 09:37:52AM +0200, Ingo Molnar wrote:
> > 
> > * Kirill A. Shutemov <kirill@shutemov.name> wrote:
> > 
> > > On Tue, Oct 24, 2017 at 02:47:41PM +0200, Ingo Molnar wrote:
> > > > > > > > > > Making a variable that 'looks' like a constant macro dynamic in a rare Kconfig 
> > > > > > > > > > scenario is asking for trouble.
> > > > > > > > > 
> > > > > > > > > We expect boot-time page mode switching to be enabled in kernels of
> > > > > > > > > next-generation enterprise distros. It shouldn't be that rare.
> > > > > > > > 
> > > > > > > > My point remains even with not-so-rare Kconfig dependency.
> > > > > > > 
> > > > > > > I don't follow how introducing a new variable that depends on a Kconfig
> > > > > > > option would help with the situation.
> > > > > > 
> > > > > > A new, properly named variable or function (max_physmem_bits or 
> > > > > > max_physmem_bits()) that is not all uppercase would make it abundantly clear that 
> > > > > > it is not a constant but a runtime value.
> > > > > 
> > > > > Would we need to rename every uppercase macro that would depend on
> > > > > max_physmem_bits()? Like MAXMEM.
> > > > 
> > > > MAXMEM isn't used in too many places either - what's the total impact of it?
> > > 
> > > The impact is not that small. The tree of macros dependent on
> > > MAX_PHYSMEM_BITS:
> > > 
> > > MAX_PHYSMEM_BITS
> > >   MAXMEM
> > >     KEXEC_SOURCE_MEMORY_LIMIT
> > >     KEXEC_DESTINATION_MEMORY_LIMIT
> > >     KEXEC_CONTROL_MEMORY_LIMIT
> > >   SECTIONS_SHIFT
> > >     ZONEID_SHIFT
> > >       ZONEID_PGSHIFT
> > >       ZONEID_MASK
> > > 
> > > The total number of users of them is not large. It's doable. But I expect
> > > it to be somewhat ugly, since we're partly in generic code and it would
> > > require some kind of compatibility layer for other architectures.
> > > 
> > > Do you want me to rename them all?
> > 
> > Yeah, I think these former constants should be organized better.
> > 
> > Here's their usage frequency:
> > 
> >  triton:~/tip> for N in MAX_PHYSMEM_BITS MAXMEM KEXEC_SOURCE_MEMORY_LIMIT \
> >  KEXEC_DESTINATION_MEMORY_LIMIT KEXEC_CONTROL_MEMORY_LIMIT SECTIONS_SHIFT \
> >  ZONEID_SHIFT ZONEID_PGSHIFT ZONEID_MASK; do printf "  %-40s: " $N; git grep -w $N  | grep -vE 'define| \* ' | wc -l; done
> > 
> >    MAX_PHYSMEM_BITS                        : 10
> >    MAXMEM                                  : 5
> >    KEXEC_SOURCE_MEMORY_LIMIT               : 2
> >    KEXEC_DESTINATION_MEMORY_LIMIT          : 2
> >    KEXEC_CONTROL_MEMORY_LIMIT              : 2
> >    SECTIONS_SHIFT                          : 2
> >    ZONEID_SHIFT                            : 1
> >    ZONEID_PGSHIFT                          : 1
> >    ZONEID_MASK                             : 1
> > 
> > So it's not too bad to clean up, I think.
> > 
> > How about something like this:
> > 
> > 	machine.physmem.max_bytes		/* ex MAXMEM */
> > 	machine.physmem.max_bits		/* bit count of the highest in-use physical address */
> > 	machine.physmem.zones.id_shift		/* ZONEID_SHIFT */
> > 	machine.physmem.zones.pg_shift		/* ZONEID_PGSHIFT */
> > 	machine.physmem.zones.id_mask		/* ZONEID_MASK */
> > 
> > 	machine.kexec.physmem_bytes_src		/* KEXEC_SOURCE_MEMORY_LIMIT */
> > 	machine.kexec.physmem_bytes_dst		/* KEXEC_DESTINATION_MEMORY_LIMIT */
> > 
> > ( With perhaps 'physmem' being an alias to '&machine->physmem', so that 
> >   physmem->max_bytes and physmem->max_bits would be a natural thing to write. )
> > 
> > I'd suggest doing this in a fine-grained fashion, one step at a time, introducing
> > 'struct machine' and 'struct physmem' and extending it gradually with new fields.
> 
> I don't think this design is reasonable.
> 
>   - It introduces memory references where we haven't had them before.
> 
>     At this point all the variables would fit in a cache line, which is not
>     that bad. But I don't see what would stop the list from growing in the
>     future.

Are any of these actually in a hotpath?

Also, note the context: your changes turn some of these into variables. Yes, I 
suggest structuring them all and turning them all into variables, exactly because 
the majority are now dynamic, yet their _naming_ suggests that they are constants.

>   - We lose the ability to optimize out the check with static branches
>     (cpu_feature_enabled() instead of a pgtable_l5_enabled variable).
> 
>     It's probably not that big of an issue here, but if we are going to
>     use the same approach for other dynamic macros in the patchset, it
>     might be.

Here too I think the (vast) majority of the uses are for bootup/setup/init
purposes, where clarity and maintainability of code matter a lot.

>   - AFAICS, it requires changes to all architectures to provide such
>     structures, as we are now partly in generic code.
> 
>     Or to introduce some kind of compatibility layer, but it would make
>     the kernel as a whole uglier rather than cleaner. Especially given that
>     nobody beyond x86 needs this.

Yes, all the uses should be harmonized (no compatibility layer) - but as you can
see from the histogram I generated it's a few dozen uses, i.e. not too bad.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 0/6] Boot-time switching between 4- and 5-level paging for 4.15, Part 1
  2017-10-31  9:47                                   ` Ingo Molnar
@ 2017-10-31 12:04                                     ` Kirill A. Shutemov
  -1 siblings, 0 replies; 76+ messages in thread
From: Kirill A. Shutemov @ 2017-10-31 12:04 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Kirill A. Shutemov, Ingo Molnar, Linus Torvalds, x86,
	Thomas Gleixner, H. Peter Anvin, Andrew Morton, Andy Lutomirski,
	Cyrill Gorcunov, Borislav Petkov, linux-mm, linux-kernel

On Tue, Oct 31, 2017 at 10:47:27AM +0100, Ingo Molnar wrote:
> 
> * Kirill A. Shutemov <kirill@shutemov.name> wrote:
> > I don't think this design is reasonable.
> > 
> >   - It introduces memory references where we haven't had them before.
> > 
> >     At this point all the variables would fit in a cache line, which is not
> >     that bad. But I don't see what would stop the list from growing in the
> >     future.
> 
> Are any of these actually in a hotpath?

Probably no. The closest thing to a hotpath I see so far is page_zone_id()
in the page allocator.

> Also, note the context: your changes turn some of these into variables. Yes, I 
> suggest structuring them all and turning them all into variables, exactly because 
> the majority are now dynamic, yet their _naming_ suggests that they are constants.

Another way to put it would be that you suggest a significant rework of kernel
machinery based on a cosmetic nitpick. :)

> >   - We lose the ability to optimize out the check with static branches
> >     (cpu_feature_enabled() instead of a pgtable_l5_enabled variable).
> > 
> >     It's probably not that big of an issue here, but if we are going to
> >     use the same approach for other dynamic macros in the patchset, it
> >     might be.
> 
> Here too I think the (vast) majority of the uses are for bootup/setup/init
> purposes, where clarity and maintainability of code matter a lot.

I would argue that it makes maintainability worse.

It makes the dependencies between values less obvious. For instance, checking
the MAXMEM definition on x86-64 makes it obvious that it depends directly
on MAX_PHYSMEM_BITS.
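
For example (lightly simplified from arch/x86/include/asm/page_64_types.h):

  #define MAXMEM          (1UL << MAX_PHYSMEM_BITS)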

If we converted MAXMEM to a variable, we would need to check where the
variable is initialized and make sure that nobody changes it afterwards.

Does it sound like a win for maintainability?

> >   - AFAICS, it requires changes to all architectures to provide such
> >     structures, as we are now partly in generic code.
> > 
> >     Or to introduce some kind of compatibility layer, but it would make
> >     the kernel as a whole uglier rather than cleaner. Especially given that
> >     nobody beyond x86 needs this.
> 
> Yes, all the uses should be harmonized (no compatibility layer) - but as you can
> see from the histogram I generated it's a few dozen uses, i.e. not too bad.

Without a compatibility layer, I would need to change every architecture.
That's a few dozen patches, easily. Not fun.

---------------------------------8<------------------------------------

Putting my disagreement with the design aside, I tried to prototype it,
and stumbled on an issue that I don't see how to solve.

If we are going to convert the macros to variables whether or not they need
to be variable in the given configuration, we quickly paint ourselves into a
corner:

 - SECTIONS_SHIFT is dependent on MAX_PHYSMEM_BITS.

 - SECTIONS_SHIFT is used to define SECTIONS_WIDTH, but only if
   CONFIG_SPARSEMEM_VMEMMAP is not enabled. SECTIONS_WIDTH is zero
   otherwise.

At this point we can convert both SECTIONS_SHIFT and SECTIONS_WIDTH to
variables.

But SECTIONS_WIDTH is used at the preprocessor level to determine NODES_WIDTH,
which is used to determine whether we are going to define NODE_NOT_IN_PAGE_FLAGS
and the value of LAST_CPUPID_WIDTH (the chain is sketched below).

Making SECTIONS_WIDTH a variable breaks the preprocessor logic. But the
problems don't stop there:

  - LAST_CPUPID_WIDTH determines whether LAST_CPUPID_NOT_IN_PAGE_FLAGS is defined.

  - LAST_CPUPID_NOT_IN_PAGE_FLAGS is used to define struct page and therefore
    cannot be dynamic (read: a variable).
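
The chain, abridged from include/linux/page-flags-layout.h (the guards are
simplified here; the exact in-tree conditions differ slightly):

  #define SECTIONS_SHIFT  (MAX_PHYSMEM_BITS - SECTION_SIZE_BITS)

  #if !defined(CONFIG_SPARSEMEM) || defined(CONFIG_SPARSEMEM_VMEMMAP)
  #define SECTIONS_WIDTH  0
  #else
  #define SECTIONS_WIDTH  SECTIONS_SHIFT
  #endif

  #if SECTIONS_WIDTH + ZONES_WIDTH + NODES_SHIFT <= BITS_PER_LONG - NR_PAGEFLAGS
  #define NODES_WIDTH     NODES_SHIFT
  #else
  #define NODES_WIDTH     0
  #define NODE_NOT_IN_PAGE_FLAGS  /* node id has to live outside page->flags */
  #endif

  /* LAST_CPUPID_WIDTH and LAST_CPUPID_NOT_IN_PAGE_FLAGS are derived the
     same way, and the result shapes struct page at compile time. */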


In my patchset I made X86_5LEVEL select SPARSEMEM_VMEMMAP. That breaks the
chain, and SECTIONS_WIDTH is never dynamic.

But how would this work with your design?

I can only think of a hack like making machine.physmem.sections.shift a
constant macro when we don't want it dynamic for the configuration, and
leaving SECTIONS_WIDTH as a constant in generic code.

To me it's ugly as hell.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [tip:x86/mm] mm/sparsemem: Allocate mem_section at runtime for CONFIG_SPARSEMEM_EXTREME=y
  2017-10-20 12:27   ` [tip:x86/mm] mm/sparsemem: Allocate mem_section at runtime for CONFIG_SPARSEMEM_EXTREME=y tip-bot for Kirill A. Shutemov
@ 2017-11-02 12:31     ` Sudeep Holla
  2017-11-02 13:34       ` Kirill A. Shutemov
  0 siblings, 1 reply; 76+ messages in thread
From: Sudeep Holla @ 2017-11-02 12:31 UTC (permalink / raw)
  To: Ingo Molnar, Thomas Gleixner, kirill.shutemov, Peter Zijlstra,
	hpa, Andrew Morton, gorcunov, luto, bp, open list, torvalds
  Cc: Will Deacon, Catalin Marinas

(+Will, Catalin)

On Fri, Oct 20, 2017 at 1:27 PM, tip-bot for Kirill A. Shutemov
<tipbot@zytor.com> wrote:
> Commit-ID:  83e3c48729d9ebb7af5a31a504f3fd6aff0348c4
> Gitweb:     https://git.kernel.org/tip/83e3c48729d9ebb7af5a31a504f3fd6aff0348c4
> Author:     Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> AuthorDate: Fri, 29 Sep 2017 17:08:16 +0300
> Committer:  Ingo Molnar <mingo@kernel.org>
> CommitDate: Fri, 20 Oct 2017 13:07:09 +0200
>
> mm/sparsemem: Allocate mem_section at runtime for CONFIG_SPARSEMEM_EXTREME=y
>
> Size of the mem_section[] array depends on the size of the physical address space.
>
> In preparation for boot-time switching between paging modes on x86-64
> we need to make the allocation of mem_section[] dynamic, because otherwise
> we waste a lot of RAM: with CONFIG_NODES_SHIFT=10, mem_section[] size is 32kB
> for 4-level paging and 2MB for 5-level paging mode.
>
> The patch allocates the array on the first call to sparse_memory_present_with_active_regions().
>

I am seeing a boot failure with this patch in linux-next today (20171102).

Unable to handle kernel NULL pointer dereference at virtual address 00000000
Mem abort info:
  ESR = 0x96000004
  Exception class = DABT (current EL), IL = 32 bits
  SET = 0, FnV = 0
  EA = 0, S1PTW = 0
Data abort info:
  ISV = 0, ISS = 0x00000004
  CM = 0, WnR = 0
[0000000000000000] user address but active_mm is swapper
Internal error: Oops: 96000004 [#1] PREEMPT SMP
Modules linked in:
CPU: 0 PID: 0 Comm: swapper Not tainted 4.14.0-rc7-next-20171102 #133
Hardware name: ARM Juno development board (r2) (DT)
task: ffff000008f82a80 task.stack: ffff000008f70000
pstate: 200000c5 (nzCv daIF -PAN -UAO)
pc : memory_present+0x5c/0xe8
lr : memory_present+0x34/0xe8
sp : ffff000008f73e90
x29: ffff000008f73e90 x28: 0000000080e60018
x27: 00000000fd9b8d18 x26: 0000000000000105
x25: 0000000000000000 x24: ffff0000090c4000
x23: 0000000000000000 x22: ffff0000090c4000
x21: 0000000000080000 x20: 0000000000000004
x19: 0000000000000000 x18: 0000000000000010
x17: 0000000000000001 x16: 0000000000000000
x15: ffffffffffffffff x14: ffff00008909a3af
x13: ffff00000909a3bd x12: ffff000008f79df0
x11: ffff000008590de8 x10: ffff000008f9c7f0
x9 : 0000000000000000 x8 : ffff80097ffccc80
x7 : 0000000000000000 x6 : 000000000000003f
x5 : ffff000008f79fc0 x4 : 0000000000000001
x3 : 0000001000000000 x2 : 00000000000e0000
x1 : 0000000000080000 x0 : 0000000000000000
Process swapper (pid: 0, stack limit = 0xffff000008f70000)
Call trace:
 memory_present+0x5c/0xe8
 bootmem_init+0x90/0x114
 setup_arch+0x190/0x4a0
 start_kernel+0x64/0x3a8
Code: 54000449 d35afeb3 f94032c0 d37df273 (f8736800)
random: get_random_bytes called from print_oops_end_marker+0x4c/0x68 with crng_init=0
---[ end trace 0000000000000000 ]---
Kernel panic - not syncing: Attempted to kill the idle task!
---[ end Kernel panic - not syncing: Attempted to kill the idle task!
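
(The fault above is consistent with memory_present() loading a root
pointer through a still-NULL mem_section: arm64's bootmem_init() calls
memory_present() directly and never passes through
sparse_memory_present_with_active_regions(), where the allocation was
added. A simplified, self-contained model of the failing access follows;
it is not the actual kernel code:)

#include <stddef.h>

struct mem_section { unsigned long section_mem_map; };	/* simplified */
struct mem_section **mem_section;	/* stays NULL on this boot path */

#define SECTIONS_PER_ROOT	128	/* PAGE_SIZE / sizeof(struct mem_section) */

struct mem_section *nr_to_section_model(unsigned long nr)
{
	/*
	 * mem_section[nr / SECTIONS_PER_ROOT] loads from NULL plus a
	 * small offset, matching the data abort at address 0 above.
	 */
	struct mem_section *root = mem_section[nr / SECTIONS_PER_ROOT];

	return root ? &root[nr % SECTIONS_PER_ROOT] : NULL;
}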

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [tip:x86/mm] mm/sparsemem: Allocate mem_section at runtime for CONFIG_SPARSEMEM_EXTREME=y
  2017-11-02 12:31     ` Sudeep Holla
@ 2017-11-02 13:34       ` Kirill A. Shutemov
  2017-11-02 13:42         ` Sudeep Holla
  0 siblings, 1 reply; 76+ messages in thread
From: Kirill A. Shutemov @ 2017-11-02 13:34 UTC (permalink / raw)
  To: Sudeep Holla
  Cc: Ingo Molnar, Thomas Gleixner, kirill.shutemov, Peter Zijlstra,
	hpa, Andrew Morton, gorcunov, luto, bp, open list, torvalds,
	Will Deacon, Catalin Marinas

On Thu, Nov 02, 2017 at 12:31:54PM +0000, Sudeep Holla wrote:
> (+Will, Catalin)
> 
> On Fri, Oct 20, 2017 at 1:27 PM, tip-bot for Kirill A. Shutemov
> <tipbot@zytor.com> wrote:
> > Commit-ID:  83e3c48729d9ebb7af5a31a504f3fd6aff0348c4
> > Gitweb:     https://git.kernel.org/tip/83e3c48729d9ebb7af5a31a504f3fd6aff0348c4
> > Author:     Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > AuthorDate: Fri, 29 Sep 2017 17:08:16 +0300
> > Committer:  Ingo Molnar <mingo@kernel.org>
> > CommitDate: Fri, 20 Oct 2017 13:07:09 +0200
> >
> > mm/sparsemem: Allocate mem_section at runtime for CONFIG_SPARSEMEM_EXTREME=y
> >
> > Size of the mem_section[] array depends on the size of the physical address space.
> >
> > In preparation for boot-time switching between paging modes on x86-64
> > we need to make the allocation of mem_section[] dynamic, because otherwise
> > we waste a lot of RAM: with CONFIG_NODES_SHIFT=10, mem_section[] size is 32kB
> > for 4-level paging and 2MB for 5-level paging mode.
> >
> > The patch allocates the array on the first call to sparse_memory_present_with_active_regions().
> >
> 
> I am seeing a boot failure with this patch in linux-next today (20171102)

Could you share the kernel config?

Have you bisected the failure to the commit?

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [tip:x86/mm] mm/sparsemem: Allocate mem_section at runtime for CONFIG_SPARSEMEM_EXTREME=y
  2017-11-02 13:34       ` Kirill A. Shutemov
@ 2017-11-02 13:42         ` Sudeep Holla
  2017-11-02 14:12           ` Kirill A. Shutemov
  0 siblings, 1 reply; 76+ messages in thread
From: Sudeep Holla @ 2017-11-02 13:42 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Sudeep Holla, Ingo Molnar, Thomas Gleixner, kirill.shutemov,
	Peter Zijlstra, hpa, Andrew Morton, gorcunov, luto, bp,
	open list, torvalds, Will Deacon, Catalin Marinas



On 02/11/17 13:34, Kirill A. Shutemov wrote:
> On Thu, Nov 02, 2017 at 12:31:54PM +0000, Sudeep Holla wrote:
>> (+Will, Catalin)
>>
>> On Fri, Oct 20, 2017 at 1:27 PM, tip-bot for Kirill A. Shutemov
>> <tipbot@zytor.com> wrote:
>>> Commit-ID:  83e3c48729d9ebb7af5a31a504f3fd6aff0348c4
>>> Gitweb:     https://git.kernel.org/tip/83e3c48729d9ebb7af5a31a504f3fd6aff0348c4
>>> Author:     Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
>>> AuthorDate: Fri, 29 Sep 2017 17:08:16 +0300
>>> Committer:  Ingo Molnar <mingo@kernel.org>
>>> CommitDate: Fri, 20 Oct 2017 13:07:09 +0200
>>>
>>> mm/sparsemem: Allocate mem_section at runtime for CONFIG_SPARSEMEM_EXTREME=y
>>>
>>> Size of the mem_section[] array depends on the size of the physical address space.
>>>
>>> In preparation for boot-time switching between paging modes on x86-64
>>> we need to make the allocation of mem_section[] dynamic, because otherwise
>>> we waste a lot of RAM: with CONFIG_NODES_SHIFT=10, mem_section[] size is 32kB
>>> for 4-level paging and 2MB for 5-level paging mode.
>>>
>>> The patch allocates the array on the first call to sparse_memory_present_with_active_regions().
>>>
>>
>> I am seeing a boot failure with this patch in linux-next today (20171102)
> 
> Could you share the kernel config?
> 

It's the default config on arm64. The generated file is almost 160kB; I will
send it to you off-list.

> Have you bisected the failure to the commit?
> 
I just reverted this commit, as I suspected it was the cause, and it boots
fine after the revert.

-- 
Regards,
Sudeep

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [tip:x86/mm] mm/sparsemem: Allocate mem_section at runtime for CONFIG_SPARSEMEM_EXTREME=y
  2017-11-02 13:42         ` Sudeep Holla
@ 2017-11-02 14:12           ` Kirill A. Shutemov
  2017-11-02 15:07             ` Sudeep Holla
                               ` (2 more replies)
  0 siblings, 3 replies; 76+ messages in thread
From: Kirill A. Shutemov @ 2017-11-02 14:12 UTC (permalink / raw)
  To: Sudeep Holla
  Cc: Kirill A. Shutemov, Ingo Molnar, Thomas Gleixner, Peter Zijlstra,
	hpa, Andrew Morton, gorcunov, luto, bp, open list, torvalds,
	Will Deacon, Catalin Marinas

On Thu, Nov 02, 2017 at 01:42:42PM +0000, Sudeep Holla wrote:
> 
> 
> On 02/11/17 13:34, Kirill A. Shutemov wrote:
> > On Thu, Nov 02, 2017 at 12:31:54PM +0000, Sudeep Holla wrote:
> >> (+Will, Catalin)
> >>
> >> On Fri, Oct 20, 2017 at 1:27 PM, tip-bot for Kirill A. Shutemov
> >> <tipbot@zytor.com> wrote:
> >>> Commit-ID:  83e3c48729d9ebb7af5a31a504f3fd6aff0348c4
> >>> Gitweb:     https://git.kernel.org/tip/83e3c48729d9ebb7af5a31a504f3fd6aff0348c4
> >>> Author:     Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> >>> AuthorDate: Fri, 29 Sep 2017 17:08:16 +0300
> >>> Committer:  Ingo Molnar <mingo@kernel.org>
> >>> CommitDate: Fri, 20 Oct 2017 13:07:09 +0200
> >>>
> >>> mm/sparsemem: Allocate mem_section at runtime for CONFIG_SPARSEMEM_EXTREME=y
> >>>
> >>> Size of the mem_section[] array depends on the size of the physical address space.
> >>>
> >>> In preparation for boot-time switching between paging modes on x86-64
> >>> we need to make the allocation of mem_section[] dynamic, because otherwise
> >>> we waste a lot of RAM: with CONFIG_NODES_SHIFT=10, mem_section[] size is 32kB
> >>> for 4-level paging and 2MB for 5-level paging mode.
> >>>
> >>> The patch allocates the array on the first call to sparse_memory_present_with_active_regions().
> >>>
> >>
> >> I am seeing a boot failure with this patch in linux-next today (20171102)
> > 
> > Could you share the kernel config?
> > 
> 
> It's the default config on arm64. The generated file is almost 160kB; I will
> send it to you off-list.
> 
> > Have you bisected the failure to the commit?
> > 
> I just reverted this commit, as I suspected it was the cause, and it boots
> fine after the revert.

Could you try the patch below instead?

From 4a9d843f9d939d958612b0079ebe5743f265e1e0 Mon Sep 17 00:00:00 2001
From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Date: Thu, 2 Nov 2017 17:02:29 +0300
Subject: [PATCH] mm, sparse: Fix boot on arm64

Since 83e3c48729d9 ("mm/sparsemem: Allocate mem_section at runtime for
CONFIG_SPARSEMEM_EXTREME=y") we allocate mem_section dynamically in
sparse_memory_present_with_active_regions(). But some architectures, like
arm64, don't use the routine to initialize sparsemem.

Let's move the initialization into memory_present(); it should cover all
architectures.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Fixes: 83e3c48729d9 ("mm/sparsemem: Allocate mem_section at runtime for CONFIG_SPARSEMEM_EXTREME=y")
---
 mm/page_alloc.c | 10 ----------
 mm/sparse.c     | 10 ++++++++++
 2 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 8dfd13f724d9..77e4d3c5c57b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5646,16 +5646,6 @@ void __init sparse_memory_present_with_active_regions(int nid)
 	unsigned long start_pfn, end_pfn;
 	int i, this_nid;
 
-#ifdef CONFIG_SPARSEMEM_EXTREME
-	if (!mem_section) {
-		unsigned long size, align;
-
-		size = sizeof(struct mem_section) * NR_SECTION_ROOTS;
-		align = 1 << (INTERNODE_CACHE_SHIFT);
-		mem_section = memblock_virt_alloc(size, align);
-	}
-#endif
-
 	for_each_mem_pfn_range(i, nid, &start_pfn, &end_pfn, &this_nid)
 		memory_present(this_nid, start_pfn, end_pfn);
 }
diff --git a/mm/sparse.c b/mm/sparse.c
index b00a97398795..d294148ba395 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -206,6 +206,16 @@ void __init memory_present(int nid, unsigned long start, unsigned long end)
 {
 	unsigned long pfn;
 
+#ifdef CONFIG_SPARSEMEM_EXTREME
+	if (unlikely(!mem_section)) {
+		unsigned long size, align;
+
+		size = sizeof(struct mem_section) * NR_SECTION_ROOTS;
+		align = 1 << (INTERNODE_CACHE_SHIFT);
+		mem_section = memblock_virt_alloc(size, align);
+	}
+#endif
+
 	start &= PAGE_SECTION_MASK;
 	mminit_validate_memmodel_limits(&start, &end);
 	for (pfn = start; pfn < end; pfn += PAGES_PER_SECTION) {
-- 
 Kirill A. Shutemov
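
(To spell out why this placement covers everyone: in the arm64 backtrace
earlier in the thread, memory_present() is reached from bootmem_init()
without ever going through sparse_memory_present_with_active_regions().
A toy model of the fix, with stand-in names; calloc() here is only a
placeholder for memblock_virt_alloc(), and none of this is the actual
kernel code:)

#include <stdlib.h>

static void **roots;	/* models the SPARSEMEM_EXTREME mem_section root table */

static void memory_present_model(void)
{
	/*
	 * Lazy first-call allocation; safe without locking because
	 * early boot runs single-threaded.
	 */
	if (!roots)
		roots = calloc(4096, sizeof(*roots));
	/* ... mark the sections in this range as present ... */
}

/* Generic path (x86 and friends): the helper loops over memory ranges. */
static void sparse_memory_present_with_active_regions_model(void)
{
	memory_present_model();
}

/* arm64 path from the backtrace: bootmem_init() calls the callee directly. */
static void bootmem_init_model(void)
{
	memory_present_model();
}

int main(void)
{
	/* Either entry point now finds the root table allocated. */
	sparse_memory_present_with_active_regions_model();
	bootmem_init_model();
	return 0;
}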

^ permalink raw reply related	[flat|nested] 76+ messages in thread

* Re: [tip:x86/mm] mm/sparsemem: Allocate mem_section at runtime for CONFIG_SPARSEMEM_EXTREME=y
  2017-11-02 14:12           ` Kirill A. Shutemov
@ 2017-11-02 15:07             ` Sudeep Holla
  2017-11-02 15:37             ` Thierry Reding
  2017-11-06 19:00             ` Bjorn Andersson
  2 siblings, 0 replies; 76+ messages in thread
From: Sudeep Holla @ 2017-11-02 15:07 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Sudeep Holla, Kirill A. Shutemov, Ingo Molnar, Thomas Gleixner,
	Peter Zijlstra, hpa, Andrew Morton, gorcunov, luto, bp,
	open list, torvalds, Will Deacon, Catalin Marinas



On 02/11/17 14:12, Kirill A. Shutemov wrote:
> On Thu, Nov 02, 2017 at 01:42:42PM +0000, Sudeep Holla wrote:
>>
>>
>> On 02/11/17 13:34, Kirill A. Shutemov wrote:
>>> On Thu, Nov 02, 2017 at 12:31:54PM +0000, Sudeep Holla wrote:
>>>> (+Will, Catalin)
>>>>
>>>> On Fri, Oct 20, 2017 at 1:27 PM, tip-bot for Kirill A. Shutemov
>>>> <tipbot@zytor.com> wrote:
>>>>> Commit-ID:  83e3c48729d9ebb7af5a31a504f3fd6aff0348c4
>>>>> Gitweb:     https://git.kernel.org/tip/83e3c48729d9ebb7af5a31a504f3fd6aff0348c4
>>>>> Author:     Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
>>>>> AuthorDate: Fri, 29 Sep 2017 17:08:16 +0300
>>>>> Committer:  Ingo Molnar <mingo@kernel.org>
>>>>> CommitDate: Fri, 20 Oct 2017 13:07:09 +0200
>>>>>
>>>>> mm/sparsemem: Allocate mem_section at runtime for CONFIG_SPARSEMEM_EXTREME=y
>>>>>
>>>>> Size of the mem_section[] array depends on the size of the physical address space.
>>>>>
>>>>> In preparation for boot-time switching between paging modes on x86-64
>>>>> we need to make the allocation of mem_section[] dynamic, because otherwise
>>>>> we waste a lot of RAM: with CONFIG_NODES_SHIFT=10, mem_section[] size is 32kB
>>>>> for 4-level paging and 2MB for 5-level paging mode.
>>>>>
>>>>> The patch allocates the array on the first call to sparse_memory_present_with_active_regions().
>>>>>
>>>>
>>>> I am seeing a boot failure with this patch in linux-next today (20171102)
>>>
>>> Could you share the kernel config?
>>>
>>
>> It's the default config on arm64. The generated file is almost 160kB; I will
>> send it to you off-list.
>>
>>> Have you bisected the failure to the commit?
>>>
>> I just reverted this commit, as I suspected it was the cause, and it boots
>> fine after the revert.
> 
> Could you try the patch below instead?
> 
> From 4a9d843f9d939d958612b0079ebe5743f265e1e0 Mon Sep 17 00:00:00 2001
> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> Date: Thu, 2 Nov 2017 17:02:29 +0300
> Subject: [PATCH] mm, sparse: Fix boot on arm64
> 
> Since 83e3c48729d9 ("mm/sparsemem: Allocate mem_section at runtime for
> CONFIG_SPARSEMEM_EXTREME=y") we allocate mem_section dynamically in
> sparse_memory_present_with_active_regions(). But some architectures, like
> arm64, don't use the routine to initialize sparsemem.
> 
> Let's move the initialization into memory_present(); it should cover all
> architectures.
> 

Thanks for the quick fix. It boots fine with this patch.

Reported-and-tested-by: Sudeep Holla <sudeep.holla@arm.com>

-- 
Regards,
Sudeep

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [tip:x86/mm] mm/sparsemem: Allocate mem_section at runtime for CONFIG_SPARSEMEM_EXTREME=y
  2017-11-02 14:12           ` Kirill A. Shutemov
  2017-11-02 15:07             ` Sudeep Holla
@ 2017-11-02 15:37             ` Thierry Reding
  2017-11-06 19:00             ` Bjorn Andersson
  2 siblings, 0 replies; 76+ messages in thread
From: Thierry Reding @ 2017-11-02 15:37 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Sudeep Holla, Kirill A. Shutemov, Ingo Molnar, Thomas Gleixner,
	Peter Zijlstra, hpa, Andrew Morton, gorcunov, luto, bp,
	open list, torvalds, Will Deacon, Catalin Marinas

On Thu, Nov 02, 2017 at 05:12:11PM +0300, Kirill A. Shutemov wrote:
> On Thu, Nov 02, 2017 at 01:42:42PM +0000, Sudeep Holla wrote:
> > 
> > 
> > On 02/11/17 13:34, Kirill A. Shutemov wrote:
> > > On Thu, Nov 02, 2017 at 12:31:54PM +0000, Sudeep Holla wrote:
> > >> (+Will, Catalin)
> > >>
> > >> On Fri, Oct 20, 2017 at 1:27 PM, tip-bot for Kirill A. Shutemov
> > >> <tipbot@zytor.com> wrote:
> > >>> Commit-ID:  83e3c48729d9ebb7af5a31a504f3fd6aff0348c4
> > >>> Gitweb:     https://git.kernel.org/tip/83e3c48729d9ebb7af5a31a504f3fd6aff0348c4
> > >>> Author:     Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > >>> AuthorDate: Fri, 29 Sep 2017 17:08:16 +0300
> > >>> Committer:  Ingo Molnar <mingo@kernel.org>
> > >>> CommitDate: Fri, 20 Oct 2017 13:07:09 +0200
> > >>>
> > >>> mm/sparsemem: Allocate mem_section at runtime for CONFIG_SPARSEMEM_EXTREME=y
> > >>>
> > >>> Size of the mem_section[] array depends on the size of the physical address space.
> > >>>
> > >>> In preparation for boot-time switching between paging modes on x86-64
> > >>> we need to make the allocation of mem_section[] dynamic, because otherwise
> > >>> we waste a lot of RAM: with CONFIG_NODES_SHIFT=10, mem_section[] size is 32kB
> > >>> for 4-level paging and 2MB for 5-level paging mode.
> > >>>
> > >>> The patch allocates the array on the first call to sparse_memory_present_with_active_regions().
> > >>>
> > >>
> > >> I am seeing a boot failure with this patch in linux-next today (20171102)
> > > 
> > > Could you share the kernel config?
> > > 
> > 
> > It's the default config on arm64. The generated file is almost 160kB; I will
> > send it to you off-list.
> > 
> > > Have you bisected the failure to the commit?
> > > 
> > I just reverted this commit, as I suspected it was the cause, and it boots
> > fine after the revert.
> 
> Could you try the patch below instead?
> 
> From 4a9d843f9d939d958612b0079ebe5743f265e1e0 Mon Sep 17 00:00:00 2001
> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> Date: Thu, 2 Nov 2017 17:02:29 +0300
> Subject: [PATCH] mm, sparse: Fix boot on arm64
> 
> Since 83e3c48729d9 ("mm/sparsemem: Allocate mem_section at runtime for
> CONFIG_SPARSEMEM_EXTREME=y") we allocate mem_section dynamically in
> sparse_memory_present_with_active_regions(). But some architectures, like
> arm64, don't use the routine to initialize sparsemem.
> 
> Let's move the initialization into memory_present(); it should cover all
> architectures.
> 
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Fixes: 83e3c48729d9 ("mm/sparsemem: Allocate mem_section at runtime for CONFIG_SPARSEMEM_EXTREME=y")
> ---
>  mm/page_alloc.c | 10 ----------
>  mm/sparse.c     | 10 ++++++++++
>  2 files changed, 10 insertions(+), 10 deletions(-)

I can also confirm that this restores booting on 64-bit ARM (Tegra186,
Jetson TX2, specifically):

Tested-by: Thierry Reding <treding@nvidia.com>

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [tip:x86/mm] mm/sparsemem: Allocate mem_section at runtime for CONFIG_SPARSEMEM_EXTREME=y
  2017-11-02 14:12           ` Kirill A. Shutemov
  2017-11-02 15:07             ` Sudeep Holla
  2017-11-02 15:37             ` Thierry Reding
@ 2017-11-06 19:00             ` Bjorn Andersson
  2017-11-07  1:15               ` Will Deacon
  2 siblings, 1 reply; 76+ messages in thread
From: Bjorn Andersson @ 2017-11-06 19:00 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Sudeep Holla, Kirill A. Shutemov, Ingo Molnar, Thomas Gleixner,
	Peter Zijlstra, hpa, Andrew Morton, gorcunov, Andy Lutomirski,
	bp, open list, Linus Torvalds, Will Deacon, Catalin Marinas

On Thu, Nov 2, 2017 at 7:12 AM, Kirill A. Shutemov
<kirill.shutemov@linux.intel.com> wrote:
[..]
> Could you try the patch below instead?
>
> From 4a9d843f9d939d958612b0079ebe5743f265e1e0 Mon Sep 17 00:00:00 2001
> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> Date: Thu, 2 Nov 2017 17:02:29 +0300
> Subject: [PATCH] mm, sparse: Fix boot on arm64
>
> Since 83e3c48729d9 ("mm/sparsemem: Allocate mem_section at runtime for
> CONFIG_SPARSEMEM_EXTREME=y") we allocate mem_section dynamically in
> sparse_memory_present_with_active_regions(). But some architectures, like
> arm64, don't use the routine to initialize sparsemem.
>
> Let's move the initialization into memory_present(); it should cover all
> architectures.
>
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Fixes: 83e3c48729d9 ("mm/sparsemem: Allocate mem_section at runtime for CONFIG_SPARSEMEM_EXTREME=y")

Will you send out this patch, or will someone pick it up from here?

As with the other arm64 boards, this is the difference between
linux-next (and presumably v4.15-rc1) booting or not on my Qualcomm
boards.

Tested-by: Bjorn Andersson <bjorn.andersson@linaro.org>

Regards,
Bjorn

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [tip:x86/mm] mm/sparsemem: Allocate mem_section at runtime for CONFIG_SPARSEMEM_EXTREME=y
  2017-11-06 19:00             ` Bjorn Andersson
@ 2017-11-07  1:15               ` Will Deacon
  0 siblings, 0 replies; 76+ messages in thread
From: Will Deacon @ 2017-11-07  1:15 UTC (permalink / raw)
  To: Bjorn Andersson
  Cc: Kirill A. Shutemov, Sudeep Holla, Kirill A. Shutemov,
	Ingo Molnar, Thomas Gleixner, Peter Zijlstra, hpa, Andrew Morton,
	gorcunov, Andy Lutomirski, bp, open list, Linus Torvalds,
	Catalin Marinas

On Mon, Nov 06, 2017 at 11:00:09AM -0800, Bjorn Andersson wrote:
> On Thu, Nov 2, 2017 at 7:12 AM, Kirill A. Shutemov
> <kirill.shutemov@linux.intel.com> wrote:
> [..]
> > Could you try the patch below instead?
> >
> > From 4a9d843f9d939d958612b0079ebe5743f265e1e0 Mon Sep 17 00:00:00 2001
> > From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> > Date: Thu, 2 Nov 2017 17:02:29 +0300
> > Subject: [PATCH] mm, sparse: Fix boot on arm64
> >
> > Since 83e3c48729d9 ("mm/sparsemem: Allocate mem_section at runtime for
> > CONFIG_SPARSEMEM_EXTREME=y") we allocate mem_section dynamically in
> > sparse_memory_present_with_active_regions(). But some architectures, like
> > arm64, don't use the routine to initialize sparsemem.
> >
> > Let's move the initialization into memory_present(); it should cover all
> > architectures.
> >
> > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > Fixes: 83e3c48729d9 ("mm/sparsemem: Allocate mem_section at runtime for CONFIG_SPARSEMEM_EXTREME=y")
> 
> Will you send out this patch, or will someone pick it up from here?
> 
> As with the other arm64 boards this is the difference between
> linux-next (and presumably v4.15-rc1) booting or not on my Qualcomm
> boards.
> 
> Tested-by: Bjorn Andersson <bjorn.andersson@linaro.org>

Yes, please can somebody get this into -next asap? I can't take it via
arm64, since this code isn't present there.

If you need it:

Acked-by: Will Deacon <will.deacon@arm.com>

Will

^ permalink raw reply	[flat|nested] 76+ messages in thread

end of thread, other threads:[~2017-11-07  1:15 UTC | newest]

Thread overview: 76+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-09-29 14:08 [PATCH 0/6] Boot-time switching between 4- and 5-level paging for 4.15, Part 1 Kirill A. Shutemov
2017-09-29 14:08 ` Kirill A. Shutemov
2017-09-29 14:08 ` [PATCH 1/6] mm/sparsemem: Allocate mem_section at runtime for SPARSEMEM_EXTREME Kirill A. Shutemov
2017-09-29 14:08   ` Kirill A. Shutemov
2017-10-20 12:27   ` [tip:x86/mm] mm/sparsemem: Allocate mem_section at runtime for CONFIG_SPARSEMEM_EXTREME=y tip-bot for Kirill A. Shutemov
2017-11-02 12:31     ` Sudeep Holla
2017-11-02 13:34       ` Kirill A. Shutemov
2017-11-02 13:42         ` Sudeep Holla
2017-11-02 14:12           ` Kirill A. Shutemov
2017-11-02 15:07             ` Sudeep Holla
2017-11-02 15:37             ` Thierry Reding
2017-11-06 19:00             ` Bjorn Andersson
2017-11-07  1:15               ` Will Deacon
2017-09-29 14:08 ` [PATCH 2/6] mm/zsmalloc: Prepare to variable MAX_PHYSMEM_BITS Kirill A. Shutemov
2017-09-29 14:08   ` Kirill A. Shutemov
2017-10-14  0:00   ` Nitin Gupta
2017-10-14  0:00     ` Nitin Gupta
2017-10-16 14:44     ` Kirill A. Shutemov
2017-10-16 14:44       ` Kirill A. Shutemov
2017-10-18 23:39       ` Nitin Gupta
2017-10-18 23:39         ` Nitin Gupta
2017-09-29 14:08 ` [PATCH 3/6] x86/kasan: Use the same shadow offset for 4- and 5-level paging Kirill A. Shutemov
2017-09-29 14:08   ` Kirill A. Shutemov
2017-10-20 12:28   ` [tip:x86/mm] " tip-bot for Andrey Ryabinin
2017-09-29 14:08 ` [PATCH 4/6] x86/xen: Provide pre-built page tables only for XEN_PV and XEN_PVH Kirill A. Shutemov
2017-09-29 14:08   ` Kirill A. Shutemov
2017-10-20 12:28   ` [tip:x86/mm] x86/xen: Provide pre-built page tables only for CONFIG_XEN_PV=y and CONFIG_XEN_PVH=y tip-bot for Kirill A. Shutemov
2017-09-29 14:08 ` [PATCH 5/6] x86/xen: Drop 5-level paging support code from XEN_PV code Kirill A. Shutemov
2017-09-29 14:08   ` Kirill A. Shutemov
2017-10-20 12:29   ` [tip:x86/mm] x86/xen: Drop 5-level paging support code from the " tip-bot for Kirill A. Shutemov
2017-09-29 14:08 ` [PATCH 6/6] x86/boot/compressed/64: Detect and handle 5-level paging at boot-time Kirill A. Shutemov
2017-09-29 14:08   ` Kirill A. Shutemov
2017-10-03  8:27 ` [PATCH 0/6] Boot-time switching between 4- and 5-level paging for 4.15, Part 1 Kirill A. Shutemov
2017-10-03  8:27   ` Kirill A. Shutemov
2017-10-17 15:42   ` Kirill A. Shutemov
2017-10-17 15:42     ` Kirill A. Shutemov
2017-10-20  8:18     ` Ingo Molnar
2017-10-20  8:18       ` Ingo Molnar
2017-10-20  9:41       ` Kirill A. Shutemov
2017-10-20  9:41         ` Kirill A. Shutemov
2017-10-20 15:23         ` Ingo Molnar
2017-10-20 15:23           ` Ingo Molnar
2017-10-20 16:23           ` Kirill A. Shutemov
2017-10-20 16:23             ` Kirill A. Shutemov
2017-10-23 11:56             ` Ingo Molnar
2017-10-23 11:56               ` Ingo Molnar
2017-10-23 12:21               ` Kirill A. Shutemov
2017-10-23 12:21                 ` Kirill A. Shutemov
2017-10-23 12:40                 ` Ingo Molnar
2017-10-23 12:40                   ` Ingo Molnar
2017-10-23 12:48                   ` Kirill A. Shutemov
2017-10-23 12:48                     ` Kirill A. Shutemov
2017-10-24  9:40                     ` Ingo Molnar
2017-10-24  9:40                       ` Ingo Molnar
2017-10-24 11:38                       ` Kirill A. Shutemov
2017-10-24 11:38                         ` Kirill A. Shutemov
2017-10-24 12:47                         ` Ingo Molnar
2017-10-24 12:47                           ` Ingo Molnar
2017-10-24 13:12                           ` Kirill A. Shutemov
2017-10-24 13:12                             ` Kirill A. Shutemov
2017-10-26  7:37                             ` Ingo Molnar
2017-10-26  7:37                               ` Ingo Molnar
2017-10-26 14:40                               ` Kirill A. Shutemov
2017-10-26 14:40                                 ` Kirill A. Shutemov
2017-10-31  9:47                                 ` Ingo Molnar
2017-10-31  9:47                                   ` Ingo Molnar
2017-10-31 12:04                                   ` Kirill A. Shutemov
2017-10-31 12:04                                     ` Kirill A. Shutemov
2017-10-20  9:49       ` Minchan Kim
2017-10-20  9:49         ` Minchan Kim
2017-10-20 12:18         ` Kirill A. Shutemov
2017-10-20 12:18           ` Kirill A. Shutemov
2017-10-24 11:32     ` hpa
2017-10-24 11:32       ` hpa
2017-10-24 11:43       ` Kirill A. Shutemov
2017-10-24 11:43         ` Kirill A. Shutemov
