* [PATCH 0/5] Memory hotplug support for arm64 - complete patchset
@ 2017-04-11 14:54 Andrea Reale
  2017-04-11 14:54 ` [PATCH 1/5] arm64: memory-hotplug: Add MEMORY_HOTPLUG, MEMORY_HOTREMOVE, MEMORY_PROBE Andrea Reale
                   ` (4 more replies)
  0 siblings, 5 replies; 18+ messages in thread
From: Andrea Reale @ 2017-04-11 14:54 UTC (permalink / raw)
  To: linux-arm-kernel
  Cc: m.bielski, ar, scott.branden, will.deacon, qiuxishi, f.fainelli,
	linux-kernel

Hi all,

this is a follow-up to the patch we previously released in [1].
We are publishing new memory hot-remove support for arm64 memory hotplug,
which complements our initial hot-add patch.

For convenience, we are reposting in a single thread all five patches
that make up the full memory hotplug support:
- The initial patchset by Scott Branden, originally released in [2]
  (2 patches).
- A second version of our hot-add patch, originally released in [1]
  (1 patch).
- New hot-remove support complementing the hot-add patch. (2 patches).

The patches should apply cleanly on Linux 4.11-rc6
(commit: 39da7c509acff13fc8cb12ec1bb20337c988ed36).

We have performed tests and experiments on physical and emulated boards
and we are collecting results (along with more information) at the public
Web page at [3].

Any comments or feedback are, as usual, very much appreciated.

[1] https://lkml.org/lkml/2016/12/14/188
[2] https://lkml.org/lkml/2016/12/1/811
[3] https://hotplug-tests.eu-gb.mybluemix.net

Andrea Reale (2):
  Hot-remove implementation for arm64
  Add "remove" probe driver for memory hot-remove

Maciej Bielski (1):
  Memory hotplug support for arm64 platform (v2)

Scott Branden (2):
  arm64: memory-hotplug: Add MEMORY_HOTPLUG, MEMORY_HOTREMOVE,
    MEMORY_PROBE
  arm64: defconfig: enable MEMORY_HOTPLUG config options

 arch/arm64/Kconfig               |  11 ++
 arch/arm64/configs/defconfig     |   3 +
 arch/arm64/include/asm/mmu.h     |   7 +
 arch/arm64/include/asm/pgtable.h |  15 ++
 arch/arm64/mm/init.c             | 113 +++++++++++
 arch/arm64/mm/mmu.c              | 417 +++++++++++++++++++++++++++++++++++++++
 drivers/base/memory.c            |  34 +++-
 include/linux/memblock.h         |   1 +
 mm/memblock.c                    |  10 +
 9 files changed, 610 insertions(+), 1 deletion(-)

-- 
1.9.1


* [PATCH 1/5] arm64: memory-hotplug: Add MEMORY_HOTPLUG, MEMORY_HOTREMOVE, MEMORY_PROBE
  2017-04-11 14:54 [PATCH 0/5] Memory hotplug support for arm64 - complete patchset Andrea Reale
@ 2017-04-11 14:54 ` Andrea Reale
  2017-04-12  0:20   ` kbuild test robot
  2017-04-11 14:54 ` [PATCH 2/5] arm64: defconfig: enable MEMORY_HOTPLUG config options Andrea Reale
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 18+ messages in thread
From: Andrea Reale @ 2017-04-11 14:54 UTC (permalink / raw)
  To: linux-arm-kernel
  Cc: m.bielski, ar, scott.branden, will.deacon, qiuxishi, f.fainelli,
	linux-kernel

From: Scott Branden <scott.branden@broadcom.com>

Add memory hotplug support for the ARM64 platform.

This requires the addition of the ARCH_ENABLE_MEMORY_HOTPLUG and
ARCH_ENABLE_MEMORY_HOTREMOVE config options.

The MEMORY_PROBE config option is added to support the
/sys/devices/system/memory/probe functionality.

In addition, the architecture-specific arch_add_memory and
arch_remove_memory memory management functions are added.

Signed-off-by: Scott Branden <scott.branden@broadcom.com>
---
 arch/arm64/Kconfig   | 10 ++++++++++
 arch/arm64/mm/init.c | 42 ++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 52 insertions(+)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 3741859..d930f73 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -620,6 +620,12 @@ config HOTPLUG_CPU
 	  Say Y here to experiment with turning CPUs off and on.  CPUs
 	  can be controlled through /sys/devices/system/cpu.
 
+config ARCH_ENABLE_MEMORY_HOTPLUG
+	def_bool y
+
+config ARCH_ENABLE_MEMORY_HOTREMOVE
+	def_bool y
+
 # Common NUMA Features
 config NUMA
 	bool "Numa Memory Allocation and Scheduler Support"
@@ -694,6 +700,10 @@ config ARCH_HAS_CACHE_LINE_SIZE
 
 source "mm/Kconfig"
 
+config ARCH_MEMORY_PROBE
+	def_bool y
+	depends on MEMORY_HOTPLUG
+
 config SECCOMP
 	bool "Enable seccomp to safely compute untrusted bytecode"
 	---help---
diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
index e19e065..4dcb8f7 100644
--- a/arch/arm64/mm/init.c
+++ b/arch/arm64/mm/init.c
@@ -541,3 +541,45 @@ static int __init register_mem_limit_dumper(void)
 	return 0;
 }
 __initcall(register_mem_limit_dumper);
+
+#ifdef CONFIG_MEMORY_HOTPLUG
+int arch_add_memory(int nid, u64 start, u64 size, bool for_device)
+{
+	pg_data_t *pgdat;
+	struct zone *zone;
+	unsigned long start_pfn = start >> PAGE_SHIFT;
+	unsigned long nr_pages = size >> PAGE_SHIFT;
+	int ret;
+
+	pgdat = NODE_DATA(nid);
+
+	zone = pgdat->node_zones +
+		zone_for_memory(nid, start, size, ZONE_NORMAL, for_device);
+	ret = __add_pages(nid, zone, start_pfn, nr_pages);
+
+	if (ret)
+		pr_warn("%s: Problem encountered in __add_pages() ret=%d\n",
+			__func__, ret);
+
+	return ret;
+}
+
+#ifdef CONFIG_MEMORY_HOTREMOVE
+int arch_remove_memory(u64 start, u64 size)
+{
+	unsigned long start_pfn = start >> PAGE_SHIFT;
+	unsigned long nr_pages = size >> PAGE_SHIFT;
+	struct zone *zone;
+	int ret;
+
+	zone = page_zone(pfn_to_page(start_pfn));
+	ret = __remove_pages(zone, start_pfn, nr_pages);
+	if (ret)
+		pr_warn("%s: Problem encountered in __remove_pages() ret=%d\n",
+			__func__, ret);
+
+	return ret;
+}
+#endif
+#endif
+
-- 
1.9.1


* [PATCH 2/5] arm64: defconfig: enable MEMORY_HOTPLUG config options
  2017-04-11 14:54 [PATCH 0/5] Memory hotplug support for arm64 - complete patchset Andrea Reale
  2017-04-11 14:54 ` [PATCH 1/5] arm64: memory-hotplug: Add MEMORY_HOTPLUG, MEMORY_HOTREMOVE, MEMORY_PROBE Andrea Reale
@ 2017-04-11 14:54 ` Andrea Reale
  2017-04-11 14:55 ` [PATCH 3/5] Memory hotplug support for arm64 platform (v2) Andrea Reale
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 18+ messages in thread
From: Andrea Reale @ 2017-04-11 14:54 UTC (permalink / raw)
  To: linux-arm-kernel
  Cc: m.bielski, ar, scott.branden, will.deacon, qiuxishi, f.fainelli,
	linux-kernel

From: Scott Branden <scott.branden@broadcom.com>

Enable memory hotplug config options to add/remove memory.

Signed-off-by: Scott Branden <scott.branden@broadcom.com>
---
 arch/arm64/configs/defconfig | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/arm64/configs/defconfig b/arch/arm64/configs/defconfig
index 7c48028..80f48b0 100644
--- a/arch/arm64/configs/defconfig
+++ b/arch/arm64/configs/defconfig
@@ -76,6 +76,9 @@ CONFIG_ARM64_VA_BITS_48=y
 CONFIG_SCHED_MC=y
 CONFIG_NUMA=y
 CONFIG_PREEMPT=y
+CONFIG_MEMORY_HOTPLUG=y
+CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE=y
+CONFIG_MEMORY_HOTREMOVE=y
 CONFIG_KSM=y
 CONFIG_TRANSPARENT_HUGEPAGE=y
 CONFIG_CMA=y
-- 
1.9.1


* [PATCH 3/5] Memory hotplug support for arm64 platform (v2)
  2017-04-11 14:54 [PATCH 0/5] Memory hotplug support for arm64 - complete patchset Andrea Reale
  2017-04-11 14:54 ` [PATCH 1/5] arm64: memory-hotplug: Add MEMORY_HOTPLUG, MEMORY_HOTREMOVE, MEMORY_PROBE Andrea Reale
  2017-04-11 14:54 ` [PATCH 2/5] arm64: defconfig: enable MEMORY_HOTPLUG config options Andrea Reale
@ 2017-04-11 14:55 ` Andrea Reale
  2017-04-11 15:58   ` Mark Rutland
  2017-04-11 14:55 ` [PATCH 4/5] Hot-remove implementation for arm64 Andrea Reale
  2017-04-11 14:56 ` [PATCH 5/5] Add "remove" probe driver for memory hot-remove Andrea Reale
  4 siblings, 1 reply; 18+ messages in thread
From: Andrea Reale @ 2017-04-11 14:55 UTC (permalink / raw)
  To: linux-arm-kernel
  Cc: m.bielski, ar, scott.branden, will.deacon, qiuxishi, f.fainelli,
	linux-kernel

From: Maciej Bielski <m.bielski@virtualopensystems.com>

This is a second and improved version of the patch previously released
in [3].

It builds on the work by Scott Branden [2] and, therefore, needs to be
applied on top of Scott's patches [2].
Comments are very welcome.

Changes from the original patchset and known issues:

- Compared to Scott's original patchset, this work adds the mapping of
  the new hotplugged pages into the kernel page tables. This is done by
  copying the old swapper_pg_dir over a new page, adding the new mappings,
  and then switching to the newly built pg_dir (see `hotplug_paging` in
  arch/arm64/mm/mmu.c). There might be better ways to do this: suggestions
  are more than welcome.

- The stub function for `arch_remove_memory` has been removed for now; we
  are working in parallel on memory hot remove, and we plan to contribute
  it as a separate patch.

- Corresponding Kconfig flags have been added.

- Note that this patch does not work when NUMA is enabled; in fact,
  the function `memory_add_physaddr_to_nid` does not have an
  implementation when the NUMA flag is on: this function is supposed to
  return the nid the hotplugged memory should be associated with.
  However, it is not really clear to us yet what the semantics of this
  function should be in the context of a NUMA system. A quick and dirty
  fix would be to always attach to the first available NUMA node (see
  the sketch after this list).

- In arch/arm64/mm/init.c `arch_add_memory`, we are doing a hack with the
  nomap memory block flags to satisfy preconditions and postconditions of
  `__add_pages` and postconditions of `arch_add_memory`. Compared to
  the memory hotplug implementation of other architectures, the "issue"
  seems to be in the implementation of `pfn_valid`. Suggestions on how
  to cleanly avoid this hack are welcome.
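
For the NUMA point above, a quick-and-dirty fallback could look roughly
like the sketch below. This is illustrative only and not part of this
patchset: it assumes that attaching all hotplugged memory to the first
online node is acceptable, and the placement (e.g. arch/arm64/mm/init.c)
is hypothetical.

	#ifdef CONFIG_NUMA
	/* Hypothetical fallback, not included in this series */
	int memory_add_physaddr_to_nid(u64 start)
	{
		/* Attach all hotplugged memory to the first online node */
		return first_online_node;
	}
	EXPORT_SYMBOL(memory_add_physaddr_to_nid);
	#endif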

This patchset can be tested by booting the kernel with the `mem=X` flag,
where X is less than the total available physical memory and is a multiple
of MIN_MEMORY_BLOCK_SIZE. We have also tested it on a customised version
of QEMU capable of emulating physical hotplug on the arm64 platform.

To enable the feature, the CONFIG_MEMORY_HOTPLUG config option needs
to be enabled. Then, after memory is physically hotplugged,
the standard two steps to make it available (as also documented in
Documentation/memory-hotplug.txt) are:

(1) Notify memory hot-add
 	echo '0xYY000000' > /sys/devices/system/memory/probe

where 0xYY000000 is the first physical address of the new memory section.

(2) Online new memory block(s)
    echo online > /sys/devices/system/memory/memoryXXX/state
    -- or --
    echo online_movable > /sys/devices/system/memory/memoryXXX/state

where XXX corresponds to the IDs of the newly added blocks.

Onlining can optionally be automatic at hot-add notification by enabling
the global flag:
	echo online > /sys/devices/system/memory/auto_online_blocks
or by setting the corresponding config option
(CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE) in the kernel build.

Again, any comment is highly appreciated.

[1] https://lkml.org/lkml/2016/11/17/49
[2] https://lkml.org/lkml/2016/12/1/811
[3] https://lkml.org/lkml/2016/12/14/188

Signed-off-by: Maciej Bielski <m.bielski@virtualopensystems.com>
Signed-off-by: Andrea Reale <ar@linux.vnet.ibm.com>
---
 arch/arm64/Kconfig           |  4 +--
 arch/arm64/include/asm/mmu.h |  3 ++
 arch/arm64/mm/init.c         | 77 ++++++++++++++++++++++++++++++++++----------
 arch/arm64/mm/mmu.c          | 35 ++++++++++++++++++++
 include/linux/memblock.h     |  1 +
 mm/memblock.c                | 10 ++++++
 6 files changed, 110 insertions(+), 20 deletions(-)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index d930f73..fa71d94 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -621,9 +621,7 @@ config HOTPLUG_CPU
 	  can be controlled through /sys/devices/system/cpu.
 
 config ARCH_ENABLE_MEMORY_HOTPLUG
-	def_bool y
-
-config ARCH_ENABLE_MEMORY_HOTREMOVE
+    depends on !NUMA
 	def_bool y
 
 # Common NUMA Features
diff --git a/arch/arm64/include/asm/mmu.h b/arch/arm64/include/asm/mmu.h
index 4761941..8eb31db 100644
--- a/arch/arm64/include/asm/mmu.h
+++ b/arch/arm64/include/asm/mmu.h
@@ -37,5 +37,8 @@ extern void create_pgd_mapping(struct mm_struct *mm, phys_addr_t phys,
 			       unsigned long virt, phys_addr_t size,
 			       pgprot_t prot, bool page_mappings_only);
 extern void *fixmap_remap_fdt(phys_addr_t dt_phys);
+#ifdef CONFIG_MEMORY_HOTPLUG
+extern void hotplug_paging(phys_addr_t start, phys_addr_t size);
+#endif
 
 #endif
diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
index 4dcb8f7..259bb6e 100644
--- a/arch/arm64/mm/init.c
+++ b/arch/arm64/mm/init.c
@@ -549,37 +549,80 @@ int arch_add_memory(int nid, u64 start, u64 size, bool for_device)
 	struct zone *zone;
 	unsigned long start_pfn = start >> PAGE_SHIFT;
 	unsigned long nr_pages = size >> PAGE_SHIFT;
+	unsigned long end_pfn = start_pfn + nr_pages;
+	unsigned long max_sparsemem_pfn = 1UL << (MAX_PHYSMEM_BITS-PAGE_SHIFT);
+	unsigned long pfn;
 	int ret;
 
+	if (end_pfn > max_sparsemem_pfn) {
+		pr_err("end_pfn too big");
+		return -1;
+	}
+	hotplug_paging(start, size);
+
+	/*
+	 * Mark the first page in the range as unusable. This is needed
+	 * because __add_section (within __add_pages) wants pfn_valid
+	 * of it to be false, and on arm64 pfn_valid is implemented by
+	 * just checking the nomap flag of existing blocks.
+	 *
+	 * A small trick here is that __add_section() requires only
+	 * phys_start_pfn (that is the first pfn of a section) to be
+	 * invalid. Regardless of whether it was assumed (by the function
+	 * author) that all pfns within a section are either all valid
+	 * or all invalid, it allows us to avoid looping twice (once here,
+	 * and again when memblock_clear_nomap() is called) through all
+	 * the pfns of the section, modifying only one pfn. Thanks to that,
+	 * further, in __add_zone() only this very first pfn is skipped
+	 * and corresponding page is not flagged reserved. Therefore it
+	 * is enough to correct this setup only for it.
+	 *
+	 * When arch_add_memory() returns, the walk_memory_range() function
+	 * is called with the online_memory_block() callback,
+	 * whose execution finally reaches the memory_block_action()
+	 * function, where again only the first pfn of a memory block is
+	 * checked to be reserved. Above it was the first pfn of a section,
+	 * here it is the first pfn of a block, but
+	 * (drivers/base/memory.c):
+	 *     sections_per_block = block_sz / MIN_MEMORY_BLOCK_SIZE;
+	 * (include/linux/memory.h):
+	 *     #define MIN_MEMORY_BLOCK_SIZE     (1UL << SECTION_SIZE_BITS)
+	 * so we can consider blocks and sections equivalent here.
+	 */
+	memblock_mark_nomap(start, 1<<PAGE_SHIFT);
+
 	pgdat = NODE_DATA(nid);
 
 	zone = pgdat->node_zones +
 		zone_for_memory(nid, start, size, ZONE_NORMAL, for_device);
 	ret = __add_pages(nid, zone, start_pfn, nr_pages);
 
-	if (ret)
-		pr_warn("%s: Problem encountered in __add_pages() ret=%d\n",
-			__func__, ret);
-
-	return ret;
-}
+	/*
+	 * Make the pages usable after they have been added.
+	 * This will make pfn_valid return true
+	 */
+	memblock_clear_nomap(start, 1<<PAGE_SHIFT);
 
-#ifdef CONFIG_MEMORY_HOTREMOVE
-int arch_remove_memory(u64 start, u64 size)
-{
-	unsigned long start_pfn = start >> PAGE_SHIFT;
-	unsigned long nr_pages = size >> PAGE_SHIFT;
-	struct zone *zone;
-	int ret;
+	/*
+	 * This is a hack to avoid having to mix arch specific code
+	 * into arch independent code. SetPageReserved is supposed
+	 * to be called by __add_zone (within __add_section, within
+	 * __add_pages). However, when it is called there, it assumes that
+	 * pfn_valid returns true.  For the way pfn_valid is implemented
+	 * in arm64 (a check on the nomap flag), the only way to make
+	 * this evaluate true inside __add_zone is to clear the nomap
+	 * flags of blocks in architecture independent code.
+	 *
+	 * To avoid this, we set the Reserved flag here after we cleared
+	 * the nomap flag in the line above.
+	 */
+	SetPageReserved(pfn_to_page(start_pfn));
 
-	zone = page_zone(pfn_to_page(start_pfn));
-	ret = __remove_pages(zone, start_pfn, nr_pages);
 	if (ret)
-		pr_warn("%s: Problem encountered in __remove_pages() ret=%d\n",
+		pr_warn("%s: Problem encountered in __add_pages() ret=%d\n",
 			__func__, ret);
 
 	return ret;
 }
 #endif
-#endif
 
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index d28dbcf..8882187 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -1,3 +1,4 @@
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
 /*
  * Based on arch/arm/mm/mmu.c
  *
@@ -118,6 +119,7 @@ static void alloc_init_pte(pmd_t *pmd, unsigned long addr,
 		phys_addr_t pte_phys;
 		BUG_ON(!pgtable_alloc);
 		pte_phys = pgtable_alloc();
+		pr_debug("Allocating PTE at %p\n", __va(pte_phys));
 		pte = pte_set_fixmap(pte_phys);
 		__pmd_populate(pmd, pte_phys, PMD_TYPE_TABLE);
 		pte_clear_fixmap();
@@ -158,6 +160,7 @@ static void alloc_init_pmd(pud_t *pud, unsigned long addr, unsigned long end,
 		phys_addr_t pmd_phys;
 		BUG_ON(!pgtable_alloc);
 		pmd_phys = pgtable_alloc();
+		pr_debug("Allocating PMD at %p\n", __va(pmd_phys));
 		pmd = pmd_set_fixmap(pmd_phys);
 		__pud_populate(pud, pmd_phys, PUD_TYPE_TABLE);
 		pmd_clear_fixmap();
@@ -218,6 +221,7 @@ static void alloc_init_pud(pgd_t *pgd, unsigned long addr, unsigned long end,
 		phys_addr_t pud_phys;
 		BUG_ON(!pgtable_alloc);
 		pud_phys = pgtable_alloc();
+		pr_debug("Allocating PUD at %p\n", __va(pud_phys));
 		__pgd_populate(pgd, pud_phys, PUD_TYPE_TABLE);
 	}
 	BUG_ON(pgd_bad(*pgd));
@@ -513,6 +517,37 @@ void __init paging_init(void)
 		      SWAPPER_DIR_SIZE - PAGE_SIZE);
 }
 
+#ifdef CONFIG_MEMORY_HOTPLUG
+
+/*
+ * hotplug_paging() is used by memory hotplug to build new page tables
+ * for hot added memory.
+ */
+void hotplug_paging(phys_addr_t start, phys_addr_t size)
+{
+
+	struct page *pg;
+	phys_addr_t pgd_phys = pgd_pgtable_alloc();
+	pgd_t *pgd = pgd_set_fixmap(pgd_phys);
+
+	memcpy(pgd, swapper_pg_dir, PAGE_SIZE);
+
+	__create_pgd_mapping(pgd, start, __phys_to_virt(start), size,
+		PAGE_KERNEL, pgd_pgtable_alloc, false);
+
+	cpu_replace_ttbr1(__va(pgd_phys));
+	memcpy(swapper_pg_dir, pgd, PAGE_SIZE);
+	cpu_replace_ttbr1(swapper_pg_dir);
+
+	pgd_clear_fixmap();
+
+	pg = phys_to_page(pgd_phys);
+	pgtable_page_dtor(pg);
+	__free_pages(pg, 0);
+}
+
+#endif
+
 /*
  * Check whether a kernel address is valid (derived from arch/x86/).
  */
diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index bdfc65a..e82daff 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -93,6 +93,7 @@ bool memblock_overlaps_region(struct memblock_type *type,
 int memblock_clear_hotplug(phys_addr_t base, phys_addr_t size);
 int memblock_mark_mirror(phys_addr_t base, phys_addr_t size);
 int memblock_mark_nomap(phys_addr_t base, phys_addr_t size);
+int memblock_clear_nomap(phys_addr_t base, phys_addr_t size);
 ulong choose_memblock_flags(void);
 
 /* Low level functions */
diff --git a/mm/memblock.c b/mm/memblock.c
index 696f06d..e9b0eaf 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -805,6 +805,16 @@ int __init_memblock memblock_mark_nomap(phys_addr_t base, phys_addr_t size)
 }
 
 /**
+ * memblock_clear_nomap - Clear a flag of MEMBLOCK_NOMAP memory region
+ * @base: the base phys addr of the region
+ * @size: the size of the region
+ */
+int __init_memblock memblock_clear_nomap(phys_addr_t base, phys_addr_t size)
+{
+	return memblock_setclr_flag(base, size, 0, MEMBLOCK_NOMAP);
+}
+
+/**
  * __next_reserved_mem_region - next function for for_each_reserved_region()
  * @idx: pointer to u64 loop variable
  * @out_start: ptr to phys_addr_t for start address of the region, can be %NULL
-- 
1.9.1


* [PATCH 4/5] Hot-remove implementation for arm64
  2017-04-11 14:54 [PATCH 0/5] Memory hotplug support for arm64 - complete patchset Andrea Reale
                   ` (2 preceding siblings ...)
  2017-04-11 14:55 ` [PATCH 3/5] Memory hotplug support for arm64 platform (v2) Andrea Reale
@ 2017-04-11 14:55 ` Andrea Reale
  2017-04-11 17:12   ` Mark Rutland
  2017-04-11 14:56 ` [PATCH 5/5] Add "remove" probe driver for memory hot-remove Andrea Reale
  4 siblings, 1 reply; 18+ messages in thread
From: Andrea Reale @ 2017-04-11 14:55 UTC (permalink / raw)
  To: linux-arm-kernel
  Cc: m.bielski, ar, scott.branden, will.deacon, qiuxishi, f.fainelli,
	linux-kernel

- arch_remove_memory interface
- kernel page tables cleanup
- vmemmap_free implementation for arm64

Signed-off-by: Andrea Reale <ar@linux.vnet.ibm.com>
Signed-off-by: Maciej Bielski <m.bielski@virtualopensystems.com>
---
 arch/arm64/Kconfig               |   3 +
 arch/arm64/include/asm/mmu.h     |   4 +
 arch/arm64/include/asm/pgtable.h |  15 ++
 arch/arm64/mm/init.c             |  32 +++-
 arch/arm64/mm/mmu.c              | 390 ++++++++++++++++++++++++++++++++++++++-
 5 files changed, 438 insertions(+), 6 deletions(-)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index fa71d94..83b8bb5 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -624,6 +624,9 @@ config ARCH_ENABLE_MEMORY_HOTPLUG
     depends on !NUMA
 	def_bool y
 
+config ARCH_ENABLE_MEMORY_HOTREMOVE
+	def_bool y
+
 # Common NUMA Features
 config NUMA
 	bool "Numa Memory Allocation and Scheduler Support"
diff --git a/arch/arm64/include/asm/mmu.h b/arch/arm64/include/asm/mmu.h
index 8eb31db..2cf2115 100644
--- a/arch/arm64/include/asm/mmu.h
+++ b/arch/arm64/include/asm/mmu.h
@@ -39,6 +39,10 @@ extern void create_pgd_mapping(struct mm_struct *mm, phys_addr_t phys,
 extern void *fixmap_remap_fdt(phys_addr_t dt_phys);
 #ifdef CONFIG_MEMORY_HOTPLUG
 extern void hotplug_paging(phys_addr_t start, phys_addr_t size);
+#ifdef CONFIG_MEMORY_HOTREMOVE
+extern void remove_pagetable(unsigned long start,
+	unsigned long end, bool direct);
+#endif
 #endif
 
 #endif
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 0eef606..194cb3e 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -399,6 +399,11 @@ static inline phys_addr_t pmd_page_paddr(pmd_t pmd)
 	return pmd_val(pmd) & PHYS_MASK & (s32)PAGE_MASK;
 }
 
+static inline unsigned long pmd_page_vaddr(pmd_t pmd)
+{
+	return (unsigned long) __va(pmd_page_paddr(pmd));
+}
+
 /* Find an entry in the third-level page table. */
 #define pte_index(addr)		(((addr) >> PAGE_SHIFT) & (PTRS_PER_PTE - 1))
 
@@ -450,6 +455,11 @@ static inline phys_addr_t pud_page_paddr(pud_t pud)
 	return pud_val(pud) & PHYS_MASK & (s32)PAGE_MASK;
 }
 
+static inline unsigned long pud_page_vaddr(pud_t pud)
+{
+	return (unsigned long) __va(pud_page_paddr(pud));
+}
+
 /* Find an entry in the second-level page table. */
 #define pmd_index(addr)		(((addr) >> PMD_SHIFT) & (PTRS_PER_PMD - 1))
 
@@ -502,6 +512,11 @@ static inline phys_addr_t pgd_page_paddr(pgd_t pgd)
 	return pgd_val(pgd) & PHYS_MASK & (s32)PAGE_MASK;
 }
 
+static inline unsigned long pgd_page_vaddr(pgd_t pgd)
+{
+	return (unsigned long) __va(pgd_page_paddr(pgd));
+}
+
 /* Find an entry in the first-level page table. */
 #define pud_index(addr)		(((addr) >> PUD_SHIFT) & (PTRS_PER_PUD - 1))
 
diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
index 259bb6e..c12983b 100644
--- a/arch/arm64/mm/init.c
+++ b/arch/arm64/mm/init.c
@@ -551,7 +551,6 @@ int arch_add_memory(int nid, u64 start, u64 size, bool for_device)
 	unsigned long nr_pages = size >> PAGE_SHIFT;
 	unsigned long end_pfn = start_pfn + nr_pages;
 	unsigned long max_sparsemem_pfn = 1UL << (MAX_PHYSMEM_BITS-PAGE_SHIFT);
-	unsigned long pfn;
 	int ret;
 
 	if (end_pfn > max_sparsemem_pfn) {
@@ -624,5 +623,34 @@ int arch_add_memory(int nid, u64 start, u64 size, bool for_device)
 
 	return ret;
 }
-#endif
 
+#ifdef CONFIG_MEMORY_HOTREMOVE
+static void kernel_physical_mapping_remove(unsigned long start,
+	unsigned long end)
+{
+	start = (unsigned long)__va(start);
+	end = (unsigned long)__va(end);
+
+	remove_pagetable(start, end, true);
+
+}
+
+int arch_remove_memory(u64 start, u64 size)
+{
+	unsigned long start_pfn = start >> PAGE_SHIFT;
+	unsigned long nr_pages = size >> PAGE_SHIFT;
+	struct page *page = pfn_to_page(start_pfn);
+	struct zone *zone;
+	int ret = 0;
+
+	zone = page_zone(page);
+	ret = __remove_pages(zone, start_pfn, nr_pages);
+	WARN_ON_ONCE(ret);
+
+	kernel_physical_mapping_remove(start, start + size);
+
+	return ret;
+}
+
+#endif /* CONFIG_MEMORY_HOTREMOVE */
+#endif /* CONFIG_MEMORY_HOTPLUG */
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 8882187..e129d7c 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -1,4 +1,3 @@
-#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
 /*
  * Based on arch/arm/mm/mmu.c
  *
@@ -24,6 +23,7 @@
 #include <linux/errno.h>
 #include <linux/init.h>
 #include <linux/libfdt.h>
+#include <linux/memremap.h>
 #include <linux/mman.h>
 #include <linux/nodemask.h>
 #include <linux/memblock.h>
@@ -119,7 +119,6 @@ static void alloc_init_pte(pmd_t *pmd, unsigned long addr,
 		phys_addr_t pte_phys;
 		BUG_ON(!pgtable_alloc);
 		pte_phys = pgtable_alloc();
-		pr_debug("Allocating PTE at %p\n", __va(pte_phys));
 		pte = pte_set_fixmap(pte_phys);
 		__pmd_populate(pmd, pte_phys, PMD_TYPE_TABLE);
 		pte_clear_fixmap();
@@ -160,7 +159,6 @@ static void alloc_init_pmd(pud_t *pud, unsigned long addr, unsigned long end,
 		phys_addr_t pmd_phys;
 		BUG_ON(!pgtable_alloc);
 		pmd_phys = pgtable_alloc();
-		pr_debug("Allocating PMD at %p\n", __va(pmd_phys));
 		pmd = pmd_set_fixmap(pmd_phys);
 		__pud_populate(pud, pmd_phys, PUD_TYPE_TABLE);
 		pmd_clear_fixmap();
@@ -221,7 +219,6 @@ static void alloc_init_pud(pgd_t *pgd, unsigned long addr, unsigned long end,
 		phys_addr_t pud_phys;
 		BUG_ON(!pgtable_alloc);
 		pud_phys = pgtable_alloc();
-		pr_debug("Allocating PUD at %p\n", __va(pud_phys));
 		__pgd_populate(pgd, pud_phys, PUD_TYPE_TABLE);
 	}
 	BUG_ON(pgd_bad(*pgd));
@@ -546,7 +543,389 @@ void hotplug_paging(phys_addr_t start, phys_addr_t size)
 	__free_pages(pg, 0);
 }
 
+#ifdef CONFIG_MEMORY_HOTREMOVE
+#define PAGE_INUSE 0xFD
+
+static void free_pagetable(struct page *page, int order, bool direct)
+{
+	unsigned long magic;
+	unsigned int nr_pages = 1 << order;
+	struct vmem_altmap *altmap = to_vmem_altmap((unsigned long) page);
+
+	if (altmap) {
+		vmem_altmap_free(altmap, nr_pages);
+		return;
+	}
+
+	/* bootmem page has reserved flag */
+	if (PageReserved(page)) {
+		__ClearPageReserved(page);
+
+		magic = (unsigned long)page->lru.next;
+		if (magic == SECTION_INFO || magic == MIX_SECTION_INFO) {
+			while (nr_pages--)
+				put_page_bootmem(page++);
+		} else {
+			while (nr_pages--)
+				free_reserved_page(page++);
+		}
+	} else {
+		/*
+		 * Only direct pagetable allocation (those allocated via
+		 * hotplug) call the pgtable_page_ctor; vmemmap pgtable
+		 * allocations don't.
+		 */
+		if (direct)
+			pgtable_page_dtor(page);
+
+		free_pages((unsigned long)page_address(page), order);
+	}
+}
+
+static void free_pte_table(pmd_t *pmd, bool direct)
+{
+	pte_t *pte_start, *pte;
+	struct page *page;
+	int i;
+
+	pte_start =  (pte_t *) pmd_page_vaddr(*pmd);
+	/* Check that there is no valid entry left in the PTE table */
+	for (i = 0; i < PTRS_PER_PTE; i++) {
+		pte = pte_start + i;
+		if (!pte_none(*pte))
+			return;
+	}
+
+	page = pmd_page(*pmd);
+
+	free_pagetable(page, 0, direct);
+
+	/*
+	 * This spin lock could only be taken in __pte_alloc_kernel
+	 * in mm/memory.c and nowhere else (for arm64). Not sure if
+	 * the function above can be called concurrently. If in doubt,
+	 * I am leaving it here for now, but it probably can be removed.
+	 */
+	spin_lock(&init_mm.page_table_lock);
+	pmd_clear(pmd);
+	spin_unlock(&init_mm.page_table_lock);
+}
+
+static void free_pmd_table(pud_t *pud, bool direct)
+{
+	pmd_t *pmd_start, *pmd;
+	struct page *page;
+	int i;
+
+	pmd_start = (pmd_t *) pud_page_vaddr(*pud);
+	/* Check if there is no valid entry in the PMD */
+	for (i = 0; i < PTRS_PER_PMD; i++) {
+		pmd = pmd_start + i;
+		if (!pmd_none(*pmd))
+			return;
+	}
+
+	page = pud_page(*pud);
+
+	free_pagetable(page, 0, direct);
+
+	/*
+	 * This spin lock could only be taken in __pte_alloc_kernel
+	 * in mm/memory.c and nowhere else (for arm64). Not sure if
+	 * the function above can be called concurrently. If in doubt,
+	 * I am leaving it here for now, but it probably can be removed.
+	 */
+	spin_lock(&init_mm.page_table_lock);
+	pud_clear(pud);
+	spin_unlock(&init_mm.page_table_lock);
+}
+
+/*
+ * When the PUD is folded on the PGD (three levels of paging),
+ * there's no need to free PUDs
+ */
+#if CONFIG_PGTABLE_LEVELS > 3
+static void free_pud_table(pgd_t *pgd, bool direct)
+{
+	pud_t *pud_start, *pud;
+	struct page *page;
+	int i;
+
+	pud_start = (pud_t *) pgd_page_vaddr(*pgd);
+	/* Check if there is no valid entry in the PUD */
+	for (i = 0; i < PTRS_PER_PUD; i++) {
+		pud = pud_start + i;
+		if (!pud_none(*pud))
+			return;
+	}
+
+	page = pgd_page(*pgd);
+
+	free_pagetable(page, 0, direct);
+
+	/*
+	 * This spin lock could only be
+	 * taken in __pte_alloc_kernel in
+	 * mm/memory.c and nowhere else
+	 * (for arm64). Not sure if the
+	 * function above can be called
+	 * concurrently. If in doubt,
+	 * I am leaving it here for now,
+	 * but it probably can be removed.
+	 */
+	spin_lock(&init_mm.page_table_lock);
+	pgd_clear(pgd);
+	spin_unlock(&init_mm.page_table_lock);
+}
+#endif
+
+static void remove_pte_table(pte_t *pte, unsigned long addr,
+	unsigned long end, bool direct)
+{
+	unsigned long next;
+	void *page_addr;
+
+	for (; addr < end; addr = next, pte++) {
+		next = (addr + PAGE_SIZE) & PAGE_MASK;
+		if (next > end)
+			next = end;
+
+		if (!pte_present(*pte))
+			continue;
+
+		if (PAGE_ALIGNED(addr) && PAGE_ALIGNED(next)) {
+			/*
+			 * Do not free direct mapping pages since they were
+			 * freed when offlining, or simply not in use.
+			 */
+			if (!direct)
+				free_pagetable(pte_page(*pte), 0, direct);
+
+			/*
+			 * This spin lock could only be
+			 * taken in __pte_alloc_kernel in
+			 * mm/memory.c and nowhere else
+			 * (for arm64). Not sure if the
+			 * function above can be called
+			 * concurrently. If in doubt,
+			 * I am leaving it here for now,
+			 * but it probably can be removed.
+			 */
+			spin_lock(&init_mm.page_table_lock);
+			pte_clear(&init_mm, addr, pte);
+			spin_unlock(&init_mm.page_table_lock);
+		} else {
+			/*
+			 * If we are here, we are freeing vmemmap pages since
+			 * direct mapped memory ranges to be freed are aligned.
+			 *
+			 * If we are not removing the whole page, it means
+			 * other page structs in this page are being used and
+			 * we cannot remove them. So fill the unused page_structs
+			 * with 0xFD, and remove the page when it is wholly
+			 * filled with 0xFD.
+			 */
+			memset((void *)addr, PAGE_INUSE, next - addr);
+
+			page_addr = page_address(pte_page(*pte));
+			if (!memchr_inv(page_addr, PAGE_INUSE, PAGE_SIZE)) {
+				free_pagetable(pte_page(*pte), 0, direct);
+
+				/*
+				 * This spin lock could only be
+				 * taken in __pte_alloc_kernel in
+				 * mm/memory.c and nowhere else
+				 * (for arm64). Not sure if the
+				 * function above can be called
+				 * concurrently. If in doubt,
+				 * I am leaving it here for now,
+				 * but it probably can be removed.
+				 */
+				spin_lock(&init_mm.page_table_lock);
+				pte_clear(&init_mm, addr, pte);
+				spin_unlock(&init_mm.page_table_lock);
+			}
+		}
+	}
+
+	// I am adding this flush here in symmetry to the x86 code.
+	// Why do I need to call it here and not in remove_p[mu]d?
+	flush_tlb_all();
+}
+
+static void remove_pmd_table(pmd_t *pmd, unsigned long addr,
+	unsigned long end, bool direct)
+{
+	unsigned long next;
+	void *page_addr;
+	pte_t *pte;
+
+	for (; addr < end; addr = next, pmd++) {
+		next = pmd_addr_end(addr, end);
+
+		if (!pmd_present(*pmd))
+			continue;
+
+		// check if we are using 2MB section mappings
+		if (pmd_sect(*pmd)) {
+			if (PAGE_ALIGNED(addr) && PAGE_ALIGNED(next)) {
+				if (!direct) {
+					free_pagetable(pmd_page(*pmd),
+						get_order(PMD_SIZE), direct);
+				}
+				/*
+				 * This spin lock could only be
+				 * taken in __pte_alloc_kernel in
+				 * mm/memory.c and nowhere else
+				 * (for arm64). Not sure if the
+				 * function above can be called
+				 * concurrently. If in doubt,
+				 * I am leaving it here for now,
+				 * but it probably can be removed.
+				 */
+				spin_lock(&init_mm.page_table_lock);
+				pmd_clear(pmd);
+				spin_unlock(&init_mm.page_table_lock);
+			} else {
+				/* If here, we are freeing vmemmap pages. */
+				memset((void *)addr, PAGE_INUSE, next - addr);
+
+				page_addr = page_address(pmd_page(*pmd));
+				if (!memchr_inv(page_addr, PAGE_INUSE,
+						PMD_SIZE)) {
+					free_pagetable(pmd_page(*pmd),
+						get_order(PMD_SIZE), direct);
+
+					/*
+					 * This spin lock could only be
+					 * taken in __pte_alloc_kernel in
+					 * mm/memory.c and nowhere else
+					 * (for arm64). Not sure if the
+					 * function above can be called
+					 * concurrently. If in doubt,
+					 * I am leaving it here for now,
+					 * but it probably can be removed.
+					 */
+					spin_lock(&init_mm.page_table_lock);
+					pmd_clear(pmd);
+					spin_unlock(&init_mm.page_table_lock);
+				}
+			}
+			continue;
+		}
+
+		BUG_ON(!pmd_table(*pmd));
+
+		pte = pte_offset_map(pmd, addr);
+		remove_pte_table(pte, addr, next, direct);
+		free_pte_table(pmd, direct);
+	}
+}
+
+static void remove_pud_table(pud_t *pud, unsigned long addr,
+	unsigned long end, bool direct)
+{
+	unsigned long next;
+	pmd_t *pmd;
+	void *page_addr;
+
+	for (; addr < end; addr = next, pud++) {
+		next = pud_addr_end(addr, end);
+		if (!pud_present(*pud))
+			continue;
+		/*
+		 * If we are using 4K granules, check if we are using
+		 * 1GB section mapping.
+		 */
+		if (pud_sect(*pud)) {
+			if (PAGE_ALIGNED(addr) && PAGE_ALIGNED(next)) {
+				if (!direct) {
+					free_pagetable(pud_page(*pud),
+						get_order(PUD_SIZE), direct);
+				}
+
+				/*
+				 * This spin lock could only be
+				 * taken in __pte_alloc_kernel in
+				 * mm/memory.c and nowhere else
+				 * (for arm64). Not sure if the
+				 * function above can be called
+				 * concurrently. If in doubt,
+				 * I am leaving it here for now,
+				 * but it probably can be removed.
+				 */
+				spin_lock(&init_mm.page_table_lock);
+				pud_clear(pud);
+				spin_unlock(&init_mm.page_table_lock);
+			} else {
+				/* If here, we are freeing vmemmap pages. */
+				memset((void *)addr, PAGE_INUSE, next - addr);
+
+				page_addr = page_address(pud_page(*pud));
+				if (!memchr_inv(page_addr, PAGE_INUSE,
+						PUD_SIZE)) {
+
+					free_pagetable(pud_page(*pud),
+						get_order(PUD_SIZE), direct);
+
+					/*
+					 * This spin lock could only be
+					 * taken in __pte_alloc_kernel in
+					 * mm/memory.c and nowhere else
+					 * (for arm64). Not sure if the
+					 * function above can be called
+					 * concurrently. If in doubt,
+					 * I am leaving it here for now,
+					 * but it probably can be removed.
+					 */
+					spin_lock(&init_mm.page_table_lock);
+					pud_clear(pud);
+					spin_unlock(&init_mm.page_table_lock);
+				}
+			}
+			continue;
+		}
+
+		BUG_ON(!pud_table(*pud));
+
+		pmd = pmd_offset(pud, addr);
+		remove_pmd_table(pmd, addr, next, direct);
+		free_pmd_table(pud, direct);
+	}
+}
+
+void remove_pagetable(unsigned long start, unsigned long end, bool direct)
+{
+	unsigned long next;
+	unsigned long addr;
+	pgd_t *pgd;
+	pud_t *pud;
+
+	for (addr = start; addr < end; addr = next) {
+		next = pgd_addr_end(addr, end);
+
+		pgd = pgd_offset_k(addr);
+		if (pgd_none(*pgd))
+			continue;
+
+		pud = pud_offset(pgd, addr);
+		remove_pud_table(pud, addr, next, direct);
+		/*
+		 * When the PUD is folded on the PGD (three levels of paging),
+		 * I did already clear the PMD page in free_pmd_table,
+		 * and reset the corresponding PGD==PUD entry.
+		 */
+#if CONFIG_PGTABLE_LEVELS > 3
+		free_pud_table(pgd, direct);
 #endif
+	}
+
+	flush_tlb_all();
+}
+
+
+#endif /* CONFIG_MEMORY_HOTREMOVE */
+#endif /* CONFIG_MEMORY_HOTPLUG */
 
 /*
  * Check whether a kernel address is valid (derived from arch/x86/).
@@ -629,6 +1008,9 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node)
 #endif	/* CONFIG_ARM64_64K_PAGES */
 void vmemmap_free(unsigned long start, unsigned long end)
 {
+#ifdef CONFIG_MEMORY_HOTREMOVE
+	remove_pagetable(start, end, false);
+#endif
 }
 #endif	/* CONFIG_SPARSEMEM_VMEMMAP */
 
-- 
1.9.1


* [PATCH 5/5] Add "remove" probe driver for memory hot-remove
  2017-04-11 14:54 [PATCH 0/5] Memory hotplug support for arm64 - complete patchset Andrea Reale
                   ` (3 preceding siblings ...)
  2017-04-11 14:55 ` [PATCH 4/5] Hot-remove implementation for arm64 Andrea Reale
@ 2017-04-11 14:56 ` Andrea Reale
  4 siblings, 0 replies; 18+ messages in thread
From: Andrea Reale @ 2017-04-11 14:56 UTC (permalink / raw)
  To: linux-arm-kernel
  Cc: m.bielski, ar, scott.branden, will.deacon, qiuxishi, f.fainelli,
	linux-kernel

Allows sections of memory that have previously been offlined to be
removed from userspace.

This is symmetric to the "memory probe" interface, as described in
Documentation/memory-hotplug.txt. It can be used to manually notify the
OS that one memory section has been removed.

Please note that a memory section can only be removed after it has
been logically offlined; trying to remove a section which has not been
previously offlined is not supported and will have undefined (but most
likely very bad) behaviour.

To offline a section one can:
# echo offline > /sys/devices/system/memory/memoryXXX/state

To use the remove interface to remove the section:
# echo $((0xYYYYYY)) > /sys/devices/system/memory/remove

where 0xYYYYYY is the physical address of the memory section to remove.

Signed-off-by: Andrea Reale <ar@linux.vnet.ibm.com>
Signed-off-by: Maciej Bielski <m.bielski@virtualopensystems.com>
---
 drivers/base/memory.c | 34 +++++++++++++++++++++++++++++++++-
 1 file changed, 33 insertions(+), 1 deletion(-)

diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index cc4f1d0..8e033b5 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -511,7 +511,36 @@ static DEVICE_ATTR(auto_online_blocks, 0644, show_auto_online_blocks,
 }
 
 static DEVICE_ATTR(probe, S_IWUSR, NULL, memory_probe_store);
-#endif
+
+#ifdef CONFIG_MEMORY_HOTREMOVE
+static ssize_t
+memory_remove_store(struct device *dev,
+		struct device_attribute *attr, const char *buf, size_t count)
+{
+	u64 phys_addr;
+	int nid, ret;
+	unsigned long pages_per_block = PAGES_PER_SECTION * sections_per_block;
+
+	ret = kstrtoull(buf, 0, &phys_addr);
+	if (ret)
+		return ret;
+
+	if (phys_addr & ((pages_per_block << PAGE_SHIFT) - 1))
+		return -EINVAL;
+
+	nid = memory_add_physaddr_to_nid(phys_addr);
+	ret = lock_device_hotplug_sysfs();
+	if (ret)
+		return ret;
+
+	remove_memory(nid, phys_addr,
+			 MIN_MEMORY_BLOCK_SIZE * sections_per_block);
+	unlock_device_hotplug();
+	return count;
+}
+static DEVICE_ATTR(remove, S_IWUSR, NULL, memory_remove_store);
+#endif /* CONFIG_MEMORY_HOTREMOVE */
+#endif /* CONFIG_ARCH_MEMORY_PROBE */
 
 #ifdef CONFIG_MEMORY_FAILURE
 /*
@@ -776,6 +805,9 @@ bool is_memblock_offlined(struct memory_block *mem)
 static struct attribute *memory_root_attrs[] = {
 #ifdef CONFIG_ARCH_MEMORY_PROBE
 	&dev_attr_probe.attr,
+#ifdef CONFIG_MEMORY_HOTREMOVE
+	&dev_attr_remove.attr,
+#endif
 #endif
 
 #ifdef CONFIG_MEMORY_FAILURE
-- 
1.9.1


* Re: [PATCH 3/5] Memory hotplug support for arm64 platform (v2)
  2017-04-11 14:55 ` [PATCH 3/5] Memory hotplug support for arm64 platform (v2) Andrea Reale
@ 2017-04-11 15:58   ` Mark Rutland
  2017-04-24 16:44     ` Maciej Bielski
  2017-04-24 17:35     ` Maciej Bielski
  0 siblings, 2 replies; 18+ messages in thread
From: Mark Rutland @ 2017-04-11 15:58 UTC (permalink / raw)
  To: Andrea Reale
  Cc: linux-arm-kernel, f.fainelli, m.bielski, scott.branden,
	will.deacon, linux-kernel, qiuxishi

Hi,

On Tue, Apr 11, 2017 at 03:55:22PM +0100, Andrea Reale wrote:
> From: Maciej Bielski <m.bielski@virtualopensystems.com>
> 
> This is a second and improved version of the patch previously released
> in [3].
> 
> It builds on the work by Scott Branden [2] and, henceforth,
> it needs to be applied on top of Scott's patches [2].
> Comments are very welcome.
> 
> Changes from the original patchset and known issues:
> 
> - Compared to Scott's original patchset, this work adds the mapping of
>   the new hotplugged pages into the kernel page tables. This is done by
>   copying the old swapper_pg_dir over a new page, adding the new mappings,
>   and then switching to the newly built pg_dir (see `hotplug_paging` in
>   arch/arm64/mmu.c). There might be better ways to to this: suggestions
>   are more than welcome.

For this reply, I'm just going to focus on the PGD replacement.

It's not clear to me if we need to swap the PGD, since everything we do
here is strictly additive and non-conflicting, and it should be safe to
add things to the swapper_pg_dir live.
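
That is, as a rough, untested sketch (reusing the helpers the patch
already calls, and assuming live additions to swapper_pg_dir are indeed
safe), hotplug_paging() could presumably reduce to:

	void hotplug_paging(phys_addr_t start, phys_addr_t size)
	{
		/* add the new linear map entries directly to the live swapper */
		__create_pgd_mapping(swapper_pg_dir, start, __phys_to_virt(start),
				     size, PAGE_KERNEL, pgd_pgtable_alloc, false);
	}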

However, assuming there's something I've missed, I have some comments
below.

[...]

> +#ifdef CONFIG_MEMORY_HOTPLUG
> +
> +/*
> + * hotplug_paging() is used by memory hotplug to build new page tables
> + * for hot added memory.
> + */
> +void hotplug_paging(phys_addr_t start, phys_addr_t size)
> +{
> +
> +	struct page *pg;
> +	phys_addr_t pgd_phys = pgd_pgtable_alloc();
> +	pgd_t *pgd = pgd_set_fixmap(pgd_phys);
> +
> +	memcpy(pgd, swapper_pg_dir, PAGE_SIZE);

s/PAGE_SIZE/PGD_SIZE/ (and below, too).

See commit 12f043ff2b28fa64 ("arm64: fix warning about swapper_pg_dir
overflow").

> +
> +	__create_pgd_mapping(pgd, start, __phys_to_virt(start), size,
> +		PAGE_KERNEL, pgd_pgtable_alloc, false);
> +
> +	cpu_replace_ttbr1(__va(pgd_phys));

Other CPUs may be online at this point, and cpu_replace_ttbr1() was only
intended for UP operation. Other CPUs will still have swapper_pg_dir in
their TTBR1_EL1 at this point in time...

> +	memcpy(swapper_pg_dir, pgd, PAGE_SIZE);

... which this will alter, in an unsafe fashion.

The other CPUs are live, and might be altering swapper. e.g. one might
be using the fixmap to alter code. We will need some stop_machine()-like
machinery here to synchronise with other CPUs, ensuring that they don't
have swapper_pg_dir live.

Unfortunately, we can't change other to the temporary pgdir in a cross
call, then fix things up, then do another cross call. If any exception
is taken when we're in the temporary pgdir, __uaccess_ttbr0_disable will
install junk into TTBR0, and we risk the usual set of pain junk TLBs
entail.

We appear to have a latent bug at boot time along those lines, when
setting up the page tables and initialising KASAN. I'll see about
cleaning that up.

Thanks,
Mark.


* Re: [PATCH 4/5] Hot-remove implementation for arm64
  2017-04-11 14:55 ` [PATCH 4/5] Hot-remove implementation for arm64 Andrea Reale
@ 2017-04-11 17:12   ` Mark Rutland
  2017-04-14 14:01     ` Andrea Reale
  0 siblings, 1 reply; 18+ messages in thread
From: Mark Rutland @ 2017-04-11 17:12 UTC (permalink / raw)
  To: Andrea Reale
  Cc: linux-arm-kernel, f.fainelli, m.bielski, scott.branden,
	will.deacon, linux-kernel, qiuxishi

Hi,

On Tue, Apr 11, 2017 at 03:55:42PM +0100, Andrea Reale wrote:
> +static inline unsigned long pmd_page_vaddr(pmd_t pmd)
> +{
> +	return (unsigned long) __va(pmd_page_paddr(pmd));
> +}
> +
>  /* Find an entry in the third-level page table. */
>  #define pte_index(addr)		(((addr) >> PAGE_SHIFT) & (PTRS_PER_PTE - 1))
>  
> @@ -450,6 +455,11 @@ static inline phys_addr_t pud_page_paddr(pud_t pud)
>  	return pud_val(pud) & PHYS_MASK & (s32)PAGE_MASK;
>  }
>  
> +static inline unsigned long pud_page_vaddr(pud_t pud)
> +{
> +	return (unsigned long) __va(pud_page_paddr(pud));
> +}
> +
>  /* Find an entry in the second-level page table. */
>  #define pmd_index(addr)		(((addr) >> PMD_SHIFT) & (PTRS_PER_PMD - 1))
>  
> @@ -502,6 +512,11 @@ static inline phys_addr_t pgd_page_paddr(pgd_t pgd)
>  	return pgd_val(pgd) & PHYS_MASK & (s32)PAGE_MASK;
>  }
>  
> +static inline unsigned long pgd_page_vaddr(pgd_t pgd)
> +{
> +	return (unsigned long) __va(pgd_page_paddr(pgd));
> +}
> +

These duplicate the existing p*d_offset*() functions, and I do not think
they are necessary. More on that below.

[...]

> @@ -551,7 +551,6 @@ int arch_add_memory(int nid, u64 start, u64 size, bool for_device)
>  	unsigned long nr_pages = size >> PAGE_SHIFT;
>  	unsigned long end_pfn = start_pfn + nr_pages;
>  	unsigned long max_sparsemem_pfn = 1UL << (MAX_PHYSMEM_BITS-PAGE_SHIFT);
> -	unsigned long pfn;
>  	int ret;
>  
>  	if (end_pfn > max_sparsemem_pfn) {

Should this have been part of a prior patch?

This patch doesn't remove any users of this variable.

[...]

> +static void  free_pagetable(struct page *page, int order, bool direct)

This "direct" parameter needs a better name, and a description in a
comment block above this function.

> +{
> +	unsigned long magic;
> +	unsigned int nr_pages = 1 << order;
> +	struct vmem_altmap *altmap = to_vmem_altmap((unsigned long) page);
> +
> +	if (altmap) {
> +		vmem_altmap_free(altmap, nr_pages);
> +		return;
> +	}
> +
> +	/* bootmem page has reserved flag */
> +	if (PageReserved(page)) {
> +		__ClearPageReserved(page);
> +
> +		magic = (unsigned long)page->lru.next;
> +		if (magic == SECTION_INFO || magic == MIX_SECTION_INFO) {
> +			while (nr_pages--)
> +				put_page_bootmem(page++);
> +		} else {
> +			while (nr_pages--)
> +				free_reserved_page(page++);
> +		}
> +	} else {
> +		/*
> +		 * Only direct pagetable allocation (those allocated via
> +		 * hotplug) call the pgtable_page_ctor; vmemmap pgtable
> +		 * allocations don't.
> +		 */
> +		if (direct)
> +			pgtable_page_dtor(page);
> +
> +		free_pages((unsigned long)page_address(page), order);
> +	}
> +}

This largely looks like a copy of the x86 code. Why can that not be
factored out to an arch-neutral location?

> +
> +static void free_pte_table(pmd_t *pmd, bool direct)
> +{
> +	pte_t *pte_start, *pte;
> +	struct page *page;
> +	int i;
> +
> +	pte_start =  (pte_t *) pmd_page_vaddr(*pmd);
> +	/* Check if there is no valid entry in the PMD */
> +	for (i = 0; i < PTRS_PER_PTE; i++) {
> +		pte = pte_start + i;
> +		if (!pte_none(*pte))
> +			return;
> +	}

If we must walk the tables in this way, please use the existing pattern
from arch/arm64/mm/dump.c.

e.g. 

	pte_t *pte;

	pte = pte_offset_kernel(pmd, 0UL);
	for (i = 0; i < PTRS_PER_PTE; i++, pte++)
		if (!pte_none(*pte))
			return;

> +
> +	page = pmd_page(*pmd);
> +
> +	free_pagetable(page, 0, direct);

The page has been freed here, and may be subject to arbitrary
modification...

> +
> +	/*
> +	 * This spin lock could be only taken in _pte_aloc_kernel
> +	 * in mm/memory.c and nowhere else (for arm64). Not sure if
> +	 * the function above can be called concurrently. In doubt,
> +	 * I am living it here for now, but it probably can be removed
> +	 */
> +	spin_lock(&init_mm.page_table_lock);
> +	pmd_clear(pmd);

... but we only remove it from the page tables here, so the page table
walkers can see junk in the tables, were the page reused in the
meantime.

After clearing the PMD, it needs to be cleared from TLBs. We allow
partial walks to be cached, so the TLBs may still start walking this
entry or beyond.
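
That is, the ordering would need to be roughly as in the sketch below
(illustrative only; it assumes the address range is passed down so a
targeted flush can be used):

	page = pmd_page(*pmd);			/* remember the table page */
	spin_lock(&init_mm.page_table_lock);
	pmd_clear(pmd);				/* unhook it from the tables */
	spin_unlock(&init_mm.page_table_lock);
	flush_tlb_kernel_range(addr, addr + PMD_SIZE);	/* or flush_tlb_all() */
	free_pagetable(page, 0, direct);	/* free only once unreachable */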

> +	spin_unlock(&init_mm.page_table_lock);
> +}

The same comments apply to all the free_p*d_table() functions, so I
shan't repeat them.

[...]

> +static void remove_pte_table(pte_t *pte, unsigned long addr,
> +	unsigned long end, bool direct)
> +{
> +	unsigned long next;
> +	void *page_addr;
> +
> +	for (; addr < end; addr = next, pte++) {
> +		next = (addr + PAGE_SIZE) & PAGE_MASK;
> +		if (next > end)
> +			next = end;

Please use the usual p*d_addr_end() functions. See alloc_init_pmd() and
friends in arch/arm64/mm/mmu.c for examples of how to use them.
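
e.g. at the pmd level the usual shape is roughly (sketch):

	pmd_t *pmd = pmd_offset(pud, addr);
	unsigned long next;

	do {
		next = pmd_addr_end(addr, end);
		/* ... handle the range [addr, next) at this level ... */
	} while (pmd++, addr = next, addr != end);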

> +
> +		if (!pte_present(*pte))
> +			continue;
> +
> +		if (PAGE_ALIGNED(addr) && PAGE_ALIGNED(next)) {

When would those addresses *not* be page-aligned? By construction, next
must be. Unplugging partial pages of memory makes no sense, so surely
addr is page-aligned when passed in?

> +			/*
> +			 * Do not free direct mapping pages since they were
> +			 * freed when offlining, or simplely not in use.
> +			 */
> +			if (!direct)
> +				free_pagetable(pte_page(*pte), 0, direct);
> +
> +			/*
> +			 * This spin lock could be only
> +			 * taken in _pte_aloc_kernel in
> +			 * mm/memory.c and nowhere else
> +			 * (for arm64). Not sure if the
> +			 * function above can be called
> +			 * concurrently. In doubt,
> +			 * I am living it here for now,
> +			 * but it probably can be removed.
> +			 */
> +			spin_lock(&init_mm.page_table_lock);
> +			pte_clear(&init_mm, addr, pte);
> +			spin_unlock(&init_mm.page_table_lock);
> +		} else {
> +			/*
> +			 * If we are here, we are freeing vmemmap pages since
> +			 * direct mapped memory ranges to be freed are aligned.
> +			 *
> +			 * If we are not removing the whole page, it means
> +			 * other page structs in this page are being used and
> +			 * we canot remove them. So fill the unused page_structs
> +			 * with 0xFD, and remove the page when it is wholly
> +			 * filled with 0xFD.
> +			 */
> +			memset((void *)addr, PAGE_INUSE, next - addr);

What's special about 0xFD?

Why do we need to mess with the page array in this manner? Why can't we
detect when a range is free by querying memblock, for example?

> +
> +			page_addr = page_address(pte_page(*pte));
> +			if (!memchr_inv(page_addr, PAGE_INUSE, PAGE_SIZE)) {
> +				free_pagetable(pte_page(*pte), 0, direct);
> +
> +				/*
> +				 * This spin lock could be only
> +				 * taken in _pte_aloc_kernel in
> +				 * mm/memory.c and nowhere else
> +				 * (for arm64). Not sure if the
> +				 * function above can be called
> +				 * concurrently. In doubt,
> +				 * I am living it here for now,
> +				 * but it probably can be removed.
> +				 */
> +				spin_lock(&init_mm.page_table_lock);
> +				pte_clear(&init_mm, addr, pte);
> +				spin_unlock(&init_mm.page_table_lock);

This logic appears to be duplicated with the free_*_table functions, and
looks incredibly suspicious.

To me, it doesn't make sense that we'd need the lock *only* to alter the
leaf entries.

> +			}
> +		}
> +	}
> +
> +	// I am adding this flush here in simmetry to the x86 code.
> +	// Why do I need to call it here and not in remove_p[mu]d
> +	flush_tlb_all();

If the page tables weren't freed until this point, it might be that this
is just amortizing the cost of removing them from the TLBs.

Given that they're freed first, this makes no sense to me.

> +}

The same comments apply to all the remove_p*d_table() functions, so I
shan't repeat them.

> +
> +static void remove_pmd_table(pmd_t *pmd, unsigned long addr,
> +	unsigned long end, bool direct)
> +{
> +	unsigned long next;
> +	void *page_addr;
> +	pte_t *pte;
> +
> +	for (; addr < end; addr = next, pmd++) {
> +		next = pmd_addr_end(addr, end);
> +
> +		if (!pmd_present(*pmd))
> +			continue;
> +
> +		// check if we are using 2MB section mappings
> +		if (pmd_sect(*pmd)) {
> +			if (PAGE_ALIGNED(addr) && PAGE_ALIGNED(next)) {

Surely you're intending to check if you can free the whole pmd? i.e.
that addr and next are pmd-aligned?

Can we ever be in a situation where we're requested to free a partial
pmd that could be section mapped?

If that's the case, we'll *need* to split the pmd, which we can't do on
live page tables.

> +				if (!direct) {
> +					free_pagetable(pmd_page(*pmd),
> +						get_order(PMD_SIZE), direct);
> +				}
> +				/*
> +				 * This spin lock could be only
> +				 * taken in _pte_aloc_kernel in
> +				 * mm/memory.c and nowhere else
> +				 * (for arm64). Not sure if the
> +				 * function above can be called
> +				 * concurrently. In doubt,
> +				 * I am living it here for now,
> +				 * but it probably can be removed.
> +				 */
> +				spin_lock(&init_mm.page_table_lock);
> +				pmd_clear(pmd);
> +				spin_unlock(&init_mm.page_table_lock);
> +			} else {
> +				/* If here, we are freeing vmemmap pages. */
> +				memset((void *)addr, PAGE_INUSE, next - addr);
> +
> +				page_addr = page_address(pmd_page(*pmd));
> +				if (!memchr_inv(page_addr, PAGE_INUSE,
> +						PMD_SIZE)) {
> +					free_pagetable(pmd_page(*pmd),
> +						get_order(PMD_SIZE), direct);
> +
> +					/*
> +					 * This spin lock could be only
> +					 * taken in _pte_aloc_kernel in
> +					 * mm/memory.c and nowhere else
> +					 * (for arm64). Not sure if the
> +					 * function above can be called
> +					 * concurrently. In doubt,
> +					 * I am living it here for now,
> +					 * but it probably can be removed.
> +					 */
> +					spin_lock(&init_mm.page_table_lock);
> +					pmd_clear(pmd);
> +					spin_unlock(&init_mm.page_table_lock);
> +				}

I don't think this is correct.

If we're getting rid of a partial pmd, we *must* split the pmd.
Otherwise, we're leaving bits mapped that should not be. If we split the
pmd, we can free the individual pages as we would for a non-section
mapping.

As above, we can't split block entries within live tables, so that will
be painful at best.

If we can't split a pmd, then we cannot free a partial pmd, and will need
to reject requests to do so.
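
e.g. inside the pmd_sect() branch, a minimal way to reject such requests
(sketch only, not a complete fix) would be:

	if (!IS_ALIGNED(addr, PMD_SIZE) || !IS_ALIGNED(next, PMD_SIZE)) {
		WARN_ONCE(1, "%s: partial PMD block removal not supported\n",
			  __func__);
		continue;
	}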

The same comments (with s/pmd/pud/, etc) apply for the higher level
remove_p*d_table functions.

[...]

> +void remove_pagetable(unsigned long start, unsigned long end, bool direct)
> +{
> +	unsigned long next;
> +	unsigned long addr;
> +	pgd_t *pgd;
> +	pud_t *pud;
> +
> +	for (addr = start; addr < end; addr = next) {
> +		next = pgd_addr_end(addr, end);
> +
> +		pgd = pgd_offset_k(addr);
> +		if (pgd_none(*pgd))
> +			continue;
> +
> +		pud = pud_offset(pgd, addr);
> +		remove_pud_table(pud, addr, next, direct);
> +		/*
> +		 * When the PUD is folded on the PGD (three levels of paging),
> +		 * I did already clear the PMD page in free_pmd_table,
> +		 * and reset the corresponding PGD==PUD entry.
> +		 */
> +#if CONFIG_PGTABLE_LEVELS > 3
> +		free_pud_table(pgd, direct);
>  #endif

This looks suspicious. Shouldn't we have a similar check for PMD, to
cater for CONFIG_PGTABLE_LEVELS == 2? e.g. 64K pages, 42-bit VA.

We should be able to hide this distinction in helpers for both cases.
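
e.g. (sketch of the helper approach; the stub is illustrative):

	#if CONFIG_PGTABLE_LEVELS > 3
	static void free_pud_table(pgd_t *pgd, bool direct);	/* as in this patch */
	#else
	static inline void free_pud_table(pgd_t *pgd, bool direct) { }
	#endif

so that remove_pagetable() can call free_pud_table() unconditionally,
with the same pattern at the pmd level for CONFIG_PGTABLE_LEVELS == 2.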

> +	}
> +
> +	flush_tlb_all();

This is too late to be useful.

Thanks,
Mark.


* Re: [PATCH 1/5] arm64: memory-hotplug: Add MEMORY_HOTPLUG, MEMORY_HOTREMOVE, MEMORY_PROBE
  2017-04-11 14:54 ` [PATCH 1/5] arm64: memory-hotplug: Add MEMORY_HOTPLUG, MEMORY_HOTREMOVE, MEMORY_PROBE Andrea Reale
@ 2017-04-12  0:20   ` kbuild test robot
  0 siblings, 0 replies; 18+ messages in thread
From: kbuild test robot @ 2017-04-12  0:20 UTC (permalink / raw)
  To: Andrea Reale
  Cc: kbuild-all, linux-arm-kernel, m.bielski, ar, scott.branden,
	will.deacon, qiuxishi, f.fainelli, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 3408 bytes --]

Hi Scott,

[auto build test ERROR on linus/master]
[also build test ERROR on v4.11-rc6]
[cannot apply to arm64/for-next/core]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Andrea-Reale/Memory-hotplug-support-for-arm64-complete-patchset/20170412-022242
config: arm64-defconfig (attached as .config)
compiler: aarch64-linux-gnu-gcc (Debian 6.1.1-9) 6.1.1 20160705
reproduce:
        wget https://raw.githubusercontent.com/01org/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        make.cross ARCH=arm64 

Note: the linux-review/Andrea-Reale/Memory-hotplug-support-for-arm64-complete-patchset/20170412-022242 HEAD bffdda51b562ca7bb9baf7642e14c8d03d44602d builds fine.
      It only hurts bisectibility.

All errors (new ones prefixed by >>):

   drivers/built-in.o: In function `memory_probe_store':
>> drivers/base/memory.c:501: undefined reference to `memory_add_physaddr_to_nid'
   drivers/base/memory.c:501:(.text+0x1a5ef8): relocation truncated to fit: R_AARCH64_CALL26 against undefined symbol `memory_add_physaddr_to_nid'

vim +501 drivers/base/memory.c

3947be19 Dave Hansen         2005-10-29  485  #ifdef CONFIG_ARCH_MEMORY_PROBE
3947be19 Dave Hansen         2005-10-29  486  static ssize_t
10fbcf4c Kay Sievers         2011-12-21  487  memory_probe_store(struct device *dev, struct device_attribute *attr,
28812fe1 Andi Kleen          2010-01-05  488  		   const char *buf, size_t count)
3947be19 Dave Hansen         2005-10-29  489  {
3947be19 Dave Hansen         2005-10-29  490  	u64 phys_addr;
cb5490a5 John Allen          2016-01-14  491  	int nid, ret;
61b94fea Anton Blanchard     2011-09-15  492  	unsigned long pages_per_block = PAGES_PER_SECTION * sections_per_block;
3947be19 Dave Hansen         2005-10-29  493  
b69deb2b Zhang Zhen          2014-08-06  494  	ret = kstrtoull(buf, 0, &phys_addr);
b69deb2b Zhang Zhen          2014-08-06  495  	if (ret)
b69deb2b Zhang Zhen          2014-08-06  496  		return ret;
3947be19 Dave Hansen         2005-10-29  497  
61b94fea Anton Blanchard     2011-09-15  498  	if (phys_addr & ((pages_per_block << PAGE_SHIFT) - 1))
61b94fea Anton Blanchard     2011-09-15  499  		return -EINVAL;
61b94fea Anton Blanchard     2011-09-15  500  
bc02af93 Yasunori Goto       2006-06-27 @501  	nid = memory_add_physaddr_to_nid(phys_addr);
6add7cd6 Nathan Fontenot     2011-01-31  502  	ret = add_memory(nid, phys_addr,
cb5490a5 John Allen          2016-01-14  503  			 MIN_MEMORY_BLOCK_SIZE * sections_per_block);
cb5490a5 John Allen          2016-01-14  504  
6add7cd6 Nathan Fontenot     2011-01-31  505  	if (ret)
9f0af69b Nikanth Karthikesan 2011-03-24  506  		goto out;
6add7cd6 Nathan Fontenot     2011-01-31  507  
9f0af69b Nikanth Karthikesan 2011-03-24  508  	ret = count;
9f0af69b Nikanth Karthikesan 2011-03-24  509  out:

:::::: The code at line 501 was first introduced by commit
:::::: bc02af93dd2bbddce1b55e0a493f833a1b7cf140 [PATCH] pgdat allocation for new node add (specify node id)

:::::: TO: Yasunori Goto <y-goto@jp.fujitsu.com>
:::::: CC: Linus Torvalds <torvalds@g5.osdl.org>

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 34552 bytes --]

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 4/5] Hot-remove implementation for arm64
  2017-04-11 17:12   ` Mark Rutland
@ 2017-04-14 14:01     ` Andrea Reale
  2017-04-18 18:21       ` Mark Rutland
  0 siblings, 1 reply; 18+ messages in thread
From: Andrea Reale @ 2017-04-14 14:01 UTC (permalink / raw)
  To: Mark Rutland
  Cc: linux-arm-kernel, f.fainelli, m.bielski, scott.branden,
	will.deacon, linux-kernel, qiuxishi

Hi Mark,

thanks for your thorough feedback, it is really appreciated.

I have a few comments and I'd like to ask for further clarification,
if you don't mind, in order to help us work on a new version of the
patch.

Before I go into details, let me point out that, as you have noted
yourself, the structure of the patch is largely derived from the
x86_64 hot-remove code (mostly introduced in commit ae9aae9eda2db7
"memory-hotplug: common APIs to support page tables hot-remove") trying
to account for architectural differences. In this process, I guess it
is likely that I might have made assumptions that are true for x86_64
but do not hold for arm64. Whenever you feel this is the case, I would
be really grateful if you could help identify them.

Please, find detailed replies to your comments in-line.

On Tue, Apr 11, 2017 at 06:12:11PM +0100, Mark Rutland wrote:
> Hi,
> 
> On Tue, Apr 11, 2017 at 03:55:42PM +0100, Andrea Reale wrote:
> > +static inline unsigned long pmd_page_vaddr(pmd_t pmd)
> > +{
> > +	return (unsigned long) __va(pmd_page_paddr(pmd));
> > +}
> > +
> >  /* Find an entry in the third-level page table. */
> >  #define pte_index(addr)		(((addr) >> PAGE_SHIFT) & (PTRS_PER_PTE - 1))
> >  
> > @@ -450,6 +455,11 @@ static inline phys_addr_t pud_page_paddr(pud_t pud)
> >  	return pud_val(pud) & PHYS_MASK & (s32)PAGE_MASK;
> >  }
> >  
> > +static inline unsigned long pud_page_vaddr(pud_t pud)
> > +{
> > +	return (unsigned long) __va(pud_page_paddr(pud));
> > +}
> > +
> >  /* Find an entry in the second-level page table. */
> >  #define pmd_index(addr)		(((addr) >> PMD_SHIFT) & (PTRS_PER_PMD - 1))
> >  
> > @@ -502,6 +512,11 @@ static inline phys_addr_t pgd_page_paddr(pgd_t pgd)
> >  	return pgd_val(pgd) & PHYS_MASK & (s32)PAGE_MASK;
> >  }
> >  
> > +static inline unsigned long pgd_page_vaddr(pgd_t pgd)
> > +{
> > +	return (unsigned long) __va(pgd_page_paddr(pgd));
> > +}
> > +
> 
> These duplicate the existing p*d_offset*() functions, and I do not think
> they are necessary. More on that below.

I agree they are redundant. It just seemed cleaner than offsetting by 0,
but fair enough.

> [...]
> 
> > @@ -551,7 +551,6 @@ int arch_add_memory(int nid, u64 start, u64 size, bool for_device)
> >  	unsigned long nr_pages = size >> PAGE_SHIFT;
> >  	unsigned long end_pfn = start_pfn + nr_pages;
> >  	unsigned long max_sparsemem_pfn = 1UL << (MAX_PHYSMEM_BITS-PAGE_SHIFT);
> > -	unsigned long pfn;
> >  	int ret;
> >  
> >  	if (end_pfn > max_sparsemem_pfn) {
> 
> Should this have been part of a prior patch?
> 
> This patch doesn't remove any users of this variable.
> 
> [...]

Indeed, the unused pfn variable is a leftover of patch 3/5. We will
fix that in the next version.

> > +static void  free_pagetable(struct page *page, int order, bool direct)
> 
> This "direct" parameter needs a better name, and a description in a
> comment block above this function.
 
The name "direct" is inherited from the x86_64 hot-remove code.
It serves to distinguish whether we are removing a pagetable page that
maps the direct mapping space (I think it is also called the linear
mapping area elsewhere), or a pagetable page or a data page
from vmemmap.

In this specific set of functions, the flag is needed because the various
alloc_init_p*d functions used for allocating entries for the direct memory
mapping rely on pgd_pgtable_alloc, which in turn calls pgtable_page_ctor;
hence, we need to call the dtor. By contrast, vmemmap entries are created
using vmemmap_alloc_block (from within vmemmap_populate), which does not
call pgtable_page_ctor when allocating pages.

I am not sure I understand why the pgtable_page_ctor is not called when
allocating vmemmap page tables, but that's the current situation.

> > +{
> > +	unsigned long magic;
> > +	unsigned int nr_pages = 1 << order;
> > +	struct vmem_altmap *altmap = to_vmem_altmap((unsigned long) page);
> > +
> > +	if (altmap) {
> > +		vmem_altmap_free(altmap, nr_pages);
> > +		return;
> > +	}
> > +
> > +	/* bootmem page has reserved flag */
> > +	if (PageReserved(page)) {
> > +		__ClearPageReserved(page);
> > +
> > +		magic = (unsigned long)page->lru.next;
> > +		if (magic == SECTION_INFO || magic == MIX_SECTION_INFO) {
> > +			while (nr_pages--)
> > +				put_page_bootmem(page++);
> > +		} else {
> > +			while (nr_pages--)
> > +				free_reserved_page(page++);
> > +		}
> > +	} else {
> > +		/*
> > +		 * Only direct pagetable allocation (those allocated via
> > +		 * hotplug) call the pgtable_page_ctor; vmemmap pgtable
> > +		 * allocations don't.
> > +		 */
> > +		if (direct)
> > +			pgtable_page_dtor(page);
> > +
> > +		free_pages((unsigned long)page_address(page), order);
> > +	}
> > +}
> 
> This largely looks like a copy of the x86 code. Why can that not be
> factored out to an arch-neutral location?

Yes it probably can - the only difference being calling
pgtable_page_dtor when it needs to - but I am not confident enough to
say that it would really be architecture neutral or just specific to
arm64 and x86. For example, I don't see this used anywhere else
for hot-removing memory.

(Actually, also a large part of remove_*_table and free_*_table could
probably be factored, but I wouldn't be sure how to deal with the
differences in the pgtable.h macros used throughout)

> > +
> > +static void free_pte_table(pmd_t *pmd, bool direct)
> > +{
> > +	pte_t *pte_start, *pte;
> > +	struct page *page;
> > +	int i;
> > +
> > +	pte_start =  (pte_t *) pmd_page_vaddr(*pmd);
> > +	/* Check if there is no valid entry in the PMD */
> > +	for (i = 0; i < PTRS_PER_PTE; i++) {
> > +		pte = pte_start + i;
> > +		if (!pte_none(*pte))
> > +			return;
> > +	}
> 
> If we must walk the tables in this way, please use the existing pattern
> from arch/arm64/mm/dump.c.
> 
> e.g. 
> 
> 	pte_t *pte;
> 
> 	pte = pte_offset_kernel(pmd, 0UL);
> 	for (i = 0; i < PTRS_PER_PTE; i++, pte++)
> 		if (!pte_none)
> 			return;

Thanks.

> > +
> > +	page = pmd_page(*pmd);
> > +
> > +	free_pagetable(page, 0, direct);
> 
> The page has been freed here, and may be subject to arbitrary
> modification...
> 
> > +
> > +	/*
> > +	 * This spin lock could be only taken in _pte_aloc_kernel
> > +	 * in mm/memory.c and nowhere else (for arm64). Not sure if
> > +	 * the function above can be called concurrently. In doubt,
> > +	 * I am living it here for now, but it probably can be removed
> > +	 */
> > +	spin_lock(&init_mm.page_table_lock);
> > +	pmd_clear(pmd);
> 
> ... but we only remove it from the page tables here, so the page table
> walkers can see junk in the tables, were the page reused in the
> meantime.
> 
> After clearing the PMD, it needs to be cleared from TLBs. We allow
> partial walks to be cached, so the TLBs may still start walking this
> entry or beyond.

I guess that the safe approach would be something along the lines of:
1. clear the page table 
2. flush the tlbs
3. free the page

Am I mistaken? When I am flushing intermediate p*d entries, would it be
more appropriate to use something like __flush_tlb_pgtable() to clear
cached partial walks rather than flushing the whole table? I mean,
hot-remove is not going to be a frequent operation anyway, so I don't
think that flushing the whole TLB would do a great deal of harm.
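
Just to check I understand the ordering you are suggesting, free_pte_table
would become something like the sketch below (assuming we also pass the
virtual address down to it, and that __flush_tlb_pgtable() is the right
primitive to use for kernel mappings too):

	static void free_pte_table(pmd_t *pmd, unsigned long addr, bool direct)
	{
		pte_t *pte;
		struct page *page;
		int i;

		pte = pte_offset_kernel(pmd, 0UL);
		for (i = 0; i < PTRS_PER_PTE; i++, pte++)
			if (!pte_none(*pte))
				return;

		page = pmd_page(*pmd);

		/* 1. clear the entry ... */
		spin_lock(&init_mm.page_table_lock);
		pmd_clear(pmd);
		spin_unlock(&init_mm.page_table_lock);

		/* 2. ... drop any cached walks through it ... */
		__flush_tlb_pgtable(&init_mm, addr);

		/* 3. ... and only then free the table page. */
		free_pagetable(page, 0, direct);
	}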

My question at this point would be: how come the code structure above
works for x86_64 hot-remove? My assumption, when I was writing this, was
that there would be no problem since this code is called when we are sure
that all the memory mapped by these entries has been already offlined,
so nobody should be using those VAs anyway (also considering that there
cannot be any two mem hot-plug/remove actions running concurrently).
Is that correct?
 
> > +	spin_unlock(&init_mm.page_table_lock);
> > +}
> 
> The same comments apply to all the free_p*d_table() functions, so I
> shan't repeat them.
> 
> [...]
> 
> > +static void remove_pte_table(pte_t *pte, unsigned long addr,
> > +	unsigned long end, bool direct)
> > +{
> > +	unsigned long next;
> > +	void *page_addr;
> > +
> > +	for (; addr < end; addr = next, pte++) {
> > +		next = (addr + PAGE_SIZE) & PAGE_MASK;
> > +		if (next > end)
> > +			next = end;
> 
> Please use the usual p*d_addr_end() functions. See alloc_init_pmd() and
> friends in arch/arm64/mm/mmu.c for examples of how to use them.

we used the p*d_addr_end family of functions when dealing with p*d(s). I
cannot identify an equivalent for pte entries. Would you recommend adding
a pte_addr_end macro in pgtable.h?

> > +
> > +		if (!pte_present(*pte))
> > +			continue;
> > +
> > +		if (PAGE_ALIGNED(addr) && PAGE_ALIGNED(next)) {
> 
> When would those addresses *not* be page-aligned? By construction, next
> must be. Unplugging partial pages of memory makes no sense, so surely
> addr is page-aligned when passed in?

The issue here is that this function is called in one of two cases:
1. to clear pagetables of directly mapped (linear) memory
2. to clear pagetables (and the corresponding data pages) for vmemmap.

It is my understanding that, in the second case, we might be clearing
only part of the page content (i.e, only a few struct pages). Note that
next is page aligned by construction only if next <= end.

> > +			/*
> > +			 * Do not free direct mapping pages since they were
> > +			 * freed when offlining, or simplely not in use.
> > +			 */
> > +			if (!direct)
> > +				free_pagetable(pte_page(*pte), 0, direct);
> > +
> > +			/*
> > +			 * This spin lock could be only
> > +			 * taken in _pte_aloc_kernel in
> > +			 * mm/memory.c and nowhere else
> > +			 * (for arm64). Not sure if the
> > +			 * function above can be called
> > +			 * concurrently. In doubt,
> > +			 * I am living it here for now,
> > +			 * but it probably can be removed.
> > +			 */
> > +			spin_lock(&init_mm.page_table_lock);
> > +			pte_clear(&init_mm, addr, pte);
> > +			spin_unlock(&init_mm.page_table_lock);
> > +		} else {
> > +			/*
> > +			 * If we are here, we are freeing vmemmap pages since
> > +			 * direct mapped memory ranges to be freed are aligned.
> > +			 *
> > +			 * If we are not removing the whole page, it means
> > +			 * other page structs in this page are being used and
> > +			 * we canot remove them. So fill the unused page_structs
> > +			 * with 0xFD, and remove the page when it is wholly
> > +			 * filled with 0xFD.
> > +			 */
> > +			memset((void *)addr, PAGE_INUSE, next - addr);
> 
> What's special about 0xFD?

I just used it as a constant, mirroring the x86_64 code.

> Why do we need to mess with the page array in this manner? Why can't we
> detect when a range is free by querying memblock, for example?

I am not sure I get your suggestion. I guess that the logic here
is that I cannot be sure that I can free the full page because other
entries might be in use for active vmemmap mappings. So we just "mark"
the unused ones and only free the page when all of it is marked. See
again commit ae9aae9eda2db71, where all this comes from.

> > +
> > +			page_addr = page_address(pte_page(*pte));
> > +			if (!memchr_inv(page_addr, PAGE_INUSE, PAGE_SIZE)) {
> > +				free_pagetable(pte_page(*pte), 0, direct);
> > +
> > +				/*
> > +				 * This spin lock could be only
> > +				 * taken in _pte_aloc_kernel in
> > +				 * mm/memory.c and nowhere else
> > +				 * (for arm64). Not sure if the
> > +				 * function above can be called
> > +				 * concurrently. In doubt,
> > +				 * I am living it here for now,
> > +				 * but it probably can be removed.
> > +				 */
> > +				spin_lock(&init_mm.page_table_lock);
> > +				pte_clear(&init_mm, addr, pte);
> > +				spin_unlock(&init_mm.page_table_lock);
> 
> This logic appears to be duplicated with the free_*_table functions, and
> looks incredibly suspicious.
> 
> To me, it doesn't make sense that we'd need the lock *only* to alter the
> leaf entries.

I admit I am still confused by the use of the page_table_lock and when
and where it should be used. Any hint on that would be more than
welcome.

> > +			}
> > +		}
> > +	}
> > +
> > +	// I am adding this flush here in simmetry to the x86 code.
> > +	// Why do I need to call it here and not in remove_p[mu]d
> > +	flush_tlb_all();
> 
> If the page tables weren't freed until this point, it might be that this
> is just amortizing the cost of removing them from the TLBs.
> 
> Given that they're freed first, this makes no sense to me.

Please, see above.

> > +}
> 
> The same comments apply to all the remove_p*d_table() functions, so I
> shan't repeat them.
> 
> > +
> > +static void remove_pmd_table(pmd_t *pmd, unsigned long addr,
> > +	unsigned long end, bool direct)
> > +{
> > +	unsigned long next;
> > +	void *page_addr;
> > +	pte_t *pte;
> > +
> > +	for (; addr < end; addr = next, pmd++) {
> > +		next = pmd_addr_end(addr, end);
> > +
> > +		if (!pmd_present(*pmd))
> > +			continue;
> > +
> > +		// check if we are using 2MB section mappings
> > +		if (pmd_sect(*pmd)) {
> > +			if (PAGE_ALIGNED(addr) && PAGE_ALIGNED(next)) {
> 
> Surely you're intending to check if you can free the whole pmd? i.e.
> that addr and next are pmd-aligned?

Indeed, that's a mistake. It should have been IS_ALIGNED(addr, PMD_SIZE).
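In other words, the check would presumably become:

	if (IS_ALIGNED(addr, PMD_SIZE) && IS_ALIGNED(next, PMD_SIZE)) {

so that both ends of the range are checked against the section size.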

> Can we ever be in a situation where we're requested to free a partial
> pmd that could be section mapped?

Yes, as I said above, for vmemmap mappings.
 
> If that's the case, we'll *need* to split the pmd, which we can't do on
> live page tables.

Please, see below.

> > +				if (!direct) {
> > +					free_pagetable(pmd_page(*pmd),
> > +						get_order(PMD_SIZE), direct);
> > +				}
> > +				/*
> > +				 * This spin lock could be only
> > +				 * taken in _pte_aloc_kernel in
> > +				 * mm/memory.c and nowhere else
> > +				 * (for arm64). Not sure if the
> > +				 * function above can be called
> > +				 * concurrently. In doubt,
> > +				 * I am living it here for now,
> > +				 * but it probably can be removed.
> > +				 */
> > +				spin_lock(&init_mm.page_table_lock);
> > +				pmd_clear(pmd);
> > +				spin_unlock(&init_mm.page_table_lock);
> > +			} else {
> > +				/* If here, we are freeing vmemmap pages. */
> > +				memset((void *)addr, PAGE_INUSE, next - addr);
> > +
> > +				page_addr = page_address(pmd_page(*pmd));
> > +				if (!memchr_inv(page_addr, PAGE_INUSE,
> > +						PMD_SIZE)) {
> > +					free_pagetable(pmd_page(*pmd),
> > +						get_order(PMD_SIZE), direct);
> > +
> > +					/*
> > +					 * This spin lock could be only
> > +					 * taken in _pte_aloc_kernel in
> > +					 * mm/memory.c and nowhere else
> > +					 * (for arm64). Not sure if the
> > +					 * function above can be called
> > +					 * concurrently. In doubt,
> > +					 * I am living it here for now,
> > +					 * but it probably can be removed.
> > +					 */
> > +					spin_lock(&init_mm.page_table_lock);
> > +					pmd_clear(pmd);
> > +					spin_unlock(&init_mm.page_table_lock);
> > +				}
> 
> I don't think this is correct.
> 
> If we're getting rid of a partial pmd, we *must* split the pmd.
> Otherwise, we're leaving bits mapped that should not be. If we split the
> pmd, we can free the individual pages as we would for a non-section
> mapping.
> 
> As above, we can't split block entries within live tables, so that will
> be painful at best.
> 
> If we can't split a pmd, then we cannot free a partial pmd, and will need
> to reject requests to do so.
> 
> The same comments (with s/pmd/pud/, etc) apply for the higher level
> remove_p*d_table functions.
> 
> [...]

This only happens when we are clearing vmemmap memory.  My understanding
is that the whole hack of marking the content of partially unused areas
with the 0xFD constant is exactly to avoid splitting the PMD, but instead
to only clear the full area when we realize that there's no valid struct
page in it anymore. When would this kind of use be a source of problems?

I am also realizing that vmemmaps never use section maps at pud level,
so this code would only need to stay when clearing pmds and ptes.

> > +void remove_pagetable(unsigned long start, unsigned long end, bool direct)
> > +{
> > +	unsigned long next;
> > +	unsigned long addr;
> > +	pgd_t *pgd;
> > +	pud_t *pud;
> > +
> > +	for (addr = start; addr < end; addr = next) {
> > +		next = pgd_addr_end(addr, end);
> > +
> > +		pgd = pgd_offset_k(addr);
> > +		if (pgd_none(*pgd))
> > +			continue;
> > +
> > +		pud = pud_offset(pgd, addr);
> > +		remove_pud_table(pud, addr, next, direct);
> > +		/*
> > +		 * When the PUD is folded on the PGD (three levels of paging),
> > +		 * I did already clear the PMD page in free_pmd_table,
> > +		 * and reset the corresponding PGD==PUD entry.
> > +		 */
> > +#if CONFIG_PGTABLE_LEVELS > 3
> > +		free_pud_table(pgd, direct);
> >  #endif
> 
> This looks suspicious. Shouldn't we have a similar check for PMD, to
> cater for CONFIG_PGTABLE_LEVELS == 2? e.g. 64K pages, 42-bit VA.
> 
> We should be able to hide this distinction in helpers for both cases.

True, we will fix it in the next version.

> > +	}
> > +
> > +	flush_tlb_all();
> 
> This is too late to be useful.
> 
> Thanks,
> Mark.
 
Thanks again for your comments.

Best,
Andrea

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 4/5] Hot-remove implementation for arm64
  2017-04-14 14:01     ` Andrea Reale
@ 2017-04-18 18:21       ` Mark Rutland
  2017-04-18 18:48         ` Ard Biesheuvel
  2017-04-21 10:02         ` Andrea Reale
  0 siblings, 2 replies; 18+ messages in thread
From: Mark Rutland @ 2017-04-18 18:21 UTC (permalink / raw)
  To: Andrea Reale
  Cc: linux-arm-kernel, f.fainelli, m.bielski, scott.branden,
	will.deacon, linux-kernel, qiuxishi, Ard Biesheuvel,
	Laura Abbott

On Fri, Apr 14, 2017 at 03:01:58PM +0100, Andrea Reale wrote:
> I guess it is likely that I might have made assumptions that are true
> for x86_64 but do not hold for arm64. Whenever you feel this is the
> case, I would be really grateful if you could help identify them.

Sure thing.

> On Tue, Apr 11, 2017 at 06:12:11PM +0100, Mark Rutland wrote:
> > On Tue, Apr 11, 2017 at 03:55:42PM +0100, Andrea Reale wrote:

> > > +static void  free_pagetable(struct page *page, int order, bool direct)
> > 
> > This "direct" parameter needs a better name, and a description in a
> > comment block above this function.
>  
> The name direct is inherited directly from the x86_64 hot remove code.
> It serves to distinguish if we are removing either a pagetable page that
> is mapping to the direct mapping space (I think it is called also linear
> mapping area somewhere else) or a pagetable page or a data page 
> from vmemmap.

FWIW, I've largely heard arm64 folk call that the "linear mapping", and
x86 folk call that the "direct mapping". The two are interchangeable.

> In this specific set of functions, the flag is needed because the various
> alloc_init_p*d used for allocating entries for direct memory mapping
> rely on pgd_pgtable_alloc, which in its turn calls  pgtable_page_ctor;
> hence, we need to call the dtor. 

AFAICT, that's not true for the arm64 linear map, since that's created
with our early_pgtable_alloc(), which doesn't call pgtable_page_ctor().

Judging by commit:

  1378dc3d4ba07ccd ("arm64: mm: run pgtable_page_ctor() on non-swapper
                     translation table pages")

... we only do this for non-swapper page tables.

> On the contrary, vmemmap entries are created using vmemmap_alloc_block
> (from within vmemmap_populate), which does not call pgtable_page_ctor
> when allocating pages.
> 
> I am not sure I understand why the pgtable_page_ctor is not called when
> allocating vmemmap page tables, but that's the current situation.

From a quick scan, I see that it's necessary to use pgtable_page_ctor()
for pages that will be used for userspace page tables, but it's not
clear to me if it's ever necessary for pages used for kernel page
tables.

If it is, we appear to have a bug on arm64.

Laura, Ard, thoughts?

> > > +{
> > > +	unsigned long magic;
> > > +	unsigned int nr_pages = 1 << order;
> > > +	struct vmem_altmap *altmap = to_vmem_altmap((unsigned long) page);
> > > +
> > > +	if (altmap) {
> > > +		vmem_altmap_free(altmap, nr_pages);
> > > +		return;
> > > +	}
> > > +
> > > +	/* bootmem page has reserved flag */
> > > +	if (PageReserved(page)) {
> > > +		__ClearPageReserved(page);
> > > +
> > > +		magic = (unsigned long)page->lru.next;
> > > +		if (magic == SECTION_INFO || magic == MIX_SECTION_INFO) {
> > > +			while (nr_pages--)
> > > +				put_page_bootmem(page++);
> > > +		} else {
> > > +			while (nr_pages--)
> > > +				free_reserved_page(page++);
> > > +		}
> > > +	} else {
> > > +		/*
> > > +		 * Only direct pagetable allocation (those allocated via
> > > +		 * hotplug) call the pgtable_page_ctor; vmemmap pgtable
> > > +		 * allocations don't.
> > > +		 */
> > > +		if (direct)
> > > +			pgtable_page_dtor(page);
> > > +
> > > +		free_pages((unsigned long)page_address(page), order);
> > > +	}
> > > +}
> > 
> > This largely looks like a copy of the x86 code. Why can that not be
> > factored out to an arch-neutral location?
> 
> Yes it probably can - the only difference being calling
> pgtable_page_dtor when it needs to - but I am not confident enough to
> say that it would really be architecture neutral or just specific to
> only arm64 and x86.  For example, I don't see this used anywhere else
> for hot-removing memory.

Mhmmm. As above, it's also not clear to me if the ctor()/dtor() dance is
necessary in the first place.

I'm also not clear on why x86_64 and powerpc are the only architectures
that appear to clear up their page tables after __remove_pages(). Do
other architectures not have "real" hot-remove?

> (Actually, also a large part of remove_*_table and free_*_table could
> probably be factored, but I wouldn't be sure how to deal with the
> differences in the pgtable.h macros used throughout)

Let's figure out what's necessary first. Then we'll know if/how we can
align on a common pattern.

[...]

>> > +	page = pmd_page(*pmd);
> > > +
> > > +	free_pagetable(page, 0, direct);
> > 
> > The page has been freed here, and may be subject to arbitrary
> > modification...
> > 
> > > +
> > > +	/*
> > > +	 * This spin lock could be only taken in _pte_aloc_kernel
> > > +	 * in mm/memory.c and nowhere else (for arm64). Not sure if
> > > +	 * the function above can be called concurrently. In doubt,
> > > +	 * I am living it here for now, but it probably can be removed
> > > +	 */
> > > +	spin_lock(&init_mm.page_table_lock);
> > > +	pmd_clear(pmd);
> > 
> > ... but we only remove it from the page tables here, so the page table
> > walkers can see junk in the tables, were the page reused in the
> > meantime.
> > 
> > After clearing the PMD, it needs to be cleared from TLBs. We allow
> > partial walks to be cached, so the TLBs may still start walking this
> > entry or beyond.
> 
> I guess that the safe approach would be something along the lines:
> 1. clear the page table 
> 2. flush the tlbs
> 3. free the page

Yes. That's the sequence we follow elsewhere.

> When I am flushing intermediate p*d entries, would it be
> more appropriate to use something like __flush_tlb_pgtable() to clear
> cached partial walks rather than flushing the whole table? I mean,
> hot-remove is not going to be a frequent operation anyway, so I don't
> think that flushing the whole tlb would be a great deal of harm
> anyway.

Using __flush_tlb_pgtable() sounds sane to me. That's what we do when
tearing down user mappings.

> My question at this point would be: how come that the code structure above
> works for x86_64 hot-remove?

I don't know enough about x86 to say.

> My assumption, when I was writing this, was
> that there would be no problem since this code is called when we are sure
> that all the memory mapped by these entries has been already offlined,
> so nobody should be using those VAs anyway (also considering that there
> cannot be any two mem hot-plug/remove actions running concurrently).
> Is that correct?

The problem is that speculation, Out-of-Order execution, HW prefetching,
and other such things *can* result in those VAs being accessed,
regardless of what code explicitly does in a sequential execution.

If any table relevant to one of those VAs has been freed (and
potentially modified by the allocator, or in another thread), it's
possible that the CPU performs a page table walk and sees junk as a
valid page table entry. As a result, a number of bad things can happen.

If the entry was a pgd, pud, or pmd, the CPU may try to continue to walk
to the relevant pte, accessing a junk address. That could change the
state of MMIO, trigger an SError, etc, or allocate more junk into the
TLBs.

For any level, the CPU might allocate the entry into a TLB, regardless
of whether an existing entry existed. The new entry can conflict with
the existing one, either leading to a TLB conflict abort, or to the TLB
returning junk for that address. Speculation and so on can now access
junk based on that, etc.

[...]

> > > +static void remove_pte_table(pte_t *pte, unsigned long addr,
> > > +	unsigned long end, bool direct)
> > > +{
> > > +	unsigned long next;
> > > +	void *page_addr;
> > > +
> > > +	for (; addr < end; addr = next, pte++) {
> > > +		next = (addr + PAGE_SIZE) & PAGE_MASK;
> > > +		if (next > end)
> > > +			next = end;
> > 
> > Please use the usual p*d_addr_end() functions. See alloc_init_pmd() and
> > friends in arch/arm64/mm/mmu.c for examples of how to use them.
> 
> we used the p*d_addr_end family of functions when dealing with p*d(s). I
> cannot identify an equivalent for pte entries.

Ah; my bad.

I guess we should follow the same pattern as alloc_init_pte() does here,
assuming we use p*d_addr_end() for all the levels above (as for
alloc_init_p*d()).
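
i.e. something like (rough sketch only, mirroring the loop shape of
alloc_init_pte(); the loop body is left as a placeholder):

	pte = pte_offset_kernel(pmd, addr);
	do {
		if (pte_present(*pte)) {
			/* free/clear the entry as appropriate */
		}
	} while (pte++, addr += PAGE_SIZE, addr != end);

... which avoids the need for an explicit 'next' at the pte level entirely.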

> Would you recommend adding a pte_addr_end macro in pgtable.h?

That shouldn't be necessary. Sorry for the confusion.

> > > +
> > > +		if (!pte_present(*pte))
> > > +			continue;
> > > +
> > > +		if (PAGE_ALIGNED(addr) && PAGE_ALIGNED(next)) {
> > 
> > When would those addresses *not* be page-aligned? By construction, next
> > must be. Unplugging partial pages of memory makes no sense, so surely
> > addr is page-aligned when passed in?
> 
> The issue here is that this function is called in one of two cases:
> 1. to clear pagetables of directly mapped (linear) memory 
> 2. Pagetables (and corresponding data pages) for vmemmap. 
> 
> It is my understanding that, in the second case, we might be clearing
> only part of the page content (i.e, only a few struct pages). Note that
> next is page aligned by construction only if next <= end.

Ok. A comment to that effect immediately above this check would be
helpful.

> > > +			/*
> > > +			 * Do not free direct mapping pages since they were
> > > +			 * freed when offlining, or simplely not in use.
> > > +			 */
> > > +			if (!direct)
> > > +				free_pagetable(pte_page(*pte), 0, direct);
> > > +
> > > +			/*
> > > +			 * This spin lock could be only
> > > +			 * taken in _pte_aloc_kernel in
> > > +			 * mm/memory.c and nowhere else
> > > +			 * (for arm64). Not sure if the
> > > +			 * function above can be called
> > > +			 * concurrently. In doubt,
> > > +			 * I am living it here for now,
> > > +			 * but it probably can be removed.
> > > +			 */
> > > +			spin_lock(&init_mm.page_table_lock);
> > > +			pte_clear(&init_mm, addr, pte);
> > > +			spin_unlock(&init_mm.page_table_lock);
> > > +		} else {
> > > +			/*
> > > +			 * If we are here, we are freeing vmemmap pages since
> > > +			 * direct mapped memory ranges to be freed are aligned.
> > > +			 *
> > > +			 * If we are not removing the whole page, it means
> > > +			 * other page structs in this page are being used and
> > > +			 * we canot remove them. So fill the unused page_structs
> > > +			 * with 0xFD, and remove the page when it is wholly
> > > +			 * filled with 0xFD.
> > > +			 */
> > > +			memset((void *)addr, PAGE_INUSE, next - addr);
> > 
> > What's special about 0xFD?
> 
> Just used it as a constant symmetrically to x86_64 code.
> 
> > Why do we need to mess with the page array in this manner? Why can't we
> > detect when a range is free by querying memblock, for example?
> 
> I am not sure I get your suggestion. I guess that the logic here
> is that I cannot be sure that I can free the full page because other
> entries might be in use for active vmemmap mappings. So we just "mark"
> the unused ones and only free the page when all of it is marked. See
> again commit ae9aae9eda2db71, where all this comes from.

I understood that this is deferring freeing until a whole page of struct
pages has been freed.

My concern is that filling the unused memory with an array of junk chars
feels like a hack. We don't do this at the edges when we allocate
memblock today, AFAICT, so this doesn't seem complete.

Is there no "real" datastructure we can use to keep track of what memory
is present? e.g. memblock?

[...]

> > > +static void remove_pmd_table(pmd_t *pmd, unsigned long addr,
> > > +	unsigned long end, bool direct)
> > > +{
> > > +	unsigned long next;
> > > +	void *page_addr;
> > > +	pte_t *pte;
> > > +
> > > +	for (; addr < end; addr = next, pmd++) {
> > > +		next = pmd_addr_end(addr, end);
> > > +
> > > +		if (!pmd_present(*pmd))
> > > +			continue;
> > > +
> > > +		// check if we are using 2MB section mappings
> > > +		if (pmd_sect(*pmd)) {
> > > +			if (PAGE_ALIGNED(addr) && PAGE_ALIGNED(next)) {
> > 
> > Surely you're intending to check if you can free the whole pmd? i.e.
> > that addr and next are pmd-aligned?
> 
> Indeed, that's a mistake. It should have been IS_ALIGNED(addr, PMD_SIZE).

Ok.

> > Can we ever be in a situation where we're requested to free a partial
> > pmd that could be section mapped?
> 
> Yes, as I said above, for vmemmap mappings.

Ok.

> > If that's the case, we'll *need* to split the pmd, which we can't do on
> > live page tables.
> 
> Please, see below.
> 
> > > +				if (!direct) {
> > > +					free_pagetable(pmd_page(*pmd),
> > > +						get_order(PMD_SIZE), direct);
> > > +				}
> > > +				/*
> > > +				 * This spin lock could be only
> > > +				 * taken in _pte_aloc_kernel in
> > > +				 * mm/memory.c and nowhere else
> > > +				 * (for arm64). Not sure if the
> > > +				 * function above can be called
> > > +				 * concurrently. In doubt,
> > > +				 * I am living it here for now,
> > > +				 * but it probably can be removed.
> > > +				 */
> > > +				spin_lock(&init_mm.page_table_lock);
> > > +				pmd_clear(pmd);
> > > +				spin_unlock(&init_mm.page_table_lock);
> > > +			} else {
> > > +				/* If here, we are freeing vmemmap pages. */
> > > +				memset((void *)addr, PAGE_INUSE, next - addr);
> > > +
> > > +				page_addr = page_address(pmd_page(*pmd));
> > > +				if (!memchr_inv(page_addr, PAGE_INUSE,
> > > +						PMD_SIZE)) {
> > > +					free_pagetable(pmd_page(*pmd),
> > > +						get_order(PMD_SIZE), direct);
> > > +
> > > +					/*
> > > +					 * This spin lock could be only
> > > +					 * taken in _pte_aloc_kernel in
> > > +					 * mm/memory.c and nowhere else
> > > +					 * (for arm64). Not sure if the
> > > +					 * function above can be called
> > > +					 * concurrently. In doubt,
> > > +					 * I am living it here for now,
> > > +					 * but it probably can be removed.
> > > +					 */
> > > +					spin_lock(&init_mm.page_table_lock);
> > > +					pmd_clear(pmd);
> > > +					spin_unlock(&init_mm.page_table_lock);
> > > +				}
> > 
> > I don't think this is correct.
> > 
> > If we're getting rid of a partial pmd, we *must* split the pmd.
> > Otherwise, we're leaving bits mapped that should not be. If we split the
> > pmd, we can free the individual pages as we would for a non-section
> > mapping.
> > 
> > As above, we can't split block entries within live tables, so that will
> > be painful at best.
> > 
> > If we can't split a pmd, then we cannot free a partial pmd, and will need
> > to reject requests to do so.
> > 
> > The same comments (with s/pmd/pud/, etc) apply for the higher level
> > remove_p*d_table functions.
> > 
> > [...]
> 
> This only happens when we are clearing vmemmap memory. 

Is that definitely the case?

Currently, I can't see what prevents adding 2M of memory, and then
removing the first 4K of that. We'll use a 2M section for the linear map
of that, but won't unmap the 4K when removing.

Likewise for the next level up, with s/2M/1G/ and s/4K/2M/.

> My understanding
> is that the whole hack of marking the content of partially unused areas
> with the 0xFD constant is exactly to avoid splitting the PMD, but instead
> to only clear the full area when we realize that there's no valid struct
> page in it anymore. When would this kind of use be source of problems?

I understand what's going on for the vmemmap. So long as we don't use
hotpluggable memory to back the vmemmap, that's fine.

As above, my concern is whether splitting a section can ever occur for
the linear map.

Thanks,
Mark.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 4/5] Hot-remove implementation for arm64
  2017-04-18 18:21       ` Mark Rutland
@ 2017-04-18 18:48         ` Ard Biesheuvel
  2017-04-19 15:53           ` Laura Abbott
  2017-04-21 10:02         ` Andrea Reale
  1 sibling, 1 reply; 18+ messages in thread
From: Ard Biesheuvel @ 2017-04-18 18:48 UTC (permalink / raw)
  To: Mark Rutland
  Cc: Andrea Reale, linux-arm-kernel, Florian Fainelli, m.bielski,
	scott.branden, Will Deacon, linux-kernel, Xishi Qiu,
	Laura Abbott

On 18 April 2017 at 19:21, Mark Rutland <mark.rutland@arm.com> wrote:
> On Fri, Apr 14, 2017 at 03:01:58PM +0100, Andrea Reale wrote:
>> I guess it is likely that I might have made assumptions that are true
>> for x86_64 but do not hold for arm64. Whenever you feel this is the
>> case, I would be really grateful if you could help identify them.
>
> Sure thing.
>
>> On Tue, Apr 11, 2017 at 06:12:11PM +0100, Mark Rutland wrote:
>> > On Tue, Apr 11, 2017 at 03:55:42PM +0100, Andrea Reale wrote:
>
>> > > +static void  free_pagetable(struct page *page, int order, bool direct)
>> >
>> > This "direct" parameter needs a better name, and a description in a
>> > comment block above this function.
>>
>> The name direct is inherited directly from the x86_64 hot remove code.
>> It serves to distinguish if we are removing either a pagetable page that
>> is mapping to the direct mapping space (I think it is called also linear
>> mapping area somewhere else) or a pagetable page or a data page
>> from vmemmap.
>
> FWIW, I've largely heard the folk call that the "linear mapping", and
> x86 folk call that the "direct mapping". The two are interchangeable.
>
>> In this specific set of functions, the flag is needed because the various
>> alloc_init_p*d used for allocating entries for direct memory mapping
>> rely on pgd_pgtable_alloc, which in its turn calls  pgtable_page_ctor;
>> hence, we need to call the dtor.
>
> AFAICT, that's not true for the arm64 linear map, since that's created
> with our early_pgtable_alloc(), which doesn't call pgtable_page_ctor().
>
> Judging by commit:
>
>   1378dc3d4ba07ccd ("arm64: mm: run pgtable_page_ctor() on non-swapper
>                      translation table pages")
>
> ... we only do this for non-swapper page tables.
>
>> On the contrary, vmemmap entries are created using vmemmap_alloc_block
>> (from within vmemmap_populate), which does not call pgtable_page_ctor
>> when allocating pages.
>>
>> I am not sure I understand why the pgtable_page_ctor is not called when
>> allocating vmemmap page tables, but that's the current situation.
>
> From a quick scan, I see that it's necessary to use pgtable_page_ctor()
> for pages that will be used for userspace page tables, but it's not
> clear to me if it's ever necessary for pages used for kernel page
> tables.
>
> If it is, we appear to have a bug on arm64.
>
> Laura, Ard, thoughts?
>

The generic apply_to_page_range() will expect the PTE lock to be
initialized for page table pages that are not part of init_mm. For
arm64, that is precisely efi_mm as far as I am aware. For EFI, the
locking is unnecessary but does no harm (the permissions are set once
via apply_to_page_range() at boot), so I added this call when adding
support for strict permissions in EFI rt services mappings.

So I think it is appropriate for create_pgd_mapping() to be in charge
of calling the ctor(). We simply have no destroy_pgd_mapping()
counterpart that would be the place for the dtor() call, given that we
never take down EFI rt services mappings.

Whether it makes sense or not to lock/unlock in apply_to_page_range()
is something I did not spend any brain cycles on at the time.

-- 
Ard.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 4/5] Hot-remove implementation for arm64
  2017-04-18 18:48         ` Ard Biesheuvel
@ 2017-04-19 15:53           ` Laura Abbott
  2017-04-21 10:05             ` Andrea Reale
  0 siblings, 1 reply; 18+ messages in thread
From: Laura Abbott @ 2017-04-19 15:53 UTC (permalink / raw)
  To: Ard Biesheuvel, Mark Rutland
  Cc: Andrea Reale, linux-arm-kernel, Florian Fainelli, m.bielski,
	scott.branden, Will Deacon, linux-kernel, Xishi Qiu

On 04/18/2017 11:48 AM, Ard Biesheuvel wrote:
> On 18 April 2017 at 19:21, Mark Rutland <mark.rutland@arm.com> wrote:
>> On Fri, Apr 14, 2017 at 03:01:58PM +0100, Andrea Reale wrote:
>>> I guess it is likely that I might have made assumptions that are true
>>> for x86_64 but do not hold for arm64. Whenever you feel this is the
>>> case, I would be really grateful if you could help identify them.
>>
>> Sure thing.
>>
>>> On Tue, Apr 11, 2017 at 06:12:11PM +0100, Mark Rutland wrote:
>>>> On Tue, Apr 11, 2017 at 03:55:42PM +0100, Andrea Reale wrote:
>>
>>>>> +static void  free_pagetable(struct page *page, int order, bool direct)
>>>>
>>>> This "direct" parameter needs a better name, and a description in a
>>>> comment block above this function.
>>>
>>> The name direct is inherited directly from the x86_64 hot remove code.
>>> It serves to distinguish if we are removing either a pagetable page that
>>> is mapping to the direct mapping space (I think it is called also linear
>>> mapping area somewhere else) or a pagetable page or a data page
>>> from vmemmap.
>>
>> FWIW, I've largely heard the folk call that the "linear mapping", and
>> x86 folk call that the "direct mapping". The two are interchangeable.
>>
>>> In this specific set of functions, the flag is needed because the various
>>> alloc_init_p*d used for allocating entries for direct memory mapping
>>> rely on pgd_pgtable_alloc, which in its turn calls  pgtable_page_ctor;
>>> hence, we need to call the dtor.
>>
>> AFAICT, that's not true for the arm64 linear map, since that's created
>> with our early_pgtable_alloc(), which doesn't call pgtable_page_ctor().
>>
>> Judging by commit:
>>
>>    1378dc3d4ba07ccd ("arm64: mm: run pgtable_page_ctor() on non-swapper
>>                       translation table pages")
>>
>> ... we only do this for non-swapper page tables.
>>
>>> On the contrary, vmemmap entries are created using vmemmap_alloc_block
>>> (from within vmemmap_populate), which does not call pgtable_page_ctor
>>> when allocating pages.
>>>
>>> I am not sure I understand why the pgtable_page_ctor is not called when
>>> allocating vmemmap page tables, but that's the current situation.
>>
>>  From a quick scan, I see that it's necessary to use pgtable_page_ctor()
>> for pages that will be used for userspace page tables, but it's not
>> clear to me if it's ever necessary for pages used for kernel page
>> tables.
>>
>> If it is, we appear to have a bug on arm64.
>>
>> Laura, Ard, thoughts?
>>
> 
> The generic apply_to_page_range() will expect the PTE lock to be
> initialized for page table pages that are not part of init_mm. For
> arm64, that is precisely efi_mm as far as I am aware. For EFI, the
> locking is unnecessary but does no harm (the permissions are set once
> via apply_to_page_range() at boot), so I added this call when adding
> support for strict permissions in EFI rt services mappings.
> 
> So I think it is appropriate for create_pgd_mapping() to be in charge
> of calling the ctor(). We simply have no destroy_pgd_mapping()
> counterpart that would be the place for the dtor() call, given that we
> never take down EFI rt services mappings.
>
> Whether it makes sense or not to lock/unlock in apply_to_page_range()
> is something I did not spend any brain cycles on at the time.
> 

Agreed there shouldn't be a problem right now. I do think the locking is
appropriate in apply_to_page_range given what other functions also get
locked.

I really wish this were less asymmetrical though since it gets hard
to reason about. It looks like hotplug_paging will call the ctor,
so is there an issue with calling hot-remove on memory that was once
hot-added or is that not a concern?

Thanks,
Laura

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 4/5] Hot-remove implementation for arm64
  2017-04-18 18:21       ` Mark Rutland
  2017-04-18 18:48         ` Ard Biesheuvel
@ 2017-04-21 10:02         ` Andrea Reale
  1 sibling, 0 replies; 18+ messages in thread
From: Andrea Reale @ 2017-04-21 10:02 UTC (permalink / raw)
  To: Mark Rutland
  Cc: linux-arm-kernel, f.fainelli, m.bielski, scott.branden,
	will.deacon, linux-kernel, qiuxishi, Ard Biesheuvel,
	Laura Abbott

Hi all,

Thanks Mark, Ard and Laura for your comments. Replies in-line.

On Tue, Apr 18, 2017 at 07:21:26PM +0100, Mark Rutland wrote:
[...]
> >  
> > The name direct is inherited directly from the x86_64 hot remove code.
> > It serves to distinguish if we are removing either a pagetable page that
> > is mapping to the direct mapping space (I think it is called also linear
> > mapping area somewhere else) or a pagetable page or a data page 
> > from vmemmap.
> 
> FWIW, I've largely heard the folk call that the "linear mapping", and
> x86 folk call that the "direct mapping". The two are interchangeable.
> 

Thanks for the clarification; We'll just call it "linear mapping" then.

> > In this specific set of functions, the flag is needed because the various
> > alloc_init_p*d used for allocating entries for direct memory mapping
> > rely on pgd_pgtable_alloc, which in its turn calls  pgtable_page_ctor;
> > hence, we need to call the dtor. 
> 
> AFAICT, that's not true for the arm64 linear map, since that's created
> with our early_pgtable_alloc(), which doesn't call pgtable_page_ctor().
> 
> Judging by commit:
> 
>   1378dc3d4ba07ccd ("arm64: mm: run pgtable_page_ctor() on non-swapper
>                      translation table pages")
> 
> ... we only do this for non-swapper page tables.
> 
> > On the contrary, vmemmap entries are created using vmemmap_alloc_block
> > (from within vmemmap_populate), which does not call pgtable_page_ctor
> > when allocating pages.
> > 
> > I am not sure I understand why the pgtable_page_ctor is not called when
> > allocating vmemmap page tables, but that's the current situation.
> 
> From a quick scan, I see that it's necessary to use pgtable_page_ctor()
> for pages that will be used for userspace page tables, but it's not
> clear to me if it's ever necessary for pages used for kernel page
> tables.
> 
> If it is, we appear to have a bug on arm64.
> 
> Laura, Ard, thoughts?
> 

More comments on that as a separate reply to Laura's and Ard's messages.

[...]

> > 
> > I guess that the safe approach would be something along the lines:
> > 1. clear the page table 
> > 2. flush the tlbs
> > 3. free the page
> 
> Yes. That's the sequence we follow elsewhere.
> 
> > When I am flushing intermediate p*d entries, would it be
> > more appropriate to use something like __flush_tlb_pgtable() to clear
> > cached partial walks rather than flushing the whole table? I mean,
> > hot-remove is not going to be a frequent operation anyway, so I don't
> > think that flushing the whole tlb would be a great deal of harm
> > anyway.
> 
> Using __flush_tlb_pgtable() sounds sane to me. That's what we do when
> tearing down user mappings.
> 
> > My question at this point would be: how come that the code structure above
> > works for x86_64 hot-remove?
> 
> I don't know enough about x86 to say.
> 
> > My assumption, when I was writing this, was
> > that there would be no problem since this code is called when we are sure
> > that all the memory mapped by these entries has been already offlined,
> > so nobody should be using those VAs anyway (also considering that there
> > cannot be any two mem hot-plug/remove actions running concurrently).
> > Is that correct?
> 
> The problem is that speculation, Out-of-Order execution, HW prefetching,
> and other such things *can* result in those VAs being accessed,
> regardless of what code explicitly does in a sequential execution.
> 
> If any table relevant to one of those VAs has been freed (and
> potentially modified by the allocator, or in another thread), it's
> possible that the CPU performs a page table walk and sees junk as a
> valid page table entry. As a result, a number of bad things can happen.
> 
> If the entry was a pgd, pud, or pmd, the CPU may try to continue to walk
> to the relevant pte, accessing a junk address. That could change the
> state of MMIO, trigger an SError, etc, or allocate more junk into the
> TLBs.
> 
> For any level, the CPU might allocate the entry into a TLB, regardless
> of whether an existing entry existed. The new entry can conflict with
> the existing one, either leading to a TLB conflict abort, or to the TLB
> returning junk for that address. Speculation and so on can now access
> junk based on that, etc.
> 
> [...]
> 

Thanks, we'll just clear it the proper way.

[...]

> > > > +
> > > > +		if (!pte_present(*pte))
> > > > +			continue;
> > > > +
> > > > +		if (PAGE_ALIGNED(addr) && PAGE_ALIGNED(next)) {
> > > 
> > > When would those addresses *not* be page-aligned? By construction, next
> > > must be. Unplugging partial pages of memory makes no sense, so surely
> > > addr is page-aligned when passed in?
> > 
> > The issue here is that this function is called in one of two cases:
> > 1. to clear pagetables of directly mapped (linear) memory 
> > 2. Pagetables (and corresponding data pages) for vmemmap. 
> > 
> > It is my understanding that, in the second case, we might be clearing
> > only part of the page content (i.e, only a few struct pages). Note that
> > next is page aligned by construction only if next <= end.
> 
> Ok. A comment to that effect immediately above this check would be
> helpful.
> 

Ok, thanks.

> > > > +			/*
> > > > +			 * Do not free direct mapping pages since they were
> > > > +			 * freed when offlining, or simplely not in use.
> > > > +			 */
> > > > +			if (!direct)
> > > > +				free_pagetable(pte_page(*pte), 0, direct);
> > > > +
> > > > +			/*
> > > > +			 * This spin lock could be only
> > > > +			 * taken in _pte_aloc_kernel in
> > > > +			 * mm/memory.c and nowhere else
> > > > +			 * (for arm64). Not sure if the
> > > > +			 * function above can be called
> > > > +			 * concurrently. In doubt,
> > > > +			 * I am living it here for now,
> > > > +			 * but it probably can be removed.
> > > > +			 */
> > > > +			spin_lock(&init_mm.page_table_lock);
> > > > +			pte_clear(&init_mm, addr, pte);
> > > > +			spin_unlock(&init_mm.page_table_lock);
> > > > +		} else {
> > > > +			/*
> > > > +			 * If we are here, we are freeing vmemmap pages since
> > > > +			 * direct mapped memory ranges to be freed are aligned.
> > > > +			 *
> > > > +			 * If we are not removing the whole page, it means
> > > > +			 * other page structs in this page are being used and
> > > > +			 * we canot remove them. So fill the unused page_structs
> > > > +			 * with 0xFD, and remove the page when it is wholly
> > > > +			 * filled with 0xFD.
> > > > +			 */
> > > > +			memset((void *)addr, PAGE_INUSE, next - addr);
> > > 
> > > What's special about 0xFD?
> > 
> > Just used it as a constant symmetrically to x86_64 code.
> > 
> > > Why do we need to mess with the page array in this manner? Why can't we
> > > detect when a range is free by querying memblock, for example?
> > 
> > I am not sure I get your suggestion. I guess that the logic here
> > is that I cannot be sure that I can free the full page because other
> > entries might be in use for active vmemmap mappings. So we just "mark"
> > the unused ones and only free the page when all of it is marked. See
> > again commit ae9aae9eda2db71, where all this comes from.
> 
> I understood that this is deferring freeing until a whole page of struct
> pages has been freed.
> 
> My concern is that filling the unused memory with an array of junk chars
> feels like a hack. We don't do this at the edges when we allocate
> memblock today, AFAICT, so this doesn't seem complete.
> 
> Is there no "real" datastructure we can use to keep track of what memory
> is present? e.g. memblock?
> 
> [...]

We could add a MEMBLOCK_VMEMMAP_UNUSED flag in memblock and mark the
partially unused memblock range with that instead of using the 0xFD
hack. Eventually that might even be backported to x86. Is that what
you are suggesting? I am not confident we can reuse an existing flag
for the purpose without breaking something else.
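
To make that a bit more concrete, I am thinking of something like the
sketch below. This is purely hypothetical: neither the flag value nor the
helper exist today, and the helper would have to live in mm/memblock.c
next to memblock_mark_hotplug() and friends, since memblock_setclr_flag()
is static there:

	/* new flag, alongside MEMBLOCK_HOTPLUG/MIRROR/NOMAP */
	#define MEMBLOCK_VMEMMAP_UNUSED	0x8

	int memblock_mark_vmemmap_unused(phys_addr_t base, phys_addr_t size)
	{
		return memblock_setclr_flag(base, size, 1,
					    MEMBLOCK_VMEMMAP_UNUSED);
	}

The remove path would then mark the sub-ranges it has torn down and only
free the backing page once the whole page worth of struct pages is marked.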

[...]

> > > If that's the case, we'll *need* to split the pmd, which we can't do on
> > > live page tables.
> > 
> > Please, see below.
> > 
> > > > +				if (!direct) {
> > > > +					free_pagetable(pmd_page(*pmd),
> > > > +						get_order(PMD_SIZE), direct);
> > > > +				}
> > > > +				/*
> > > > +				 * This spin lock could be only
> > > > +				 * taken in _pte_aloc_kernel in
> > > > +				 * mm/memory.c and nowhere else
> > > > +				 * (for arm64). Not sure if the
> > > > +				 * function above can be called
> > > > +				 * concurrently. In doubt,
> > > > +				 * I am living it here for now,
> > > > +				 * but it probably can be removed.
> > > > +				 */
> > > > +				spin_lock(&init_mm.page_table_lock);
> > > > +				pmd_clear(pmd);
> > > > +				spin_unlock(&init_mm.page_table_lock);
> > > > +			} else {
> > > > +				/* If here, we are freeing vmemmap pages. */
> > > > +				memset((void *)addr, PAGE_INUSE, next - addr);
> > > > +
> > > > +				page_addr = page_address(pmd_page(*pmd));
> > > > +				if (!memchr_inv(page_addr, PAGE_INUSE,
> > > > +						PMD_SIZE)) {
> > > > +					free_pagetable(pmd_page(*pmd),
> > > > +						get_order(PMD_SIZE), direct);
> > > > +
> > > > +					/*
> > > > +					 * This spin lock could be only
> > > > +					 * taken in _pte_aloc_kernel in
> > > > +					 * mm/memory.c and nowhere else
> > > > +					 * (for arm64). Not sure if the
> > > > +					 * function above can be called
> > > > +					 * concurrently. In doubt,
> > > > +					 * I am living it here for now,
> > > > +					 * but it probably can be removed.
> > > > +					 */
> > > > +					spin_lock(&init_mm.page_table_lock);
> > > > +					pmd_clear(pmd);
> > > > +					spin_unlock(&init_mm.page_table_lock);
> > > > +				}
> > > 
> > > I don't think this is correct.
> > > 
> > > If we're getting rid of a partial pmd, we *must* split the pmd.
> > > Otherwise, we're leaving bits mapped that should not be. If we split the
> > > pmd, we can free the individual pages as we would for a non-section
> > > mapping.
> > > 
> > > As above, we can't split block entries within live tables, so that will
> > > be painful at best.
> > > 
> > > If we can't split a pmd, then we cannot free a partial pmd, and will need
> > > to reject requests to do so.
> > > 
> > > The same comments (with s/pmd/pud/, etc) apply for the higher level
> > > remove_p*d_table functions.
> > > 
> > > [...]
> > 
> > This only happens when we are clearing vmemmap memory. 
> 
> Is that definitely the case?
> 
> Currently, I can't see what prevents adding 2M of memory, and then
> removing the first 4K of that. We'll use a 2M section for the linear map
> of that, but won't unmap the 4K when removing.
> 
> Likewise for the next level up, with s/2M/1G/ and s/4K/2M/.
> 

You're right. The confusion comes from the fact that we were only
considering the case where we are hot-removing memory that was previously
hot-added. In that case, the memory granularity of hot-add and hot-remove
is the same and fixed at compile time to get_memory_block_size() ==
1<<SECTION_SIZE_BITS (1GB by default).
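
(For reference, with the default arm64 value

	/* arch/arm64/include/asm/sparsemem.h */
	#define SECTION_SIZE_BITS	30

this means hot-add/hot-remove operate on 1GB-sized, 1GB-aligned blocks.)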

In case we are removing memory that was in the linear mapping since boot,
then we might get in the situation you are describing above if someone has
manually changed SECTION_SIZE_BITS to a different value in sparsemem.h.
If not, I believe that the fact that we are removing one aligned GB per
shot should guarantee that we never have to split a PUD.

However, we should definitely handle all the cases. Since splitting a
P[UM]D might be a nightmare (if possible at all), two more or less clean
solutions I can think of are:
1. Only allow hot-removing memory that was previously hot-added; or
2. Detect when we are trying to only partially remove a section mapped
   area of the linear mapping and just fail (and roll back) the remove
   process.

I think I prefer no. 2; it has the advantage of not forbidding a priori
the removal of non-hot-added memory, and its only disadvantage is that in
some (uncommon?) cases you would simply be forbidden to hot-remove some
memory, with no destructive side effects. What do you think?
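
For option 2, what I have in mind is roughly the check below at the top of
the loop body in remove_pmd_table() (only a sketch: it assumes the remove_*
helpers are reworked to return an error code that is propagated up, so that
the whole operation can be failed and rolled back):

	if (direct && pmd_sect(*pmd) &&
	    (!IS_ALIGNED(addr, PMD_SIZE) || !IS_ALIGNED(next, PMD_SIZE))) {
		/*
		 * Refuse to split a live block mapping of the linear map:
		 * reject the partial hot-remove request.
		 */
		return -EINVAL;
	}

and similarly for pud-level block mappings in remove_pud_table().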

[...]
 
> Thanks,
> Mark.
> 

Thanks again and best regards,
Andrea

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 4/5] Hot-remove implementation for arm64
  2017-04-19 15:53           ` Laura Abbott
@ 2017-04-21 10:05             ` Andrea Reale
  2017-04-24 23:59               ` Laura Abbott
  0 siblings, 1 reply; 18+ messages in thread
From: Andrea Reale @ 2017-04-21 10:05 UTC (permalink / raw)
  To: Laura Abbott
  Cc: Ard Biesheuvel, Mark Rutland, linux-arm-kernel, Florian Fainelli,
	m.bielski, scott.branden, Will Deacon, linux-kernel, Xishi Qiu

Hi all,

thanks for taking the time to comment. Replies in-line.

On Wed, Apr 19, 2017 at 08:53:13AM -0700, Laura Abbott wrote:
> On 04/18/2017 11:48 AM, Ard Biesheuvel wrote:
> >On 18 April 2017 at 19:21, Mark Rutland <mark.rutland@arm.com> wrote:
> >>On Fri, Apr 14, 2017 at 03:01:58PM +0100, Andrea Reale wrote:

[...]

> >>
> >> From a quick scan, I see that it's necessary to use pgtable_page_ctor()
> >>for pages that will be used for userspace page tables, but it's not
> >>clear to me if it's ever necessary for pages used for kernel page
> >>tables.
> >>
> >>If it is, we appear to have a bug on arm64.
> >>
> >>Laura, Ard, thoughts?
> >>
> >
> >The generic apply_to_page_range() will expect the PTE lock to be
> >initialized for page table pages that are not part of init_mm. For
> >arm64, that is precisely efi_mm as far as I am aware. For EFI, the
> >locking is unnecessary but does no harm (the permissions are set once
> >via apply_to_page_range() at boot), so I added this call when adding
> >support for strict permissions in EFI rt services mappings.
> >
> >So I think it is appropriate for create_pgd_mapping() to be in charge
> >of calling the ctor(). We simply have no destroy_pgd_mapping()
> >counterpart that would be the place for the dtor() call, given that we
> >never take down EFI rt services mappings.
> >
> >Whether it makes sense or not to lock/unlock in apply_to_page_range()
> >is something I did not spend any brain cycles on at the time.
> >
> 
> Agreed there shouldn't be a problem right now. I do think the locking is
> appropriate in apply_to_page_range given what other functions also get
> locked.
> 
> I really wish this were less asymmetrical though, since it gets hard
> to reason about. It looks like hotplug_paging will call the ctor,
> so is there an issue with calling hot-remove on memory that was once
> hot-added or is that not a concern?
> 
> Thanks,
> Laura

I think the confusion comes from the fact that, in hotplug_paging, we are
passing pgd_pgtable_alloc as the page allocator for __create_pgd_mapping,
which always calls the ctor.

If I got things right (but, please, correct me if I am wrong), we don't
need the pte lock that the ctor sets up, since - in hotplug - we are
adding to init_mm.

Moreover, I am just realizing that calling the dtor while hot-removing
might create problems when removing memory that *was not* previously
hotplugged, as we are calling a dtor on something that was never
ctor'ed. Is that what you were hinting at, Laura?
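
To illustrate what I mean, here is a minimal sketch, assuming the helpers
would live in arch/arm64/mm/mmu.c (the names are made up and this is not
the code in the current patch):

/*
 * Sketch only: page tables added to init_mm do not need the split pte
 * lock, so the hotplug allocator could skip pgtable_page_ctor() entirely;
 * the hot-remove path then frees the page directly and never has to call
 * a dtor for which no ctor was ever run.
 */
static phys_addr_t hotplug_pgtable_alloc_noctor(void)
{
	void *ptr = (void *)get_zeroed_page(GFP_KERNEL);

	BUG_ON(!ptr);
	/* Make the zeroed page visible to the page table walker */
	dsb(ishst);
	return __pa(ptr);
}

static void hotplug_pgtable_free(unsigned long pgtable_va)
{
	/* No pgtable_page_dtor(): no ctor was ever called for this page */
	free_page(pgtable_va);
}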

Thanks and best regards,
Andrea

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 3/5] Memory hotplug support for arm64 platform (v2)
  2017-04-11 15:58   ` Mark Rutland
@ 2017-04-24 16:44     ` Maciej Bielski
  2017-04-24 17:35     ` Maciej Bielski
  1 sibling, 0 replies; 18+ messages in thread
From: Maciej Bielski @ 2017-04-24 16:44 UTC (permalink / raw)
  To: Mark Rutland
  Cc: Andrea Reale, linux-arm-kernel, f.fainelli, scott.branden,
	will.deacon, linux-kernel, qiuxishi, a.rigo

Hi Mark,

Thank you for having a look at the code and providing comments. The same
goes for the others who replied to the whole patchset. To a large extent,
the first purpose of this code was simply to implement the functionality;
I am fully aware that it may be suboptimal in places, so your feedback is
very much appreciated.

More answers below.

On Tue, Apr 11, 2017 at 04:58:43PM +0100, Mark Rutland wrote:
> Hi,
> 
> On Tue, Apr 11, 2017 at 03:55:22PM +0100, Andrea Reale wrote:
> > From: Maciej Bielski <m.bielski@virtualopensystems.com>
> > 
> > This is a second and improved version of the patch previously released
> > in [3].
> > 
> > It builds on the work by Scott Branden [2] and, hence,
> > it needs to be applied on top of Scott's patches [2].
> > Comments are very welcome.
> > 
> > Changes from the original patchset and known issues:
> > 
> > - Compared to Scott's original patchset, this work adds the mapping of
> >   the new hotplugged pages into the kernel page tables. This is done by
> >   copying the old swapper_pg_dir over a new page, adding the new mappings,
> >   and then switching to the newly built pg_dir (see `hotplug_paging` in
> >   arch/arm64/mmu.c). There might be better ways to do this: suggestions
> >   are more than welcome.
> 
> For this reply, I'm just going to focus on the PGD replacement.
> 
> It's not clear to me if we need to swap the PGD, since everything we do
> here is strictly additive and non-conflicting, and it should be safe to
> add things to the swapper_pg_dir live.
> 
> However, assuming there's something I've missed, I have some comments
> below.
> 
> [...]

For one CPU it should work; I have quickly checked on QEMU. More concerns
regarding multiple CPUs below (*).

> 
> > +#ifdef CONFIG_MEMORY_HOTPLUG
> > +
> > +/*
> > + * hotplug_paging() is used by memory hotplug to build new page tables
> > + * for hot added memory.
> > + */
> > +void hotplug_paging(phys_addr_t start, phys_addr_t size)
> > +{
> > +
> > +	struct page *pg;
> > +	phys_addr_t pgd_phys = pgd_pgtable_alloc();
> > +	pgd_t *pgd = pgd_set_fixmap(pgd_phys);
> > +
> > +	memcpy(pgd, swapper_pg_dir, PAGE_SIZE);
> 
> s/PAGE_SIZE/PGD_SIZE/ (and below, too).
> 
> See commit 12f043ff2b28fa64 ("arm64: fix warning about swapper_pg_dir
> overflow").
> 

Noted, thanks.

> > +
> > +	__create_pgd_mapping(pgd, start, __phys_to_virt(start), size,
> > +		PAGE_KERNEL, pgd_pgtable_alloc, false);
> > +
> > +	cpu_replace_ttbr1(__va(pgd_phys));
> 
> Other CPUs may be online at this point, and cpu_replace_ttbr1() was only
> intended for UP operation. Other CPUs will still have swapper_pg_dir in
> their TTBR1_EL1 at this point in time...
> 
> > +	memcpy(swapper_pg_dir, pgd, PAGE_SIZE);
> 
> ... which this will alter, in an unsafe fashion.
> 
> The other CPUs are live, and might be altering swapper. e.g. one might
> be using the fixmap to alter code. We will need some stop_machine()-like
> machinery here to synchronise with other CPUs, ensuring that they don't
> have swapper_pg_dir live.

(*)
I am familiar with stop_machine(); for example, it is used at a later stage,
when pages are onlined (inside build_all_zonelists). But do you think that we
should check whether swapper_pg_dir is under modification before stopping (and
optionally wait until this modification is finished)? Is there a mechanism to
serialize write access to swapper_pg_dir?
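
For what it is worth, if purely additive updates to the live swapper_pg_dir
are indeed safe, as suggested above, hotplug_paging could become something
like the sketch below (assuming it stays in arch/arm64/mm/mmu.c; the use of
init_mm.page_table_lock for serialisation is my assumption, not something
the current patch does):

/*
 * Sketch only: map the hot-added range directly into the live
 * swapper_pg_dir, serialising writers with init_mm's page_table_lock,
 * instead of building a temporary pgd and switching TTBR1 on every CPU.
 */
void hotplug_paging(phys_addr_t start, phys_addr_t size)
{
	spin_lock(&init_mm.page_table_lock);
	__create_pgd_mapping(swapper_pg_dir, start, __phys_to_virt(start),
			     size, PAGE_KERNEL, pgd_pgtable_alloc, false);
	spin_unlock(&init_mm.page_table_lock);
}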

> 
> Unfortunately, we can't change other CPUs to the temporary pgdir in a cross
> call, then fix things up, then do another cross call. If any exception
> is taken when we're in the temporary pgdir, __uaccess_ttbr0_disable will
> install junk into TTBR0, and we risk the usual set of pain that junk TLBs
> entail.
> 
> We appear to have a latent bug at boot time along those lines, when
> setting up the page tables and initialising KASAN. I'll see about
> cleaning that up.

Great, thank you for any hints.

> 
> Thanks,
> Mark.

Best regards,

-- 
Maciej Bielski

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 3/5] Memory hotplug support for arm64 platform (v2)
  2017-04-11 15:58   ` Mark Rutland
  2017-04-24 16:44     ` Maciej Bielski
@ 2017-04-24 17:35     ` Maciej Bielski
  1 sibling, 0 replies; 18+ messages in thread
From: Maciej Bielski @ 2017-04-24 17:35 UTC (permalink / raw)
  To: Mark Rutland
  Cc: Andrea Reale, linux-arm-kernel, f.fainelli, scott.branden,
	will.deacon, linux-kernel, qiuxishi, a.rigo

Hi Mark,

Thank you for having a look at the code and providing comments. The same
goes for the others who replied to the whole patchset. To a large extent,
the first purpose of this code was simply to implement the functionality;
I am fully aware that it may be suboptimal in places, so your feedback is
very much appreciated.

More answers below.

On Tue, Apr 11, 2017 at 04:58:43PM +0100, Mark Rutland wrote:
> Hi,
> 
> On Tue, Apr 11, 2017 at 03:55:22PM +0100, Andrea Reale wrote:
> > From: Maciej Bielski <m.bielski@virtualopensystems.com>
> > 
> > This is a second and improved version of the patch previously released
> > in [3].
> > 
> > It builds on the work by Scott Branden [2] and, hence,
> > it needs to be applied on top of Scott's patches [2].
> > Comments are very welcome.
> > 
> > Changes from the original patchset and known issues:
> > 
> > - Compared to Scott's original patchset, this work adds the mapping of
> >   the new hotplugged pages into the kernel page tables. This is done by
> >   copying the old swapper_pg_dir over a new page, adding the new mappings,
> >   and then switching to the newly built pg_dir (see `hotplug_paging` in
> >   arch/arm64/mmu.c). There might be better ways to do this: suggestions
> >   are more than welcome.
> 
> For this reply, I'm just going to focus on the PGD replacement.
> 
> It's not clear to me if we need to swap the PGD, since everything we do
> here is strictly additive and non-conflicting, and it should be safe to
> add things to the swapper_pg_dir live.
> 
> However, assuming there's something I've missed, I have some comments
> below.
> 
> [...]

For one CPU it should work; I have quickly checked on QEMU. More concerns
regarding multiple CPUs below (*).

> 
> > +#ifdef CONFIG_MEMORY_HOTPLUG
> > +
> > +/*
> > + * hotplug_paging() is used by memory hotplug to build new page tables
> > + * for hot added memory.
> > + */
> > +void hotplug_paging(phys_addr_t start, phys_addr_t size)
> > +{
> > +
> > +	struct page *pg;
> > +	phys_addr_t pgd_phys = pgd_pgtable_alloc();
> > +	pgd_t *pgd = pgd_set_fixmap(pgd_phys);
> > +
> > +	memcpy(pgd, swapper_pg_dir, PAGE_SIZE);
> 
> s/PAGE_SIZE/PGD_SIZE/ (and below, too).
> 
> See commit 12f043ff2b28fa64 ("arm64: fix warning about swapper_pg_dir
> overflow").
> 

Noted, thanks.

> > +
> > +	__create_pgd_mapping(pgd, start, __phys_to_virt(start), size,
> > +		PAGE_KERNEL, pgd_pgtable_alloc, false);
> > +
> > +	cpu_replace_ttbr1(__va(pgd_phys));
> 
> Other CPUs may be online at this point, and cpu_replace_ttbr1() was only
> intended for UP operation. Other CPUs will still have swapper_pg_dir in
> their TTBR1_EL1 at this point in time...
> 
> > +	memcpy(swapper_pg_dir, pgd, PAGE_SIZE);
> 
> ... which this will alter, in an unsafe fashion.
> 
> The other CPUs are live, and might be altering swapper. e.g. one might
> be using the fixmap to alter code. We will need some stop_machine()-like
> machinery here to synchronise with other CPUs, ensuring that they don't
> have swapper_pg_dir live.

(*)
I am familiar with stop_machine(); for example, it is used at a later stage,
when pages are onlined (inside build_all_zonelists). But do you think that we
should check whether swapper_pg_dir is under modification before stopping (and
optionally wait until this modification is finished)? Is there a mechanism to
serialize write access to swapper_pg_dir?

> 
> Unfortunately, we can't change other CPUs to the temporary pgdir in a cross
> call, then fix things up, then do another cross call. If any exception
> is taken when we're in the temporary pgdir, __uaccess_ttbr0_disable will
> install junk into TTBR0, and we risk the usual set of pain that junk TLBs
> entail.
> 
> We appear to have a latent bug at boot time along those lines, when
> setting up the page tables and initialising KASAN. I'll see about
> cleaning that up.

Great, thank you for any hints.

> 
> Thanks,
> Mark.

Best regards,

-- 
Maciej Bielski

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 4/5] Hot-remove implementation for arm64
  2017-04-21 10:05             ` Andrea Reale
@ 2017-04-24 23:59               ` Laura Abbott
  0 siblings, 0 replies; 18+ messages in thread
From: Laura Abbott @ 2017-04-24 23:59 UTC (permalink / raw)
  To: Andrea Reale
  Cc: Ard Biesheuvel, Mark Rutland, linux-arm-kernel, Florian Fainelli,
	m.bielski, scott.branden, Will Deacon, linux-kernel, Xishi Qiu

On 04/21/2017 03:05 AM, Andrea Reale wrote:
> Hi all,
> 
> thanks for taking the time to comment. Replies in-line.
> 
> On Wed, Apr 19, 2017 at 08:53:13AM -0700, Laura Abbott wrote:
>> On 04/18/2017 11:48 AM, Ard Biesheuvel wrote:
>>> On 18 April 2017 at 19:21, Mark Rutland <mark.rutland@arm.com> wrote:
>>>> On Fri, Apr 14, 2017 at 03:01:58PM +0100, Andrea Reale wrote:
> 
> [...]
> 
>>>>
>>>> From a quick scan, I see that it's necessary to use pgtable_page_ctor()
>>>> for pages that will be used for userspace page tables, but it's not
>>>> clear to me if it's ever necessary for pages used for kernel page
>>>> tables.
>>>>
>>>> If it is, we appear to have a bug on arm64.
>>>>
>>>> Laura, Ard, thoughts?
>>>>
>>>
>>> The generic apply_to_page_range() will expect the PTE lock to be
>>> initialized for page table pages that are not part of init_mm. For
>>> arm64, that is precisely efi_mm as far as I am aware. For EFI, the
>>> locking is unnecessary but does no harm (the permissions are set once
>>> via apply_to_page_range() at boot), so I added this call when adding
>>> support for strict permissions in EFI rt services mappings.
>>>
>>> So I think it is appropriate for create_pgd_mapping() to be in charge
>>> of calling the ctor(). We simply have no destroy_pgd_mapping()
>>> counterpart that would be the place for the dtor() call, given that we
>>> never take down EFI rt services mappings.
>>>
>>> Whether it makes sense or not to lock/unlock in apply_to_page_range()
>>> is something I did not spend any brain cycles on at the time.
>>>
>>
>> Agreed there shouldn't be a problem right now. I do think the locking is
>> appropriate in apply_to_page_range given what other functions also get
>> locked.
>>
>> I really wish this were less asymmetrical though, since it gets hard
>> to reason about. It looks like hotplug_paging will call the ctor,
>> so is there an issue with calling hot-remove on memory that was once
>> hot-added or is that not a concern?
>>
>> Thanks,
>> Laura
> 
> I think the confusion comes from the fact that, in hotplug_paging, we are
> passing pgd_pgtable_alloc as the page allocator for __create_pgd_mapping,
> which always calls the ctor.
> 
> If I got things right (but, please, correct me if I am wrong), we don't
> need the pte lock that the ctor sets up, since - in hotplug - we are
> adding to init_mm.
> 
> Moreover, I am just realizing that calling the dtor while hot-removing
> might create problems when removing memory that *was not* previously
> hotplugged, as we are calling a dtor on something that was never
> ctor'ed. Is that what you were hinting at, Laura?
> 
> Thanks and best regards,
> Andrea
> 

Yes, that was what I was thinking.

Thanks,
Laura

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2017-04-25  0:00 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-04-11 14:54 [PATCH 0/5] Memory hotplug support for arm64 - complete patchset Andrea Reale
2017-04-11 14:54 ` [PATCH 1/5] arm64: memory-hotplug: Add MEMORY_HOTPLUG, MEMORY_HOTREMOVE, MEMORY_PROBE Andrea Reale
2017-04-12  0:20   ` kbuild test robot
2017-04-11 14:54 ` [PATCH 2/5] arm64: defconfig: enable MEMORY_HOTPLUG config options Andrea Reale
2017-04-11 14:55 ` [PATCH 3/5] Memory hotplug support for arm64 platform (v2) Andrea Reale
2017-04-11 15:58   ` Mark Rutland
2017-04-24 16:44     ` Maciej Bielski
2017-04-24 17:35     ` Maciej Bielski
2017-04-11 14:55 ` [PATCH 4/5] Hot-remove implementation for arm64 Andrea Reale
2017-04-11 17:12   ` Mark Rutland
2017-04-14 14:01     ` Andrea Reale
2017-04-18 18:21       ` Mark Rutland
2017-04-18 18:48         ` Ard Biesheuvel
2017-04-19 15:53           ` Laura Abbott
2017-04-21 10:05             ` Andrea Reale
2017-04-24 23:59               ` Laura Abbott
2017-04-21 10:02         ` Andrea Reale
2017-04-11 14:56 ` [PATCH 5/5] Add "remove" probe driver for memory hot-remove Andrea Reale

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).