linux-mm.kvack.org archive mirror
* [PATCH v2 0/9] s390: implement and optimize vmemmap_free()
@ 2020-07-22  9:45 David Hildenbrand
  2020-07-22  9:45 ` [PATCH v2 1/9] s390/vmem: rename vmem_add_mem() to vmem_add_range() David Hildenbrand
                   ` (9 more replies)
  0 siblings, 10 replies; 11+ messages in thread
From: David Hildenbrand @ 2020-07-22  9:45 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-s390, linux-mm, David Hildenbrand, Christian Borntraeger,
	Gerald Schaefer, Heiko Carstens, Vasily Gorbik

This series is based on the latest s390/features branch [1]. It
consolidates vmem_add_range(), vmem_remove_range(), and vmemmap_populate()
into a single, recursive page table walker. It then implements
vmemmap_free() and optimizes it by
- Freeing empty page tables (also done for vmem_remove_range()).
- Handling cases where the vmemmap of a section does not fill huge pages
  completely (e.g., sizeof(struct page) == 56).

vmemmap_free() is currently never used, unless adding standby memory fails
(unlikely). This is relevant for virtio-mem, which adds/removes memory
in memory block/section granularity (it always removes memory in the same
granularity in which it was added).

I gave this a proper test with my virtio-mem prototype (which I will share
in the near future), both with 56 byte memmap per page and 64 byte memmap
per page, with and without huge page support. In both cases, removing
memory (routed through arch_remove_memory()) will result in
- all populated vmemmap pages getting removed/freed
- all applicable page tables for the vmemmap getting removed/freed
- all applicable page tables for the identity mapping getting removed/freed
Unfortunately, I don't have access to bigger environments or to z/VM
(esp. dcss) setups.

This is the basis for real memory hotunplug support for s390x and should
complete my journey into the s390x vmem/vmemmap code for now.

What needs double-checking is TLB flushing. AFAICS, as there are no valid
accesses, doing a single range flush at the end is sufficient, both when
removing the vmemmap pages and when removing the identity mapping.

[1] https://git.kernel.org/pub/scm/linux/kernel/git/s390/linux.git/commit/?h=features

v1 -> v2:
- Convert to a single page table walker named "modify_pagetable()", with
  two helper functions "add_pagetable()" and "remove_pagetable()".

David Hildenbrand (9):
  s390/vmem: rename vmem_add_mem() to vmem_add_range()
  s390/vmem: consolidate vmem_add_range() and vmem_remove_range()
  s390/vmemmap: extend modify_pagetable() to handle vmemmap
  s390/vmemmap: cleanup when vmemmap_populate() fails
  s390/vmemmap: take the vmem_mutex when populating/freeing
  s390/vmem: cleanup empty page tables
  s390/vmemmap: fallback to PTEs if mapping large PMD fails
  s390/vmemmap: remember unused sub-pmd ranges
  s390/vmemmap: avoid memset(PAGE_UNUSED) when adding consecutive
    sections

 arch/s390/mm/vmem.c | 637 ++++++++++++++++++++++++++++++--------------
 1 file changed, 442 insertions(+), 195 deletions(-)

-- 
2.26.2




* [PATCH v2 1/9] s390/vmem: rename vmem_add_mem() to vmem_add_range()
  2020-07-22  9:45 [PATCH v2 0/9] s390: implement and optimize vmemmap_free() David Hildenbrand
@ 2020-07-22  9:45 ` David Hildenbrand
  2020-07-22  9:45 ` [PATCH v2 2/9] s390/vmem: consolidate vmem_add_range() and vmem_remove_range() David Hildenbrand
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: David Hildenbrand @ 2020-07-22  9:45 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-s390, linux-mm, David Hildenbrand, Heiko Carstens,
	Vasily Gorbik, Christian Borntraeger, Gerald Schaefer

Let's match the name to vmem_remove_range().

Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 arch/s390/mm/vmem.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/s390/mm/vmem.c b/arch/s390/mm/vmem.c
index 3b9e71654c37b..66c5333020ead 100644
--- a/arch/s390/mm/vmem.c
+++ b/arch/s390/mm/vmem.c
@@ -57,7 +57,7 @@ pte_t __ref *vmem_pte_alloc(void)
 /*
  * Add a physical memory range to the 1:1 mapping.
  */
-static int vmem_add_mem(unsigned long start, unsigned long size)
+static int vmem_add_range(unsigned long start, unsigned long size)
 {
 	unsigned long pgt_prot, sgt_prot, r3_prot;
 	unsigned long pages4k, pages1m, pages2g;
@@ -308,7 +308,7 @@ int vmem_add_mapping(unsigned long start, unsigned long size)
 		return -ERANGE;
 
 	mutex_lock(&vmem_mutex);
-	ret = vmem_add_mem(start, size);
+	ret = vmem_add_range(start, size);
 	if (ret)
 		vmem_remove_range(start, size);
 	mutex_unlock(&vmem_mutex);
@@ -325,7 +325,7 @@ void __init vmem_map_init(void)
 	struct memblock_region *reg;
 
 	for_each_memblock(memory, reg)
-		vmem_add_mem(reg->base, reg->size);
+		vmem_add_range(reg->base, reg->size);
 	__set_memory((unsigned long)_stext,
 		     (unsigned long)(_etext - _stext) >> PAGE_SHIFT,
 		     SET_MEMORY_RO | SET_MEMORY_X);
-- 
2.26.2




* [PATCH v2 2/9] s390/vmem: consolidate vmem_add_range() and vmem_remove_range()
  2020-07-22  9:45 [PATCH v2 0/9] s390: implement and optimize vmemmap_free() David Hildenbrand
  2020-07-22  9:45 ` [PATCH v2 1/9] s390/vmem: rename vmem_add_mem() to vmem_add_range() David Hildenbrand
@ 2020-07-22  9:45 ` David Hildenbrand
  2020-07-22  9:45 ` [PATCH v2 3/9] s390/vmemmap: extend modify_pagetable() to handle vmemmap David Hildenbrand
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: David Hildenbrand @ 2020-07-22  9:45 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-s390, linux-mm, David Hildenbrand, Heiko Carstens,
	Vasily Gorbik, Christian Borntraeger, Gerald Schaefer

We want to have only a single page table walker and reuse the same
functionality for vmemmap handling. Let's start by consolidating
vmem_add_range() and vmem_remove_range(), converting them into a
recursive implementation.

A recursive implementation makes it easier to expand individual cases
without harming readability. In addition, we minimize traversing the
whole hierarchy over and over again.

One change is that we no longer unmap large PMDs/PUDs when they are not
completely covered by the request. That should never happen with direct
mappings, unless memory were removed in a different granularity than it
was added, which would already be broken.

Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 arch/s390/mm/vmem.c | 317 +++++++++++++++++++++++++++-----------------
 1 file changed, 198 insertions(+), 119 deletions(-)

diff --git a/arch/s390/mm/vmem.c b/arch/s390/mm/vmem.c
index 66c5333020ead..177daf389d391 100644
--- a/arch/s390/mm/vmem.c
+++ b/arch/s390/mm/vmem.c
@@ -54,148 +54,227 @@ pte_t __ref *vmem_pte_alloc(void)
 	return pte;
 }
 
-/*
- * Add a physical memory range to the 1:1 mapping.
- */
-static int vmem_add_range(unsigned long start, unsigned long size)
+static void modify_pte_table(pmd_t *pmd, unsigned long addr, unsigned long end,
+			    bool add)
 {
-	unsigned long pgt_prot, sgt_prot, r3_prot;
-	unsigned long pages4k, pages1m, pages2g;
-	unsigned long end = start + size;
-	unsigned long address = start;
-	pgd_t *pg_dir;
-	p4d_t *p4_dir;
-	pud_t *pu_dir;
-	pmd_t *pm_dir;
-	pte_t *pt_dir;
-	int ret = -ENOMEM;
+	unsigned long prot, pages = 0;
+	pte_t *pte;
 
-	pgt_prot = pgprot_val(PAGE_KERNEL);
-	sgt_prot = pgprot_val(SEGMENT_KERNEL);
-	r3_prot = pgprot_val(REGION3_KERNEL);
-	if (!MACHINE_HAS_NX) {
-		pgt_prot &= ~_PAGE_NOEXEC;
-		sgt_prot &= ~_SEGMENT_ENTRY_NOEXEC;
-		r3_prot &= ~_REGION_ENTRY_NOEXEC;
+	prot = pgprot_val(PAGE_KERNEL);
+	if (!MACHINE_HAS_NX)
+		prot &= ~_PAGE_NOEXEC;
+
+	pte = pte_offset_kernel(pmd, addr);
+	for (; addr < end; addr += PAGE_SIZE, pte++) {
+		if (!add) {
+			if (pte_none(*pte))
+				continue;
+			pte_clear(&init_mm, addr, pte);
+		} else if (pte_none(*pte)) {
+			pte_val(*pte) = addr | prot;
+		} else
+			continue;
+
+		pages++;
 	}
-	pages4k = pages1m = pages2g = 0;
-	while (address < end) {
-		pg_dir = pgd_offset_k(address);
-		if (pgd_none(*pg_dir)) {
-			p4_dir = vmem_crst_alloc(_REGION2_ENTRY_EMPTY);
-			if (!p4_dir)
-				goto out;
-			pgd_populate(&init_mm, pg_dir, p4_dir);
-		}
-		p4_dir = p4d_offset(pg_dir, address);
-		if (p4d_none(*p4_dir)) {
-			pu_dir = vmem_crst_alloc(_REGION3_ENTRY_EMPTY);
-			if (!pu_dir)
+
+	update_page_count(PG_DIRECT_MAP_4K, add ? pages : -pages);
+}
+
+static int modify_pmd_table(pud_t *pud, unsigned long addr, unsigned long end,
+			    bool add)
+{
+	unsigned long next, prot, pages = 0;
+	int ret = -ENOMEM;
+	pmd_t *pmd;
+	pte_t *pte;
+
+	prot = pgprot_val(SEGMENT_KERNEL);
+	if (!MACHINE_HAS_NX)
+		prot &= ~_SEGMENT_ENTRY_NOEXEC;
+
+	pmd = pmd_offset(pud, addr);
+	for (; addr < end; addr = next, pmd++) {
+		next = pmd_addr_end(addr, end);
+
+		if (!add) {
+			if (pmd_none(*pmd))
+				continue;
+			if (pmd_large(*pmd) && !add) {
+				if (IS_ALIGNED(addr, PMD_SIZE) &&
+				    IS_ALIGNED(next, PMD_SIZE)) {
+					pmd_clear(pmd);
+					pages++;
+				}
+				continue;
+			}
+		} else if (pmd_none(*pmd)) {
+			if (IS_ALIGNED(addr, PMD_SIZE) &&
+			    IS_ALIGNED(next, PMD_SIZE) &&
+			    MACHINE_HAS_EDAT1 && addr &&
+			    !debug_pagealloc_enabled()) {
+				pmd_val(*pmd) = addr | prot;
+				pages++;
+				continue;
+			}
+			pte = vmem_pte_alloc();
+			if (!pte)
 				goto out;
-			p4d_populate(&init_mm, p4_dir, pu_dir);
-		}
-		pu_dir = pud_offset(p4_dir, address);
-		if (MACHINE_HAS_EDAT2 && pud_none(*pu_dir) && address &&
-		    !(address & ~PUD_MASK) && (address + PUD_SIZE <= end) &&
-		     !debug_pagealloc_enabled()) {
-			pud_val(*pu_dir) = address | r3_prot;
-			address += PUD_SIZE;
-			pages2g++;
+			pmd_populate(&init_mm, pmd, pte);
+		} else if (pmd_large(*pmd))
 			continue;
-		}
-		if (pud_none(*pu_dir)) {
-			pm_dir = vmem_crst_alloc(_SEGMENT_ENTRY_EMPTY);
-			if (!pm_dir)
+
+		modify_pte_table(pmd, addr, next, add);
+	}
+	ret = 0;
+out:
+	update_page_count(PG_DIRECT_MAP_1M, add ? pages : -pages);
+	return ret;
+}
+
+static int modify_pud_table(p4d_t *p4d, unsigned long addr, unsigned long end,
+			    bool add)
+{
+	unsigned long next, prot, pages = 0;
+	int ret = -ENOMEM;
+	pud_t *pud;
+	pmd_t *pmd;
+
+	prot = pgprot_val(REGION3_KERNEL);
+	if (!MACHINE_HAS_NX)
+		prot &= ~_REGION_ENTRY_NOEXEC;
+
+	pud = pud_offset(p4d, addr);
+	for (; addr < end; addr = next, pud++) {
+		next = pud_addr_end(addr, end);
+
+		if (!add) {
+			if (pud_none(*pud))
+				continue;
+			if (pud_large(*pud)) {
+				if (IS_ALIGNED(addr, PUD_SIZE) &&
+				    IS_ALIGNED(next, PUD_SIZE)) {
+					pud_clear(pud);
+					pages++;
+				}
+				continue;
+			}
+		} else if (pud_none(*pud)) {
+			if (IS_ALIGNED(addr, PUD_SIZE) &&
+			    IS_ALIGNED(next, PUD_SIZE) &&
+			    MACHINE_HAS_EDAT2 && addr &&
+			    !debug_pagealloc_enabled()) {
+				pud_val(*pud) = addr | prot;
+				pages++;
+				continue;
+			}
+			pmd = vmem_crst_alloc(_SEGMENT_ENTRY_EMPTY);
+			if (!pmd)
 				goto out;
-			pud_populate(&init_mm, pu_dir, pm_dir);
-		}
-		pm_dir = pmd_offset(pu_dir, address);
-		if (MACHINE_HAS_EDAT1 && pmd_none(*pm_dir) && address &&
-		    !(address & ~PMD_MASK) && (address + PMD_SIZE <= end) &&
-		    !debug_pagealloc_enabled()) {
-			pmd_val(*pm_dir) = address | sgt_prot;
-			address += PMD_SIZE;
-			pages1m++;
+			pud_populate(&init_mm, pud, pmd);
+		} else if (pud_large(*pud))
 			continue;
+
+		ret = modify_pmd_table(pud, addr, next, add);
+		if (ret)
+			goto out;
+	}
+	ret = 0;
+out:
+	update_page_count(PG_DIRECT_MAP_2G, add ? pages : -pages);
+	return ret;
+}
+
+static int modify_p4d_table(pgd_t *pgd, unsigned long addr, unsigned long end,
+			    bool add)
+{
+	unsigned long next;
+	int ret = -ENOMEM;
+	p4d_t *p4d;
+	pud_t *pud;
+
+	p4d = p4d_offset(pgd, addr);
+	for (; addr < end; addr = next, p4d++) {
+		next = p4d_addr_end(addr, end);
+
+		if (!add) {
+			if (p4d_none(*p4d))
+				continue;
+		} else if (p4d_none(*p4d)) {
+			pud = vmem_crst_alloc(_REGION3_ENTRY_EMPTY);
+			if (!pud)
+				goto out;
 		}
-		if (pmd_none(*pm_dir)) {
-			pt_dir = vmem_pte_alloc();
-			if (!pt_dir)
+
+		ret = modify_pud_table(p4d, addr, next, add);
+		if (ret)
+			goto out;
+	}
+	ret = 0;
+out:
+	return ret;
+}
+
+static int modify_pagetable(unsigned long start, unsigned long end, bool add)
+{
+	unsigned long addr, next;
+	int ret = -ENOMEM;
+	pgd_t *pgd;
+	p4d_t *p4d;
+
+	if (WARN_ON_ONCE(!PAGE_ALIGNED(start | end)))
+		return -EINVAL;
+
+	for (addr = start; addr < end; addr = next) {
+		next = pgd_addr_end(addr, end);
+		pgd = pgd_offset_k(addr);
+
+		if (!add) {
+			if (pgd_none(*pgd))
+				continue;
+		} else if (pgd_none(*pgd)) {
+			p4d = vmem_crst_alloc(_REGION2_ENTRY_EMPTY);
+			if (!p4d)
 				goto out;
-			pmd_populate(&init_mm, pm_dir, pt_dir);
+			pgd_populate(&init_mm, pgd, p4d);
 		}
 
-		pt_dir = pte_offset_kernel(pm_dir, address);
-		pte_val(*pt_dir) = address | pgt_prot;
-		address += PAGE_SIZE;
-		pages4k++;
+		ret = modify_p4d_table(pgd, addr, next, add);
+		if (ret)
+			goto out;
 	}
 	ret = 0;
 out:
-	update_page_count(PG_DIRECT_MAP_4K, pages4k);
-	update_page_count(PG_DIRECT_MAP_1M, pages1m);
-	update_page_count(PG_DIRECT_MAP_2G, pages2g);
+	if (!add)
+		flush_tlb_kernel_range(start, end);
 	return ret;
 }
 
+static int add_pagetable(unsigned long start, unsigned long end)
+{
+	return modify_pagetable(start, end, true);
+}
+
+static int remove_pagetable(unsigned long start, unsigned long end)
+{
+	return modify_pagetable(start, end, false);
+}
+
+/*
+ * Add a physical memory range to the 1:1 mapping.
+ */
+static int vmem_add_range(unsigned long start, unsigned long size)
+{
+	return add_pagetable(start, start + size);
+}
+
 /*
  * Remove a physical memory range from the 1:1 mapping.
  * Currently only invalidates page table entries.
  */
 static void vmem_remove_range(unsigned long start, unsigned long size)
 {
-	unsigned long pages4k, pages1m, pages2g;
-	unsigned long end = start + size;
-	unsigned long address = start;
-	pgd_t *pg_dir;
-	p4d_t *p4_dir;
-	pud_t *pu_dir;
-	pmd_t *pm_dir;
-	pte_t *pt_dir;
-
-	pages4k = pages1m = pages2g = 0;
-	while (address < end) {
-		pg_dir = pgd_offset_k(address);
-		if (pgd_none(*pg_dir)) {
-			address += PGDIR_SIZE;
-			continue;
-		}
-		p4_dir = p4d_offset(pg_dir, address);
-		if (p4d_none(*p4_dir)) {
-			address += P4D_SIZE;
-			continue;
-		}
-		pu_dir = pud_offset(p4_dir, address);
-		if (pud_none(*pu_dir)) {
-			address += PUD_SIZE;
-			continue;
-		}
-		if (pud_large(*pu_dir)) {
-			pud_clear(pu_dir);
-			address += PUD_SIZE;
-			pages2g++;
-			continue;
-		}
-		pm_dir = pmd_offset(pu_dir, address);
-		if (pmd_none(*pm_dir)) {
-			address += PMD_SIZE;
-			continue;
-		}
-		if (pmd_large(*pm_dir)) {
-			pmd_clear(pm_dir);
-			address += PMD_SIZE;
-			pages1m++;
-			continue;
-		}
-		pt_dir = pte_offset_kernel(pm_dir, address);
-		pte_clear(&init_mm, address, pt_dir);
-		address += PAGE_SIZE;
-		pages4k++;
-	}
-	flush_tlb_kernel_range(start, end);
-	update_page_count(PG_DIRECT_MAP_4K, -pages4k);
-	update_page_count(PG_DIRECT_MAP_1M, -pages1m);
-	update_page_count(PG_DIRECT_MAP_2G, -pages2g);
+	remove_pagetable(start, start + size);
 }
 
 /*
-- 
2.26.2




* [PATCH v2 3/9] s390/vmemmap: extend modify_pagetable() to handle vmemmap
  2020-07-22  9:45 [PATCH v2 0/9] s390: implement and optimize vmemmap_free() David Hildenbrand
  2020-07-22  9:45 ` [PATCH v2 1/9] s390/vmem: rename vmem_add_mem() to vmem_add_range() David Hildenbrand
  2020-07-22  9:45 ` [PATCH v2 2/9] s390/vmem: consolidate vmem_add_range() and vmem_remove_range() David Hildenbrand
@ 2020-07-22  9:45 ` David Hildenbrand
  2020-07-22  9:45 ` [PATCH v2 4/9] s390/vmemmap: cleanup when vmemmap_populate() fails David Hildenbrand
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: David Hildenbrand @ 2020-07-22  9:45 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-s390, linux-mm, David Hildenbrand, Heiko Carstens,
	Vasily Gorbik, Christian Borntraeger, Gerald Schaefer

Extend our shiny new modify_pagetable() to handle !direct (vmemmap)
mappings. Convert vmemmap_populate() and implement vmemmap_free().
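As a rough sketch of what the new flag distinguishes (hypothetical userspace
code, not the hunks below): a direct (1:1) leaf simply encodes the address
being mapped, while a vmemmap leaf needs freshly allocated backing memory:

#include <stdbool.h>
#include <stdlib.h>

static int set_leaf(unsigned long *leaf, unsigned long addr, bool direct)
{
	if (direct) {
		*leaf = addr;		/* stands in for pte_val(*pte) = addr | prot */
		return 0;
	}

	void *page = calloc(1, 4096);	/* stands in for vmemmap_alloc_block() */

	if (!page)
		return -1;		/* would be -ENOMEM */
	*leaf = (unsigned long)page;	/* stands in for __pa(new_page) | prot */
	return 0;
}

int main(void)
{
	unsigned long leaf;

	return set_leaf(&leaf, 0x100000, true) || set_leaf(&leaf, 0x100000, false);
}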

Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 arch/s390/mm/vmem.c | 181 +++++++++++++++++++-------------------------
 1 file changed, 76 insertions(+), 105 deletions(-)

diff --git a/arch/s390/mm/vmem.c b/arch/s390/mm/vmem.c
index 177daf389d391..43fe1e2eb90ea 100644
--- a/arch/s390/mm/vmem.c
+++ b/arch/s390/mm/vmem.c
@@ -29,6 +29,15 @@ static void __ref *vmem_alloc_pages(unsigned int order)
 	return (void *) memblock_phys_alloc(size, size);
 }
 
+static void vmem_free_pages(unsigned long addr, int order)
+{
+	/* We don't expect boot memory to be removed ever. */
+	if (!slab_is_available() ||
+	    WARN_ON_ONCE(PageReserved(phys_to_page(addr))))
+		return;
+	free_pages(addr, order);
+}
+
 void *vmem_crst_alloc(unsigned long val)
 {
 	unsigned long *table;
@@ -54,10 +63,12 @@ pte_t __ref *vmem_pte_alloc(void)
 	return pte;
 }
 
-static void modify_pte_table(pmd_t *pmd, unsigned long addr, unsigned long end,
-			    bool add)
+/* __ref: we'll only call vmemmap_alloc_block() via vmemmap_populate() */
+static int __ref modify_pte_table(pmd_t *pmd, unsigned long addr,
+				  unsigned long end, bool add, bool direct)
 {
 	unsigned long prot, pages = 0;
+	int ret = -ENOMEM;
 	pte_t *pte;
 
 	prot = pgprot_val(PAGE_KERNEL);
@@ -69,20 +80,34 @@ static void modify_pte_table(pmd_t *pmd, unsigned long addr, unsigned long end,
 		if (!add) {
 			if (pte_none(*pte))
 				continue;
+			if (!direct)
+				vmem_free_pages(pfn_to_phys(pte_pfn(*pte)), 0);
 			pte_clear(&init_mm, addr, pte);
 		} else if (pte_none(*pte)) {
-			pte_val(*pte) = addr | prot;
+			if (!direct) {
+				void *new_page = vmemmap_alloc_block(PAGE_SIZE,
+								     NUMA_NO_NODE);
+
+				if (!new_page)
+					goto out;
+				pte_val(*pte) = __pa(new_page) | prot;
+			} else
+				pte_val(*pte) = addr | prot;
 		} else
 			continue;
 
 		pages++;
 	}
-
-	update_page_count(PG_DIRECT_MAP_4K, add ? pages : -pages);
+	ret = 0;
+out:
+	if (direct)
+		update_page_count(PG_DIRECT_MAP_4K, add ? pages : -pages);
+	return ret;
 }
 
-static int modify_pmd_table(pud_t *pud, unsigned long addr, unsigned long end,
-			    bool add)
+/* __ref: we'll only call vmemmap_alloc_block() via vmemmap_populate() */
+static int __ref modify_pmd_table(pud_t *pud, unsigned long addr,
+				  unsigned long end, bool add, bool direct)
 {
 	unsigned long next, prot, pages = 0;
 	int ret = -ENOMEM;
@@ -103,6 +128,9 @@ static int modify_pmd_table(pud_t *pud, unsigned long addr, unsigned long end,
 			if (pmd_large(*pmd) && !add) {
 				if (IS_ALIGNED(addr, PMD_SIZE) &&
 				    IS_ALIGNED(next, PMD_SIZE)) {
+					if (!direct)
+						vmem_free_pages(pmd_deref(*pmd),
+								get_order(PMD_SIZE));
 					pmd_clear(pmd);
 					pages++;
 				}
@@ -111,11 +139,27 @@ static int modify_pmd_table(pud_t *pud, unsigned long addr, unsigned long end,
 		} else if (pmd_none(*pmd)) {
 			if (IS_ALIGNED(addr, PMD_SIZE) &&
 			    IS_ALIGNED(next, PMD_SIZE) &&
-			    MACHINE_HAS_EDAT1 && addr &&
+			    MACHINE_HAS_EDAT1 && addr && direct &&
 			    !debug_pagealloc_enabled()) {
 				pmd_val(*pmd) = addr | prot;
 				pages++;
 				continue;
+			} else if (!direct && MACHINE_HAS_EDAT1) {
+				void *new_page;
+
+				/*
+				 * Use 1MB frames for vmemmap if available. We
+				 * always use large frames even if they are only
+				 * partially used. Otherwise we would have also
+				 * page tables since vmemmap_populate gets
+				 * called for each section separately.
+				 */
+				new_page = vmemmap_alloc_block(PMD_SIZE,
+							       NUMA_NO_NODE);
+				if (!new_page)
+					goto out;
+				pmd_val(*pmd) = __pa(new_page) | prot;
+				continue;
 			}
 			pte = vmem_pte_alloc();
 			if (!pte)
@@ -124,16 +168,19 @@ static int modify_pmd_table(pud_t *pud, unsigned long addr, unsigned long end,
 		} else if (pmd_large(*pmd))
 			continue;
 
-		modify_pte_table(pmd, addr, next, add);
+		ret = modify_pte_table(pmd, addr, next, add, direct);
+		if (ret)
+			goto out;
 	}
 	ret = 0;
 out:
-	update_page_count(PG_DIRECT_MAP_1M, add ? pages : -pages);
+	if (direct)
+		update_page_count(PG_DIRECT_MAP_1M, add ? pages : -pages);
 	return ret;
 }
 
 static int modify_pud_table(p4d_t *p4d, unsigned long addr, unsigned long end,
-			    bool add)
+			    bool add, bool direct)
 {
 	unsigned long next, prot, pages = 0;
 	int ret = -ENOMEM;
@@ -162,7 +209,7 @@ static int modify_pud_table(p4d_t *p4d, unsigned long addr, unsigned long end,
 		} else if (pud_none(*pud)) {
 			if (IS_ALIGNED(addr, PUD_SIZE) &&
 			    IS_ALIGNED(next, PUD_SIZE) &&
-			    MACHINE_HAS_EDAT2 && addr &&
+			    MACHINE_HAS_EDAT2 && addr && direct &&
 			    !debug_pagealloc_enabled()) {
 				pud_val(*pud) = addr | prot;
 				pages++;
@@ -175,18 +222,19 @@ static int modify_pud_table(p4d_t *p4d, unsigned long addr, unsigned long end,
 		} else if (pud_large(*pud))
 			continue;
 
-		ret = modify_pmd_table(pud, addr, next, add);
+		ret = modify_pmd_table(pud, addr, next, add, direct);
 		if (ret)
 			goto out;
 	}
 	ret = 0;
 out:
-	update_page_count(PG_DIRECT_MAP_2G, add ? pages : -pages);
+	if (direct)
+		update_page_count(PG_DIRECT_MAP_2G, add ? pages : -pages);
 	return ret;
 }
 
 static int modify_p4d_table(pgd_t *pgd, unsigned long addr, unsigned long end,
-			    bool add)
+			    bool add, bool direct)
 {
 	unsigned long next;
 	int ret = -ENOMEM;
@@ -206,7 +254,7 @@ static int modify_p4d_table(pgd_t *pgd, unsigned long addr, unsigned long end,
 				goto out;
 		}
 
-		ret = modify_pud_table(p4d, addr, next, add);
+		ret = modify_pud_table(p4d, addr, next, add, direct);
 		if (ret)
 			goto out;
 	}
@@ -215,7 +263,8 @@ static int modify_p4d_table(pgd_t *pgd, unsigned long addr, unsigned long end,
 	return ret;
 }
 
-static int modify_pagetable(unsigned long start, unsigned long end, bool add)
+static int modify_pagetable(unsigned long start, unsigned long end, bool add,
+			    bool direct)
 {
 	unsigned long addr, next;
 	int ret = -ENOMEM;
@@ -239,7 +288,7 @@ static int modify_pagetable(unsigned long start, unsigned long end, bool add)
 			pgd_populate(&init_mm, pgd, p4d);
 		}
 
-		ret = modify_p4d_table(pgd, addr, next, add);
+		ret = modify_p4d_table(pgd, addr, next, add, direct);
 		if (ret)
 			goto out;
 	}
@@ -250,14 +299,14 @@ static int modify_pagetable(unsigned long start, unsigned long end, bool add)
 	return ret;
 }
 
-static int add_pagetable(unsigned long start, unsigned long end)
+static int add_pagetable(unsigned long start, unsigned long end, bool direct)
 {
-	return modify_pagetable(start, end, true);
+	return modify_pagetable(start, end, true, direct);
 }
 
-static int remove_pagetable(unsigned long start, unsigned long end)
+static int remove_pagetable(unsigned long start, unsigned long end, bool direct)
 {
-	return modify_pagetable(start, end, false);
+	return modify_pagetable(start, end, false, direct);
 }
 
 /*
@@ -265,7 +314,7 @@ static int remove_pagetable(unsigned long start, unsigned long end)
  */
 static int vmem_add_range(unsigned long start, unsigned long size)
 {
-	return add_pagetable(start, start + size);
+	return add_pagetable(start, start + size, true);
 }
 
 /*
@@ -274,7 +323,7 @@ static int vmem_add_range(unsigned long start, unsigned long size)
  */
 static void vmem_remove_range(unsigned long start, unsigned long size)
 {
-	remove_pagetable(start, start + size);
+	remove_pagetable(start, start + size, true);
 }
 
 /*
@@ -283,92 +332,14 @@ static void vmem_remove_range(unsigned long start, unsigned long size)
 int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
 		struct vmem_altmap *altmap)
 {
-	unsigned long pgt_prot, sgt_prot;
-	unsigned long address = start;
-	pgd_t *pg_dir;
-	p4d_t *p4_dir;
-	pud_t *pu_dir;
-	pmd_t *pm_dir;
-	pte_t *pt_dir;
-	int ret = -ENOMEM;
-
-	pgt_prot = pgprot_val(PAGE_KERNEL);
-	sgt_prot = pgprot_val(SEGMENT_KERNEL);
-	if (!MACHINE_HAS_NX) {
-		pgt_prot &= ~_PAGE_NOEXEC;
-		sgt_prot &= ~_SEGMENT_ENTRY_NOEXEC;
-	}
-	for (address = start; address < end;) {
-		pg_dir = pgd_offset_k(address);
-		if (pgd_none(*pg_dir)) {
-			p4_dir = vmem_crst_alloc(_REGION2_ENTRY_EMPTY);
-			if (!p4_dir)
-				goto out;
-			pgd_populate(&init_mm, pg_dir, p4_dir);
-		}
-
-		p4_dir = p4d_offset(pg_dir, address);
-		if (p4d_none(*p4_dir)) {
-			pu_dir = vmem_crst_alloc(_REGION3_ENTRY_EMPTY);
-			if (!pu_dir)
-				goto out;
-			p4d_populate(&init_mm, p4_dir, pu_dir);
-		}
-
-		pu_dir = pud_offset(p4_dir, address);
-		if (pud_none(*pu_dir)) {
-			pm_dir = vmem_crst_alloc(_SEGMENT_ENTRY_EMPTY);
-			if (!pm_dir)
-				goto out;
-			pud_populate(&init_mm, pu_dir, pm_dir);
-		}
-
-		pm_dir = pmd_offset(pu_dir, address);
-		if (pmd_none(*pm_dir)) {
-			/* Use 1MB frames for vmemmap if available. We always
-			 * use large frames even if they are only partially
-			 * used.
-			 * Otherwise we would have also page tables since
-			 * vmemmap_populate gets called for each section
-			 * separately. */
-			if (MACHINE_HAS_EDAT1) {
-				void *new_page;
-
-				new_page = vmemmap_alloc_block(PMD_SIZE, node);
-				if (!new_page)
-					goto out;
-				pmd_val(*pm_dir) = __pa(new_page) | sgt_prot;
-				address = (address + PMD_SIZE) & PMD_MASK;
-				continue;
-			}
-			pt_dir = vmem_pte_alloc();
-			if (!pt_dir)
-				goto out;
-			pmd_populate(&init_mm, pm_dir, pt_dir);
-		} else if (pmd_large(*pm_dir)) {
-			address = (address + PMD_SIZE) & PMD_MASK;
-			continue;
-		}
-
-		pt_dir = pte_offset_kernel(pm_dir, address);
-		if (pte_none(*pt_dir)) {
-			void *new_page;
-
-			new_page = vmemmap_alloc_block(PAGE_SIZE, node);
-			if (!new_page)
-				goto out;
-			pte_val(*pt_dir) = __pa(new_page) | pgt_prot;
-		}
-		address += PAGE_SIZE;
-	}
-	ret = 0;
-out:
-	return ret;
+	/* We don't care about the node, just use NUMA_NO_NODE on allocations */
+	return add_pagetable(start, end, false);
 }
 
 void vmemmap_free(unsigned long start, unsigned long end,
 		struct vmem_altmap *altmap)
 {
+	remove_pagetable(start, end, false);
 }
 
 void vmem_remove_mapping(unsigned long start, unsigned long size)
-- 
2.26.2




* [PATCH v2 4/9] s390/vmemmap: cleanup when vmemmap_populate() fails
  2020-07-22  9:45 [PATCH v2 0/9] s390: implement and optimize vmemmap_free() David Hildenbrand
                   ` (2 preceding siblings ...)
  2020-07-22  9:45 ` [PATCH v2 3/9] s390/vmemmap: extend modify_pagetable() to handle vmemmap David Hildenbrand
@ 2020-07-22  9:45 ` David Hildenbrand
  2020-07-22  9:45 ` [PATCH v2 5/9] s390/vmemmap: take the vmem_mutex when populating/freeing David Hildenbrand
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: David Hildenbrand @ 2020-07-22  9:45 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-s390, linux-mm, David Hildenbrand, Heiko Carstens,
	Vasily Gorbik, Christian Borntraeger, Gerald Schaefer

Clean up what we partially added in case vmemmap_populate() fails. For
vmem, this is already handled by vmem_add_mapping().

Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 arch/s390/mm/vmem.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/arch/s390/mm/vmem.c b/arch/s390/mm/vmem.c
index 43fe1e2eb90ea..be32a38bb91fd 100644
--- a/arch/s390/mm/vmem.c
+++ b/arch/s390/mm/vmem.c
@@ -332,8 +332,13 @@ static void vmem_remove_range(unsigned long start, unsigned long size)
 int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
 		struct vmem_altmap *altmap)
 {
+	int ret;
+
 	/* We don't care about the node, just use NUMA_NO_NODE on allocations */
-	return add_pagetable(start, end, false);
+	ret = add_pagetable(start, end, false);
+	if (ret)
+		remove_pagetable(start, end, false);
+	return ret;
 }
 
 void vmemmap_free(unsigned long start, unsigned long end,
-- 
2.26.2




* [PATCH v2 5/9] s390/vmemmap: take the vmem_mutex when populating/freeing
  2020-07-22  9:45 [PATCH v2 0/9] s390: implement and optimize vmemmap_free() David Hildenbrand
                   ` (3 preceding siblings ...)
  2020-07-22  9:45 ` [PATCH v2 4/9] s390/vmemmap: cleanup when vmemmap_populate() fails David Hildenbrand
@ 2020-07-22  9:45 ` David Hildenbrand
  2020-07-22  9:45 ` [PATCH v2 6/9] s390/vmem: cleanup empty page tables David Hildenbrand
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: David Hildenbrand @ 2020-07-22  9:45 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-s390, linux-mm, David Hildenbrand, Heiko Carstens,
	Vasily Gorbik, Christian Borntraeger, Gerald Schaefer

Let's synchronize all accesses to the 1:1 and vmemmap mappings. This will
be especially relevant when wanting to clean up empty page tables that could
be shared by both. Avoid races when removing tables that might be just
about to get reused.

Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 arch/s390/mm/vmem.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/arch/s390/mm/vmem.c b/arch/s390/mm/vmem.c
index be32a38bb91fd..a2b79681df69d 100644
--- a/arch/s390/mm/vmem.c
+++ b/arch/s390/mm/vmem.c
@@ -334,17 +334,21 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
 {
 	int ret;
 
+	mutex_lock(&vmem_mutex);
 	/* We don't care about the node, just use NUMA_NO_NODE on allocations */
 	ret = add_pagetable(start, end, false);
 	if (ret)
 		remove_pagetable(start, end, false);
+	mutex_unlock(&vmem_mutex);
 	return ret;
 }
 
 void vmemmap_free(unsigned long start, unsigned long end,
 		struct vmem_altmap *altmap)
 {
+	mutex_lock(&vmem_mutex);
 	remove_pagetable(start, end, false);
+	mutex_unlock(&vmem_mutex);
 }
 
 void vmem_remove_mapping(unsigned long start, unsigned long size)
-- 
2.26.2




* [PATCH v2 6/9] s390/vmem: cleanup empty page tables
  2020-07-22  9:45 [PATCH v2 0/9] s390: implement and optimize vmemmap_free() David Hildenbrand
                   ` (4 preceding siblings ...)
  2020-07-22  9:45 ` [PATCH v2 5/9] s390/vmemmap: take the vmem_mutex when populating/freeing David Hildenbrand
@ 2020-07-22  9:45 ` David Hildenbrand
  2020-07-22  9:45 ` [PATCH v2 7/9] s390/vmemmap: fallback to PTEs if mapping large PMD fails David Hildenbrand
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: David Hildenbrand @ 2020-07-22  9:45 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-s390, linux-mm, David Hildenbrand, Heiko Carstens,
	Vasily Gorbik, Christian Borntraeger, Gerald Schaefer

Let's clean up empty page tables. Consider only page tables that fully
fall into the identity mapping and the vmemmap range.

As there are no valid accesses to vmem/vmemmap within non-populated ranges,
the single TLB flush at the end should be sufficient.
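As a rough userspace analogy (hypothetical names, not the code below): a
table may only be freed once a scan shows that every slot in it is empty,
which is what the new try_free_*_table() helpers do level by level:

#include <stdbool.h>
#include <stdlib.h>

#define ENTRIES 512		/* slots per toy table */

static bool table_is_empty(void **table)
{
	for (int i = 0; i < ENTRIES; i++)
		if (table[i])	/* still a populated entry -> keep the table */
			return false;
	return true;
}

static void try_free_table(void ***slot)
{
	if (*slot && table_is_empty(*slot)) {
		free(*slot);	/* vmem_pte_free()/vmem_free_pages() in the patch */
		*slot = NULL;	/* pmd_clear()/pud_clear()/... in the patch */
	}
}

int main(void)
{
	void **pte_table = calloc(ENTRIES, sizeof(void *));

	try_free_table(&pte_table);	/* all entries empty -> freed and cleared */
	return pte_table != NULL;	/* 0 on success */
}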

Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 arch/s390/mm/vmem.c | 102 +++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 101 insertions(+), 1 deletion(-)

diff --git a/arch/s390/mm/vmem.c b/arch/s390/mm/vmem.c
index a2b79681df69d..b831f9f9130aa 100644
--- a/arch/s390/mm/vmem.c
+++ b/arch/s390/mm/vmem.c
@@ -63,6 +63,15 @@ pte_t __ref *vmem_pte_alloc(void)
 	return pte;
 }
 
+static void vmem_pte_free(unsigned long *table)
+{
+	/* We don't expect boot memory to be removed ever. */
+	if (!slab_is_available() ||
+	    WARN_ON_ONCE(PageReserved(virt_to_page(table))))
+		return;
+	page_table_free(&init_mm, table);
+}
+
 /* __ref: we'll only call vmemmap_alloc_block() via vmemmap_populate() */
 static int __ref modify_pte_table(pmd_t *pmd, unsigned long addr,
 				  unsigned long end, bool add, bool direct)
@@ -105,6 +114,21 @@ static int __ref modify_pte_table(pmd_t *pmd, unsigned long addr,
 	return ret;
 }
 
+static void try_free_pte_table(pmd_t *pmd, unsigned long start)
+{
+	pte_t *pte;
+	int i;
+
+	/* We can safely assume this is fully in 1:1 mapping & vmemmap area */
+	pte = pte_offset_kernel(pmd, start);
+	for (i = 0; i < PTRS_PER_PTE; i++, pte++)
+		if (!pte_none(*pte))
+			return;
+
+	vmem_pte_free(__va(pmd_deref(*pmd)));
+	pmd_clear(pmd);
+}
+
 /* __ref: we'll only call vmemmap_alloc_block() via vmemmap_populate() */
 static int __ref modify_pmd_table(pud_t *pud, unsigned long addr,
 				  unsigned long end, bool add, bool direct)
@@ -171,6 +195,8 @@ static int __ref modify_pmd_table(pud_t *pud, unsigned long addr,
 		ret = modify_pte_table(pmd, addr, next, add, direct);
 		if (ret)
 			goto out;
+		if (!add)
+			try_free_pte_table(pmd, addr & PMD_MASK);
 	}
 	ret = 0;
 out:
@@ -179,6 +205,29 @@ static int __ref modify_pmd_table(pud_t *pud, unsigned long addr,
 	return ret;
 }
 
+static void try_free_pmd_table(pud_t *pud, unsigned long start)
+{
+	const unsigned long end = start + PUD_SIZE;
+	pmd_t *pmd;
+	int i;
+
+	/* Don't mess with any tables not fully in 1:1 mapping & vmemmap area */
+	if (end > VMALLOC_START)
+		return;
+#ifdef CONFIG_KASAN
+	if (start < KASAN_SHADOW_END && KASAN_SHADOW_START > end)
+		return;
+#endif
+
+	pmd = pmd_offset(pud, start);
+	for (i = 0; i < PTRS_PER_PMD; i++, pmd++)
+		if (!pmd_none(*pmd))
+			return;
+
+	vmem_free_pages(pud_deref(*pud), CRST_ALLOC_ORDER);
+	pud_clear(pud);
+}
+
 static int modify_pud_table(p4d_t *p4d, unsigned long addr, unsigned long end,
 			    bool add, bool direct)
 {
@@ -225,6 +274,8 @@ static int modify_pud_table(p4d_t *p4d, unsigned long addr, unsigned long end,
 		ret = modify_pmd_table(pud, addr, next, add, direct);
 		if (ret)
 			goto out;
+		if (!add)
+			try_free_pmd_table(pud, addr & PUD_MASK);
 	}
 	ret = 0;
 out:
@@ -233,6 +284,29 @@ static int modify_pud_table(p4d_t *p4d, unsigned long addr, unsigned long end,
 	return ret;
 }
 
+static void try_free_pud_table(p4d_t *p4d, unsigned long start)
+{
+	const unsigned long end = start + P4D_SIZE;
+	pud_t *pud;
+	int i;
+
+	/* Don't mess with any tables not fully in 1:1 mapping & vmemmap area */
+	if (end > VMALLOC_START)
+		return;
+#ifdef CONFIG_KASAN
+	if (start < KASAN_SHADOW_END && KASAN_SHADOW_START > end)
+		return;
+#endif
+
+	pud = pud_offset(p4d, start);
+	for (i = 0; i < PTRS_PER_PUD; i++, pud++)
+		if (!pud_none(*pud))
+			return;
+
+	vmem_free_pages(p4d_deref(*p4d), CRST_ALLOC_ORDER);
+	p4d_clear(p4d);
+}
+
 static int modify_p4d_table(pgd_t *pgd, unsigned long addr, unsigned long end,
 			    bool add, bool direct)
 {
@@ -257,12 +331,37 @@ static int modify_p4d_table(pgd_t *pgd, unsigned long addr, unsigned long end,
 		ret = modify_pud_table(p4d, addr, next, add, direct);
 		if (ret)
 			goto out;
+		if (!add)
+			try_free_pud_table(p4d, addr & P4D_MASK);
 	}
 	ret = 0;
 out:
 	return ret;
 }
 
+static void try_free_p4d_table(pgd_t *pgd, unsigned long start)
+{
+	const unsigned long end = start + PGDIR_SIZE;
+	p4d_t *p4d;
+	int i;
+
+	/* Don't mess with any tables not fully in 1:1 mapping & vmemmap area */
+	if (end > VMALLOC_START)
+		return;
+#ifdef CONFIG_KASAN
+	if (start < KASAN_SHADOW_END && KASAN_SHADOW_START > end)
+		return;
+#endif
+
+	p4d = p4d_offset(pgd, start);
+	for (i = 0; i < PTRS_PER_P4D; i++, p4d++)
+		if (!p4d_none(*p4d))
+			return;
+
+	vmem_free_pages(pgd_deref(*pgd), CRST_ALLOC_ORDER);
+	pgd_clear(pgd);
+}
+
 static int modify_pagetable(unsigned long start, unsigned long end, bool add,
 			    bool direct)
 {
@@ -291,6 +390,8 @@ static int modify_pagetable(unsigned long start, unsigned long end, bool add,
 		ret = modify_p4d_table(pgd, addr, next, add, direct);
 		if (ret)
 			goto out;
+		if (!add)
+			try_free_p4d_table(pgd, addr & PGDIR_MASK);
 	}
 	ret = 0;
 out:
@@ -319,7 +420,6 @@ static int vmem_add_range(unsigned long start, unsigned long size)
 
 /*
  * Remove a physical memory range from the 1:1 mapping.
- * Currently only invalidates page table entries.
  */
 static void vmem_remove_range(unsigned long start, unsigned long size)
 {
-- 
2.26.2




* [PATCH v2 7/9] s390/vmemmap: fallback to PTEs if mapping large PMD fails
  2020-07-22  9:45 [PATCH v2 0/9] s390: implement and optimize vmemmap_free() David Hildenbrand
                   ` (5 preceding siblings ...)
  2020-07-22  9:45 ` [PATCH v2 6/9] s390/vmem: cleanup empty page tables David Hildenbrand
@ 2020-07-22  9:45 ` David Hildenbrand
  2020-07-22  9:45 ` [PATCH v2 8/9] s390/vmemmap: remember unused sub-pmd ranges David Hildenbrand
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: David Hildenbrand @ 2020-07-22  9:45 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-s390, linux-mm, David Hildenbrand, Heiko Carstens,
	Vasily Gorbik, Christian Borntraeger, Gerald Schaefer

Let's fall back to single pages if we are short on huge pages. No need to
stop memory hotplug.
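This is the usual "try large, degrade to small" allocation pattern; a minimal
userspace sketch, with made-up sizes standing in for PMD_SIZE and PAGE_SIZE:

#include <stdio.h>
#include <stdlib.h>

/* Prefer one large block; if that fails, fall back instead of aborting. */
static void *alloc_backing(size_t large, size_t small)
{
	void *p = malloc(large);	/* vmemmap_alloc_block(PMD_SIZE, ...) */

	if (p)
		return p;
	return malloc(small);		/* fall back to 4k pages mapped via PTEs */
}

int main(void)
{
	void *p = alloc_backing(1UL << 20, 4096);

	printf("%s\n", p ? "mapped" : "out of memory");
	free(p);
	return 0;
}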

Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 arch/s390/mm/vmem.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/s390/mm/vmem.c b/arch/s390/mm/vmem.c
index b831f9f9130aa..e82a63de19db2 100644
--- a/arch/s390/mm/vmem.c
+++ b/arch/s390/mm/vmem.c
@@ -180,10 +180,10 @@ static int __ref modify_pmd_table(pud_t *pud, unsigned long addr,
 				 */
 				new_page = vmemmap_alloc_block(PMD_SIZE,
 							       NUMA_NO_NODE);
-				if (!new_page)
-					goto out;
-				pmd_val(*pmd) = __pa(new_page) | prot;
-				continue;
+				if (new_page) {
+					pmd_val(*pmd) = __pa(new_page) | prot;
+					continue;
+				}
 			}
 			pte = vmem_pte_alloc();
 			if (!pte)
-- 
2.26.2




* [PATCH v2 8/9] s390/vmemmap: remember unused sub-pmd ranges
  2020-07-22  9:45 [PATCH v2 0/9] s390: implement and optimize vmemmap_free() David Hildenbrand
                   ` (6 preceding siblings ...)
  2020-07-22  9:45 ` [PATCH v2 7/9] s390/vmemmap: fallback to PTEs if mapping large PMD fails David Hildenbrand
@ 2020-07-22  9:45 ` David Hildenbrand
  2020-07-22  9:45 ` [PATCH v2 9/9] s390/vmemmap: avoid memset(PAGE_UNUSED) when adding consecutive sections David Hildenbrand
  2020-07-24 14:32 ` [PATCH v2 0/9] s390: implement and optimize vmemmap_free() Heiko Carstens
  9 siblings, 0 replies; 11+ messages in thread
From: David Hildenbrand @ 2020-07-22  9:45 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-s390, linux-mm, David Hildenbrand, Heiko Carstens,
	Vasily Gorbik, Christian Borntraeger, Gerald Schaefer

With a memmap size of 56 bytes or 72 bytes per page, the memmap for a
256 MB section won't span full PMDs. As we populate and depopulate single
sections, the depopulation step would not be able to free all vmemmap PMDs
anymore.

Do it similarly to x86, marking the unused memmap ranges in a special way
(pad them with 0xFD).

This allows us to add/remove sections, cleaning up all allocated
vmemmap pages even if the memmap size is not a multiple of 16 bytes per page.

A 56-byte memmap can, for example, be created with !CONFIG_MEMCG and
!CONFIG_SLUB.
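To make the numbers concrete, a small userspace check (assuming the usual
s390 constants: 4 KB pages, 1 MB PMD/segment size, 256 MB sections) shows
that only 64 bytes per page fills PMDs exactly, while 56 and 72 bytes leave
a 512 KB remainder:

#include <stdio.h>

int main(void)
{
	const unsigned long section = 256UL << 20;	/* 256 MB section */
	const unsigned long page = 4096;		/* 4 KB pages */
	const unsigned long pmd = 1UL << 20;		/* 1 MB PMD/segment */
	const unsigned long sizes[] = { 56, 64, 72 };	/* bytes per struct page */

	for (int i = 0; i < 3; i++) {
		unsigned long vmemmap = section / page * sizes[i];

		printf("%lu bytes/page: %lu KB vmemmap, %lu KB beyond the last full PMD\n",
		       sizes[i], vmemmap >> 10, (vmemmap % pmd) >> 10);
	}
	return 0;
}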

Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 arch/s390/mm/vmem.c | 51 ++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 50 insertions(+), 1 deletion(-)

diff --git a/arch/s390/mm/vmem.c b/arch/s390/mm/vmem.c
index e82a63de19db2..df361bbacda1b 100644
--- a/arch/s390/mm/vmem.c
+++ b/arch/s390/mm/vmem.c
@@ -72,6 +72,42 @@ static void vmem_pte_free(unsigned long *table)
 	page_table_free(&init_mm, table);
 }
 
+#define PAGE_UNUSED 0xFD
+
+static void vmemmap_use_sub_pmd(unsigned long start, unsigned long end)
+{
+	/*
+	 * As we expect to add in the same granularity as we remove, it's
+	 * sufficient to mark only some piece used to block the memmap page from
+	 * getting removed (just in case the memmap never gets initialized,
+	 * e.g., because the memory block never gets onlined).
+	 */
+	memset(__va(start), 0, sizeof(struct page));
+}
+
+static void vmemmap_use_new_sub_pmd(unsigned long start, unsigned long end)
+{
+	void *page = __va(ALIGN_DOWN(start, PMD_SIZE));
+
+	/* Could be our memmap page is filled with PAGE_UNUSED already ... */
+	vmemmap_use_sub_pmd(start, end);
+
+	/* Mark the unused parts of the new memmap page PAGE_UNUSED. */
+	if (!IS_ALIGNED(start, PMD_SIZE))
+		memset(page, PAGE_UNUSED, start - __pa(page));
+	if (!IS_ALIGNED(end, PMD_SIZE))
+		memset(__va(end), PAGE_UNUSED, __pa(page) + PMD_SIZE - end);
+}
+
+/* Returns true if the PMD is completely unused and can be freed. */
+static bool vmemmap_unuse_sub_pmd(unsigned long start, unsigned long end)
+{
+	void *page = __va(ALIGN_DOWN(start, PMD_SIZE));
+
+	memset(__va(start), PAGE_UNUSED, end - start);
+	return !memchr_inv(page, PAGE_UNUSED, PMD_SIZE);
+}
+
 /* __ref: we'll only call vmemmap_alloc_block() via vmemmap_populate() */
 static int __ref modify_pte_table(pmd_t *pmd, unsigned long addr,
 				  unsigned long end, bool add, bool direct)
@@ -157,6 +193,11 @@ static int __ref modify_pmd_table(pud_t *pud, unsigned long addr,
 								get_order(PMD_SIZE));
 					pmd_clear(pmd);
 					pages++;
+				} else if (!direct &&
+					   vmemmap_unuse_sub_pmd(addr, next)) {
+					vmem_free_pages(pmd_deref(*pmd),
+							get_order(PMD_SIZE));
+					pmd_clear(pmd);
 				}
 				continue;
 			}
@@ -182,6 +223,11 @@ static int __ref modify_pmd_table(pud_t *pud, unsigned long addr,
 							       NUMA_NO_NODE);
 				if (new_page) {
 					pmd_val(*pmd) = __pa(new_page) | prot;
+					if (!IS_ALIGNED(addr, PMD_SIZE) ||
+					    !IS_ALIGNED(next, PMD_SIZE)) {
+						vmemmap_use_new_sub_pmd(addr,
+									next);
+					}
 					continue;
 				}
 			}
@@ -189,8 +235,11 @@ static int __ref modify_pmd_table(pud_t *pud, unsigned long addr,
 			if (!pte)
 				goto out;
 			pmd_populate(&init_mm, pmd, pte);
-		} else if (pmd_large(*pmd))
+		} else if (pmd_large(*pmd)) {
+			if (!direct)
+				vmemmap_use_sub_pmd(addr, next);
 			continue;
+		}
 
 		ret = modify_pte_table(pmd, addr, next, add, direct);
 		if (ret)
-- 
2.26.2




* [PATCH v2 9/9] s390/vmemmap: avoid memset(PAGE_UNUSED) when adding consecutive sections
  2020-07-22  9:45 [PATCH v2 0/9] s390: implement and optimize vmemmap_free() David Hildenbrand
                   ` (7 preceding siblings ...)
  2020-07-22  9:45 ` [PATCH v2 8/9] s390/vmemmap: remember unused sub-pmd ranges David Hildenbrand
@ 2020-07-22  9:45 ` David Hildenbrand
  2020-07-24 14:32 ` [PATCH v2 0/9] s390: implement and optimize vmemmap_free() Heiko Carstens
  9 siblings, 0 replies; 11+ messages in thread
From: David Hildenbrand @ 2020-07-22  9:45 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-s390, linux-mm, David Hildenbrand, Heiko Carstens,
	Vasily Gorbik, Christian Borntraeger, Gerald Schaefer

Let's avoid memset(PAGE_UNUSED) when adding consecutive sections,
where the vmemmap of a single section does not span full PMDs.
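A heavily reduced userspace sketch of the deferral (hypothetical, with a
16-byte toy "PMD"): remember where the unused tail of the last touched PMD
starts, and only memset() it once the next populated range turns out not to
continue right there (the role unused_pmd_start plays below):

#include <stdio.h>
#include <string.h>

#define PMD_SIZE	16	/* toy PMD of 16 bytes instead of 1 MB */
#define PAGE_UNUSED	0xFD

static unsigned char pmd_page[PMD_SIZE];
static int unused_start;	/* 0 == nothing deferred */

static void flush_unused(void)
{
	if (!unused_start)
		return;
	memset(pmd_page + unused_start, PAGE_UNUSED, PMD_SIZE - unused_start);
	unused_start = 0;
}

static void use_range(int start, int end)
{
	if (unused_start && unused_start == start) {
		/* Directly continues the previous range: skip the memset. */
		unused_start = (end == PMD_SIZE) ? 0 : end;
		return;
	}
	flush_unused();
	memset(pmd_page + start, 0, end - start);	/* "use" this piece */
	if (end != PMD_SIZE)
		unused_start = end;	/* defer marking the tail unused */
}

int main(void)
{
	use_range(0, 6);	/* first "section": tail not memset yet */
	use_range(6, 12);	/* consecutive "section": memset skipped again */
	flush_unused();		/* e.g., before checking/freeing the PMD */
	printf("tail byte: 0x%x\n", pmd_page[PMD_SIZE - 1]);	/* 0xfd */
	return 0;
}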

Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 arch/s390/mm/vmem.c | 45 ++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 42 insertions(+), 3 deletions(-)

diff --git a/arch/s390/mm/vmem.c b/arch/s390/mm/vmem.c
index df361bbacda1b..70ebfc7958a68 100644
--- a/arch/s390/mm/vmem.c
+++ b/arch/s390/mm/vmem.c
@@ -74,7 +74,22 @@ static void vmem_pte_free(unsigned long *table)
 
 #define PAGE_UNUSED 0xFD
 
-static void vmemmap_use_sub_pmd(unsigned long start, unsigned long end)
+/*
+ * The unused vmemmap range, which was not yet memset(PAGE_UNUSED) ranges
+ * from unused_pmd_start to next PMD_SIZE boundary.
+ */
+static unsigned long unused_pmd_start;
+
+static void vmemmap_flush_unused_pmd(void)
+{
+	if (!unused_pmd_start)
+		return;
+	memset(__va(unused_pmd_start), PAGE_UNUSED,
+	       ALIGN(unused_pmd_start, PMD_SIZE) - unused_pmd_start);
+	unused_pmd_start = 0;
+}
+
+static void __vmemmap_use_sub_pmd(unsigned long start, unsigned long end)
 {
 	/*
 	 * As we expect to add in the same granularity as we remove, it's
@@ -85,18 +100,41 @@ static void vmemmap_use_sub_pmd(unsigned long start, unsigned long end)
 	memset(__va(start), 0, sizeof(struct page));
 }
 
+static void vmemmap_use_sub_pmd(unsigned long start, unsigned long end)
+{
+	/*
+	 * We only optimize if the new used range directly follows the
+	 * previously unused range (esp., when populating consecutive sections).
+	 */
+	if (unused_pmd_start == start) {
+		unused_pmd_start = end;
+		if (likely(IS_ALIGNED(unused_pmd_start, PMD_SIZE)))
+			unused_pmd_start = 0;
+		return;
+	}
+	vmemmap_flush_unused_pmd();
+	__vmemmap_use_sub_pmd(start, end);
+}
+
 static void vmemmap_use_new_sub_pmd(unsigned long start, unsigned long end)
 {
 	void *page = __va(ALIGN_DOWN(start, PMD_SIZE));
 
+	vmemmap_flush_unused_pmd();
+
 	/* Could be our memmap page is filled with PAGE_UNUSED already ... */
-	vmemmap_use_sub_pmd(start, end);
+	__vmemmap_use_sub_pmd(start, end);
 
 	/* Mark the unused parts of the new memmap page PAGE_UNUSED. */
 	if (!IS_ALIGNED(start, PMD_SIZE))
 		memset(page, PAGE_UNUSED, start - __pa(page));
+	/*
+	 * We want to avoid memset(PAGE_UNUSED) when populating the vmemmap of
+	 * consecutive sections. Remember for the last added PMD the last
+	 * unused range in the populated PMD.
+	 */
 	if (!IS_ALIGNED(end, PMD_SIZE))
-		memset(__va(end), PAGE_UNUSED, __pa(page) + PMD_SIZE - end);
+		unused_pmd_start = end;
 }
 
 /* Returns true if the PMD is completely unused and can be freed. */
@@ -104,6 +142,7 @@ static bool vmemmap_unuse_sub_pmd(unsigned long start, unsigned long end)
 {
 	void *page = __va(ALIGN_DOWN(start, PMD_SIZE));
 
+	vmemmap_flush_unused_pmd();
 	memset(__va(start), PAGE_UNUSED, end - start);
 	return !memchr_inv(page, PAGE_UNUSED, PMD_SIZE);
 }
-- 
2.26.2




* Re: [PATCH v2 0/9] s390: implement and optimize vmemmap_free()
  2020-07-22  9:45 [PATCH v2 0/9] s390: implement and optimize vmemmap_free() David Hildenbrand
                   ` (8 preceding siblings ...)
  2020-07-22  9:45 ` [PATCH v2 9/9] s390/vmemmap: avoid memset(PAGE_UNUSED) when adding consecutive sections David Hildenbrand
@ 2020-07-24 14:32 ` Heiko Carstens
  9 siblings, 0 replies; 11+ messages in thread
From: Heiko Carstens @ 2020-07-24 14:32 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-s390, linux-mm, Christian Borntraeger,
	Gerald Schaefer, Vasily Gorbik

On Wed, Jul 22, 2020 at 11:45:49AM +0200, David Hildenbrand wrote:
> This series is based on the latest s390/features branch [1]. It
> consolidates vmem_add_range(), vmem_remove_range(), and vmemmap_populate()
> into a single, recursive page table walker. It then implements
> vmemmap_free() and optimizes it by
> - Freeing empty page tables (also done for vmem_remove_range()).
> - Handling cases where the vmemmap of a section does not fill huge pages
>   completely (e.g., sizeof(struct page) == 56).
> 
> vmemmap_free() is currently never used, unless adding standby memory fails
> (unlikely). This is relevant for virtio-mem, which adds/removes memory
> in memory block/section granularity (it always removes memory in the same
> granularity in which it was added).
> 
> I gave this a proper test with my virtio-mem prototype (which I will share
> in the near future), both with 56 byte memmap per page and 64 byte memmap
> per page, with and without huge page support. In both cases, removing
> memory (routed through arch_remove_memory()) will result in
> - all populated vmemmap pages getting removed/freed
> - all applicable page tables for the vmemmap getting removed/freed
> - all applicable page tables for the identity mapping getting removed/freed
> Unfortunately, I don't have access to bigger environments or to z/VM
> (esp. dcss) setups.
> 
> This is the basis for real memory hotunplug support for s390x and should
> complete my journey into the s390x vmem/vmemmap code for now.
> 
> What needs double-checking is TLB flushing. AFAICS, as there are no valid
> accesses, doing a single range flush at the end is sufficient, both when
> removing the vmemmap pages and when removing the identity mapping.
> 
> [1] https://git.kernel.org/pub/scm/linux/kernel/git/s390/linux.git/commit/?h=features
> 
> v1 -> v2:
> - Convert to a single page table walker named "modify_pagetable()", with
>   two helper functions "add_pagetable()" and "remove_pagetable()".
> 
> David Hildenbrand (9):
>   s390/vmem: rename vmem_add_mem() to vmem_add_range()
>   s390/vmem: consolidate vmem_add_range() and vmem_remove_range()
>   s390/vmemmap: extend modify_pagetable() to handle vmemmap
>   s390/vmemmap: cleanup when vmemmap_populate() fails
>   s390/vmemmap: take the vmem_mutex when populating/freeing
>   s390/vmem: cleanup empty page tables
>   s390/vmemmap: fallback to PTEs if mapping large PMD fails
>   s390/vmemmap: remember unused sub-pmd ranges
>   s390/vmemmap: avoid memset(PAGE_UNUSED) when adding consecutive
>     sections
> 
>  arch/s390/mm/vmem.c | 637 ++++++++++++++++++++++++++++++--------------
>  1 file changed, 442 insertions(+), 195 deletions(-)

Series applied, thank you!



