linux-mm.kvack.org archive mirror
* [PATCH v4 0/3] Cleanup and fixups for vmemmap handling
@ 2021-03-01  8:32 Oscar Salvador
  2021-03-01  8:32 ` [PATCH v4 1/3] x86/vmemmap: Drop handling of 4K unaligned vmemmap range Oscar Salvador
                   ` (3 more replies)
  0 siblings, 4 replies; 21+ messages in thread
From: Oscar Salvador @ 2021-03-01  8:32 UTC (permalink / raw)
  To: Andrew Morton
  Cc: David Hildenbrand, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Michal Hocko, linux-mm, linux-kernel,
	Oscar Salvador

Hi Andrew,

Now that 5.12-rc1 is out, and as discussed, here is a new version on top of it.
Please consider picking up the series.

Thanks

Original cover letter:

----

Hi,

This series contains cleanups to remove dead code that handles
unaligned cases for 4K and 1GB pages (patch#1 and patch#2) when
removing the vmemmap range, and a fix (patch#3) to handle the case
when two vmemmap ranges intersect the same PMD.

More details can be found in the respective changelogs.

 v3 -> v4:
 - Rebase on top of 5.12-rc1 as Andrew suggested
 - Added last Reviewed-by for the last patch

 v2 -> v3:
 - Make sure we do not clear the PUD entry in case
   we are not removing the whole range.
 - Add Reviewed-by

 v1 -> v2:
 - Remove dead code in remove_pud_table as well
 - Addressed feedback from David
 - Place the vmemmap functions that take care of unaligned PMDs
   within CONFIG_SPARSEMEM_VMEMMAP


Oscar Salvador (3):
  x86/vmemmap: Drop handling of 4K unaligned vmemmap range
  x86/vmemmap: Drop handling of 1GB vmemmap ranges
  x86/vmemmap: Handle unpopulated sub-pmd ranges

 arch/x86/mm/init_64.c | 189 +++++++++++++++++++++++++++++++-------------------
 1 file changed, 118 insertions(+), 71 deletions(-)

-- 
2.16.3




* [PATCH v4 1/3] x86/vmemmap: Drop handling of 4K unaligned vmemmap range
  2021-03-01  8:32 [PATCH v4 0/3] Cleanup and fixups for vmemmap handling Oscar Salvador
@ 2021-03-01  8:32 ` Oscar Salvador
  2021-03-04 15:50   ` Dave Hansen
  2021-03-01  8:32 ` [PATCH v4 2/3] x86/vmemmap: Drop handling of 1GB vmemmap ranges Oscar Salvador
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 21+ messages in thread
From: Oscar Salvador @ 2021-03-01  8:32 UTC (permalink / raw)
  To: Andrew Morton
  Cc: David Hildenbrand, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Michal Hocko, linux-mm, linux-kernel,
	Oscar Salvador

remove_pte_table() is prepared to handle the case where either the
start or the end of the range is not PAGE aligned.
This cannot actually happen:

__populate_section_memmap enforces the range to be PMD aligned,
so as long as the size of the struct page remains multiple of 8,
the vmemmap range will be aligned to PAGE_SIZE.
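
Spelling the arithmetic out (a sketch, assuming 4K base pages and
writing sizeof(struct page) as 8 * k):

	memmap size = nr_pages * sizeof(struct page)
	            = (n * 512) * (8 * k)   /* range is PMD, i.e. 512-page, aligned */
	            = n * k * 4096          /* always a multiple of PAGE_SIZE */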

Drop the dead code and place a VM_BUG_ON in vmemmap_{populate,free}
to catch nasty cases.

Signed-off-by: Oscar Salvador <osalvador@suse.de>
Suggested-by: David Hildenbrand <david@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
---
 arch/x86/mm/init_64.c | 48 +++++++++++++-----------------------------------
 1 file changed, 13 insertions(+), 35 deletions(-)

diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index b5a3fa4033d3..b0e1d215c83e 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -962,7 +962,6 @@ remove_pte_table(pte_t *pte_start, unsigned long addr, unsigned long end,
 {
 	unsigned long next, pages = 0;
 	pte_t *pte;
-	void *page_addr;
 	phys_addr_t phys_addr;
 
 	pte = pte_start + pte_index(addr);
@@ -983,42 +982,15 @@ remove_pte_table(pte_t *pte_start, unsigned long addr, unsigned long end,
 		if (phys_addr < (phys_addr_t)0x40000000)
 			return;
 
-		if (PAGE_ALIGNED(addr) && PAGE_ALIGNED(next)) {
-			/*
-			 * Do not free direct mapping pages since they were
-			 * freed when offlining, or simplely not in use.
-			 */
-			if (!direct)
-				free_pagetable(pte_page(*pte), 0);
-
-			spin_lock(&init_mm.page_table_lock);
-			pte_clear(&init_mm, addr, pte);
-			spin_unlock(&init_mm.page_table_lock);
+		if (!direct)
+			free_pagetable(pte_page(*pte), 0);
 
-			/* For non-direct mapping, pages means nothing. */
-			pages++;
-		} else {
-			/*
-			 * If we are here, we are freeing vmemmap pages since
-			 * direct mapped memory ranges to be freed are aligned.
-			 *
-			 * If we are not removing the whole page, it means
-			 * other page structs in this page are being used and
-			 * we canot remove them. So fill the unused page_structs
-			 * with 0xFD, and remove the page when it is wholly
-			 * filled with 0xFD.
-			 */
-			memset((void *)addr, PAGE_INUSE, next - addr);
-
-			page_addr = page_address(pte_page(*pte));
-			if (!memchr_inv(page_addr, PAGE_INUSE, PAGE_SIZE)) {
-				free_pagetable(pte_page(*pte), 0);
+		spin_lock(&init_mm.page_table_lock);
+		pte_clear(&init_mm, addr, pte);
+		spin_unlock(&init_mm.page_table_lock);
 
-				spin_lock(&init_mm.page_table_lock);
-				pte_clear(&init_mm, addr, pte);
-				spin_unlock(&init_mm.page_table_lock);
-			}
-		}
+		/* For non-direct mapping, pages means nothing. */
+		pages++;
 	}
 
 	/* Call free_pte_table() in remove_pmd_table(). */
@@ -1197,6 +1169,9 @@ remove_pagetable(unsigned long start, unsigned long end, bool direct,
 void __ref vmemmap_free(unsigned long start, unsigned long end,
 		struct vmem_altmap *altmap)
 {
+	VM_BUG_ON(!IS_ALIGNED(start, PAGE_SIZE));
+	VM_BUG_ON(!IS_ALIGNED(end, PAGE_SIZE));
+
 	remove_pagetable(start, end, false, altmap);
 }
 
@@ -1556,6 +1531,9 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
 {
 	int err;
 
+	VM_BUG_ON(!IS_ALIGNED(start, PAGE_SIZE));
+	VM_BUG_ON(!IS_ALIGNED(end, PAGE_SIZE));
+
 	if (end - start < PAGES_PER_SECTION * sizeof(struct page))
 		err = vmemmap_populate_basepages(start, end, node, NULL);
 	else if (boot_cpu_has(X86_FEATURE_PSE))
-- 
2.16.3




* [PATCH v4 2/3] x86/vmemmap: Drop handling of 1GB vmemmap ranges
  2021-03-01  8:32 [PATCH v4 0/3] Cleanup and fixups for vmemmap handling Oscar Salvador
  2021-03-01  8:32 ` [PATCH v4 1/3] x86/vmemmap: Drop handling of 4K unaligned vmemmap range Oscar Salvador
@ 2021-03-01  8:32 ` Oscar Salvador
  2021-03-04 18:42   ` Dave Hansen
  2021-03-01  8:32 ` [PATCH v4 3/3] x86/vmemmap: Handle unpopulated sub-pmd ranges Oscar Salvador
  2021-03-03  0:09 ` [PATCH v4 0/3] Cleanup and fixups for vmemmap handling Andrew Morton
  3 siblings, 1 reply; 21+ messages in thread
From: Oscar Salvador @ 2021-03-01  8:32 UTC (permalink / raw)
  To: Andrew Morton
  Cc: David Hildenbrand, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Michal Hocko, linux-mm, linux-kernel,
	Oscar Salvador

We never get to allocate 1GB pages when mapping the vmemmap range.
Drop the dead code both for the aligned and unaligned cases and leave
only the direct map handling.

Signed-off-by: Oscar Salvador <osalvador@suse.de>
Suggested-by: David Hildenbrand <david@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
---
 arch/x86/mm/init_64.c | 35 +++++++----------------------------
 1 file changed, 7 insertions(+), 28 deletions(-)

diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index b0e1d215c83e..9ecb3c488ac8 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -1062,7 +1062,6 @@ remove_pud_table(pud_t *pud_start, unsigned long addr, unsigned long end,
 	unsigned long next, pages = 0;
 	pmd_t *pmd_base;
 	pud_t *pud;
-	void *page_addr;
 
 	pud = pud_start + pud_index(addr);
 	for (; addr < end; addr = next, pud++) {
@@ -1071,33 +1070,13 @@ remove_pud_table(pud_t *pud_start, unsigned long addr, unsigned long end,
 		if (!pud_present(*pud))
 			continue;
 
-		if (pud_large(*pud)) {
-			if (IS_ALIGNED(addr, PUD_SIZE) &&
-			    IS_ALIGNED(next, PUD_SIZE)) {
-				if (!direct)
-					free_pagetable(pud_page(*pud),
-						       get_order(PUD_SIZE));
-
-				spin_lock(&init_mm.page_table_lock);
-				pud_clear(pud);
-				spin_unlock(&init_mm.page_table_lock);
-				pages++;
-			} else {
-				/* If here, we are freeing vmemmap pages. */
-				memset((void *)addr, PAGE_INUSE, next - addr);
-
-				page_addr = page_address(pud_page(*pud));
-				if (!memchr_inv(page_addr, PAGE_INUSE,
-						PUD_SIZE)) {
-					free_pagetable(pud_page(*pud),
-						       get_order(PUD_SIZE));
-
-					spin_lock(&init_mm.page_table_lock);
-					pud_clear(pud);
-					spin_unlock(&init_mm.page_table_lock);
-				}
-			}
-
+		if (pud_large(*pud) &&
+		    IS_ALIGNED(addr, PUD_SIZE) &&
+		    IS_ALIGNED(next, PUD_SIZE)) {
+			spin_lock(&init_mm.page_table_lock);
+			pud_clear(pud);
+			spin_unlock(&init_mm.page_table_lock);
+			pages++;
 			continue;
 		}
 
-- 
2.16.3




* [PATCH v4 3/3] x86/vmemmap: Handle unpopulated sub-pmd ranges
  2021-03-01  8:32 [PATCH v4 0/3] Cleanup and fixups for vmemmap handling Oscar Salvador
  2021-03-01  8:32 ` [PATCH v4 1/3] x86/vmemmap: Drop handling of 4K unaligned vmemmap range Oscar Salvador
  2021-03-01  8:32 ` [PATCH v4 2/3] x86/vmemmap: Drop handling of 1GB vmemmap ranges Oscar Salvador
@ 2021-03-01  8:32 ` Oscar Salvador
  2021-03-04 17:02   ` Dave Hansen
  2021-03-09 19:39   ` Zi Yan
  2021-03-03  0:09 ` [PATCH v4 0/3] Cleanup and fixups for vmemmap handling Andrew Morton
  3 siblings, 2 replies; 21+ messages in thread
From: Oscar Salvador @ 2021-03-01  8:32 UTC (permalink / raw)
  To: Andrew Morton
  Cc: David Hildenbrand, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Michal Hocko, linux-mm, linux-kernel,
	Oscar Salvador

When the size of a struct page is not multiple of 2MB, sections do
not span a PMD anymore and so when populating them some parts of the
PMD will remain unused.
Because of this, PMDs will be left behind when depopulating sections
since remove_pmd_table() thinks that those unused parts are still in
use.

Fix this by marking the unused parts with PAGE_UNUSED, so memchr_inv()
will do the right thing and will let us free the PMD when the last user
of it is gone.
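
For illustration, with made-up numbers (4K base pages and a hypothetical
56-byte struct page):

	128MB section = 32768 base pages
	32768 * 56    = 1.75MB of memmap

so the section only uses 1.75MB of a 2MB vmemmap PMD, and the remaining
0.25MB has to be accounted for before the PMD can be freed.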

This patch is based on a similar patch by David Hildenbrand:

https://lore.kernel.org/linux-mm/20200722094558.9828-9-david@redhat.com/
https://lore.kernel.org/linux-mm/20200722094558.9828-10-david@redhat.com/

Signed-off-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: David Hildenbrand <david@redhat.com>
---
 arch/x86/mm/init_64.c | 106 ++++++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 98 insertions(+), 8 deletions(-)

diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 9ecb3c488ac8..7e8de63f02b3 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -871,7 +871,93 @@ int arch_add_memory(int nid, u64 start, u64 size,
 	return add_pages(nid, start_pfn, nr_pages, params);
 }
 
-#define PAGE_INUSE 0xFD
+#ifdef CONFIG_SPARSEMEM_VMEMMAP
+#define PAGE_UNUSED 0xFD
+
+/*
+ * The unused vmemmap range, which was not yet memset(PAGE_UNUSED) ranges
+ * from unused_pmd_start to next PMD_SIZE boundary.
+ */
+static unsigned long unused_pmd_start __meminitdata;
+
+static void __meminit vmemmap_flush_unused_pmd(void)
+{
+	if (!unused_pmd_start)
+		return;
+	/*
+	 * Clears (unused_pmd_start, PMD_END]
+	 */
+	memset((void *)unused_pmd_start, PAGE_UNUSED,
+	       ALIGN(unused_pmd_start, PMD_SIZE) - unused_pmd_start);
+	unused_pmd_start = 0;
+}
+
+/* Returns true if the PMD is completely unused and thus it can be freed */
+static bool __meminit vmemmap_unuse_sub_pmd(unsigned long addr, unsigned long end)
+{
+	unsigned long start = ALIGN_DOWN(addr, PMD_SIZE);
+
+	vmemmap_flush_unused_pmd();
+	memset((void *)addr, PAGE_UNUSED, end - addr);
+
+	return !memchr_inv((void *)start, PAGE_UNUSED, PMD_SIZE);
+}
+
+static void __meminit __vmemmap_use_sub_pmd(unsigned long start)
+{
+	/*
+	 * As we expect to add in the same granularity as we remove, it's
+	 * sufficient to mark only some piece used to block the memmap page from
+	 * getting removed when removing some other adjacent memmap (just in
+	 * case the first memmap never gets initialized e.g., because the memory
+	 * block never gets onlined).
+	 */
+	memset((void *)start, 0, sizeof(struct page));
+}
+
+static void __meminit vmemmap_use_sub_pmd(unsigned long start, unsigned long end)
+{
+	/*
+	 * We only optimize if the new used range directly follows the
+	 * previously unused range (esp., when populating consecutive sections).
+	 */
+	if (unused_pmd_start == start) {
+		if (likely(IS_ALIGNED(end, PMD_SIZE)))
+			unused_pmd_start = 0;
+		else
+			unused_pmd_start = end;
+		return;
+	}
+
+	vmemmap_flush_unused_pmd();
+	__vmemmap_use_sub_pmd(start);
+}
+
+static void __meminit vmemmap_use_new_sub_pmd(unsigned long start, unsigned long end)
+{
+	vmemmap_flush_unused_pmd();
+
+	/*
+	 * Could be our memmap page is filled with PAGE_UNUSED already from a
+	 * previous remove.
+	 */
+	__vmemmap_use_sub_pmd(start);
+
+	/*
+	 * Mark the unused parts of the new memmap range
+	 */
+	if (!IS_ALIGNED(start, PMD_SIZE))
+		memset((void *)start, PAGE_UNUSED,
+		       start - ALIGN_DOWN(start, PMD_SIZE));
+	/*
+	 * We want to avoid memset(PAGE_UNUSED) when populating the vmemmap of
+	 * consecutive sections. Remember for the last added PMD the last
+	 * unused range in the populated PMD.
+	 */
+	if (!IS_ALIGNED(end, PMD_SIZE))
+		unused_pmd_start = end;
+}
+#endif
 
 static void __meminit free_pagetable(struct page *page, int order)
 {
@@ -1006,7 +1092,6 @@ remove_pmd_table(pmd_t *pmd_start, unsigned long addr, unsigned long end,
 	unsigned long next, pages = 0;
 	pte_t *pte_base;
 	pmd_t *pmd;
-	void *page_addr;
 
 	pmd = pmd_start + pmd_index(addr);
 	for (; addr < end; addr = next, pmd++) {
@@ -1027,12 +1112,11 @@ remove_pmd_table(pmd_t *pmd_start, unsigned long addr, unsigned long end,
 				spin_unlock(&init_mm.page_table_lock);
 				pages++;
 			} else {
-				/* If here, we are freeing vmemmap pages. */
-				memset((void *)addr, PAGE_INUSE, next - addr);
-
-				page_addr = page_address(pmd_page(*pmd));
-				if (!memchr_inv(page_addr, PAGE_INUSE,
-						PMD_SIZE)) {
+#ifdef CONFIG_SPARSEMEM_VMEMMAP
+				/*
+				 * Free the PMD if the whole range is unused.
+				 */
+				if (vmemmap_unuse_sub_pmd(addr, next)) {
 					free_hugepage_table(pmd_page(*pmd),
 							    altmap);
 
@@ -1040,6 +1124,7 @@ remove_pmd_table(pmd_t *pmd_start, unsigned long addr, unsigned long end,
 					pmd_clear(pmd);
 					spin_unlock(&init_mm.page_table_lock);
 				}
+#endif
 			}
 
 			continue;
@@ -1492,11 +1577,16 @@ static int __meminit vmemmap_populate_hugepages(unsigned long start,
 
 				addr_end = addr + PMD_SIZE;
 				p_end = p + PMD_SIZE;
+
+				if (!IS_ALIGNED(addr, PMD_SIZE) ||
+				    !IS_ALIGNED(next, PMD_SIZE))
+					vmemmap_use_new_sub_pmd(addr, next);
 				continue;
 			} else if (altmap)
 				return -ENOMEM; /* no fallback */
 		} else if (pmd_large(*pmd)) {
 			vmemmap_verify((pte_t *)pmd, node, addr, next);
+			vmemmap_use_sub_pmd(addr, next);
 			continue;
 		}
 		if (vmemmap_populate_basepages(addr, next, node, NULL))
-- 
2.16.3




* Re: [PATCH v4 0/3] Cleanup and fixups for vmemmap handling
  2021-03-01  8:32 [PATCH v4 0/3] Cleanup and fixups for vmemmap handling Oscar Salvador
                   ` (2 preceding siblings ...)
  2021-03-01  8:32 ` [PATCH v4 3/3] x86/vmemmap: Handle unpopulated sub-pmd ranges Oscar Salvador
@ 2021-03-03  0:09 ` Andrew Morton
  2021-03-03 18:38   ` Borislav Petkov
  3 siblings, 1 reply; 21+ messages in thread
From: Andrew Morton @ 2021-03-03  0:09 UTC (permalink / raw)
  To: Oscar Salvador
  Cc: David Hildenbrand, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Michal Hocko, linux-mm, linux-kernel

On Mon,  1 Mar 2021 09:32:27 +0100 Oscar Salvador <osalvador@suse.de> wrote:

> Hi Andrew,
> 
> Now that 5.12-rc1 is out, and as discussed, here is a new version on top
> of it.
> Please consider picking up the series.
>

I grabbed them, but...

> 
>  arch/x86/mm/init_64.c | 189 +++++++++++++++++++++++++++++++-------------------

Perhaps a better route would be via an x86 tree.



* Re: [PATCH v4 0/3] Cleanup and fixups for vmemmap handling
  2021-03-03  0:09 ` [PATCH v4 0/3] Cleanup and fixups for vmemmap handling Andrew Morton
@ 2021-03-03 18:38   ` Borislav Petkov
  2021-03-03 23:04     ` Andrew Morton
  0 siblings, 1 reply; 21+ messages in thread
From: Borislav Petkov @ 2021-03-03 18:38 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Oscar Salvador, David Hildenbrand, Dave Hansen, Andy Lutomirski,
	Peter Zijlstra, Thomas Gleixner, Ingo Molnar, x86,
	H . Peter Anvin, Michal Hocko, linux-mm, linux-kernel

On Tue, Mar 02, 2021 at 04:09:35PM -0800, Andrew Morton wrote:
> >  arch/x86/mm/init_64.c | 189 +++++++++++++++++++++++++++++++-------------------
> 
> Perhaps a better route would be via an x86 tree.

I assumed you took mm stuff, even if arch-specific. I can still take
them through tip if you prefer...

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette



* Re: [PATCH v4 0/3] Cleanup and fixups for vmemmap handling
  2021-03-03 18:38   ` Borislav Petkov
@ 2021-03-03 23:04     ` Andrew Morton
  0 siblings, 0 replies; 21+ messages in thread
From: Andrew Morton @ 2021-03-03 23:04 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Oscar Salvador, David Hildenbrand, Dave Hansen, Andy Lutomirski,
	Peter Zijlstra, Thomas Gleixner, Ingo Molnar, x86,
	H . Peter Anvin, Michal Hocko, linux-mm, linux-kernel

On Wed, 3 Mar 2021 19:38:30 +0100 Borislav Petkov <bp@alien8.de> wrote:

> On Tue, Mar 02, 2021 at 04:09:35PM -0800, Andrew Morton wrote:
> > >  arch/x86/mm/init_64.c | 189 +++++++++++++++++++++++++++++++-------------------
> > 
> > Perhaps a better route would be via an x86 tree.
> 
> I assumed you took mm stuff, even if arch-specific. I can still take
> them through tip if you prefer...
> 

I don't mind.  Some review-and-acking would be nice?



* Re: [PATCH v4 1/3] x86/vmemmap: Drop handling of 4K unaligned vmemmap range
  2021-03-01  8:32 ` [PATCH v4 1/3] x86/vmemmap: Drop handling of 4K unaligned vmemmap range Oscar Salvador
@ 2021-03-04 15:50   ` Dave Hansen
  2021-03-08 18:20     ` Oscar Salvador
  0 siblings, 1 reply; 21+ messages in thread
From: Dave Hansen @ 2021-03-04 15:50 UTC (permalink / raw)
  To: Oscar Salvador, Andrew Morton
  Cc: David Hildenbrand, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Michal Hocko, linux-mm, linux-kernel

On 3/1/21 12:32 AM, Oscar Salvador wrote:
> remove_pte_table() is prepared to handle the case where either the
> start or the end of the range is not PAGE aligned.
> This cannot actually happen:
> 
> __populate_section_memmap enforces the range to be PMD aligned,
> so as long as the size of the struct page remains multiple of 8,
> the vmemmap range will be aligned to PAGE_SIZE.
> 
> Drop the dead code and place a VM_BUG_ON in vmemmap_{populate,free}
> to catch nasty cases.

I was wondering why the VM_BUG_ON()s went in vmemmap_free() instead of
closer to the code that you modified in remove_pte_table().  I assume
this was because vmemmap_free() is the only (indirect) caller of
remove_pte_table().

Otherwise, this looks fine to me:

Acked-by: Dave Hansen <dave.hansen@linux.intel.com>



* Re: [PATCH v4 3/3] x86/vmemmap: Handle unpopulated sub-pmd ranges
  2021-03-01  8:32 ` [PATCH v4 3/3] x86/vmemmap: Handle unpopulated sub-pmd ranges Oscar Salvador
@ 2021-03-04 17:02   ` Dave Hansen
  2021-03-04 17:08     ` Dave Hansen
  2021-03-08 18:43     ` Oscar Salvador
  2021-03-09 19:39   ` Zi Yan
  1 sibling, 2 replies; 21+ messages in thread
From: Dave Hansen @ 2021-03-04 17:02 UTC (permalink / raw)
  To: Oscar Salvador, Andrew Morton
  Cc: David Hildenbrand, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Michal Hocko, linux-mm, linux-kernel

On 3/1/21 12:32 AM, Oscar Salvador wrote:
> When the size of a struct page is not multiple of 2MB, sections do
> not span a PMD anymore and so when populating them some parts of the
> PMD will remain unused.

Multiples of 2MB are 2MB, 4MB, 6MB, etc...

I think you meant here that 2MB must be a multiple of the 'struct page'
size.  I don't think there are any non-power-of-2 factors of 2MB, so I
think it's probably simpler and more accurate to say:

	When sizeof(struct page) is not a power of 2...

I also don't think I realized that 2MB of 'struct page'
(2M/sizeof(struct page)=32k) fits perfectly into a 128MB section
(32k*4k=128M).
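
Spelling that out (assuming 4K base pages and a 64-byte struct page):

	2MB of memmap / 64 bytes per struct page = 32768 struct pages
	32768 struct pages * 4KB each            = 128MB of covered memory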

> Because of this, PMDs will be left behind when depopulating sections
> since remove_pmd_table() thinks that those unused parts are still in
> use.
> 
> Fix this by marking the unused parts with PAGE_UNUSED, so memchr_inv()
> will do the right thing and will let us free the PMD when the last user
> of it is gone.
> 
> This patch is based on a similar patch by David Hildenbrand:
> 
> https://lore.kernel.org/linux-mm/20200722094558.9828-9-david@redhat.com/
> https://lore.kernel.org/linux-mm/20200722094558.9828-10-david@redhat.com/
> 
> Signed-off-by: Oscar Salvador <osalvador@suse.de>
> Reviewed-by: David Hildenbrand <david@redhat.com>
> ---
>  arch/x86/mm/init_64.c | 106 ++++++++++++++++++++++++++++++++++++++++++++++----
>  1 file changed, 98 insertions(+), 8 deletions(-)
> 
> diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
> index 9ecb3c488ac8..7e8de63f02b3 100644
> --- a/arch/x86/mm/init_64.c
> +++ b/arch/x86/mm/init_64.c
> @@ -871,7 +871,93 @@ int arch_add_memory(int nid, u64 start, u64 size,
>  	return add_pages(nid, start_pfn, nr_pages, params);
>  }
>  
> -#define PAGE_INUSE 0xFD
> +#ifdef CONFIG_SPARSEMEM_VMEMMAP
> +#define PAGE_UNUSED 0xFD
> +/*
> + * The unused vmemmap range, which was not yet memset(PAGE_UNUSED) ranges
> + * from unused_pmd_start to next PMD_SIZE boundary.
> + */
> +static unsigned long unused_pmd_start __meminitdata;

This whole 'unused_pmd_start' thing was unmentioned in the changelog.

I also kept reading this and thinking it was a 'pmd_t *', not a 'struct
page *'.  The naming is rather unfortunate that way.

So, is this here so that the memset()s can be avoided?  It's just an
optimization to say: "This is unused, but instead of marking it with
PAGE_UNUSED (which would be slow) I keep a pointer to it"?

> +static void __meminit vmemmap_flush_unused_pmd(void)
> +{
> +	if (!unused_pmd_start)
> +		return;
> +	/*
> +	 * Clears (unused_pmd_start, PMD_END]
> +	 */
> +	memset((void *)unused_pmd_start, PAGE_UNUSED,
> +	       ALIGN(unused_pmd_start, PMD_SIZE) - unused_pmd_start);
> +	unused_pmd_start = 0;
> +}

Oh, and this means: "stop using the unused_pmd_start optimization.  Just
memset the thing".

> +/* Returns true if the PMD is completely unused and thus it can be freed */
> +static bool __meminit vmemmap_unuse_sub_pmd(unsigned long addr, unsigned long end)
> +{
> +	unsigned long start = ALIGN_DOWN(addr, PMD_SIZE);
> +
> +	vmemmap_flush_unused_pmd();

That probably needs a comment like:

	Flush the unused range cache to ensure that the memchr_inv()
	will work for the whole range.

> +	memset((void *)addr, PAGE_UNUSED, end - addr);
> +
> +	return !memchr_inv((void *)start, PAGE_UNUSED, PMD_SIZE);
> +}

Also, logically, it would make a lot of sense if you can move the actual
PMD freeing logic in here.  That way, the caller is just saying, "unuse
this PMD region", and then this takes care of the rest.  As it stands,
it's a bit weird that the caller takes care of the freeing.

> +static void __meminit __vmemmap_use_sub_pmd(unsigned long start)
> +{
> +	/*
> +	 * As we expect to add in the same granularity as we remove, it's
> +	 * sufficient to mark only some piece used to block the memmap page from
> +	 * getting removed when removing some other adjacent memmap (just in
> +	 * case the first memmap never gets initialized e.g., because the memory
> +	 * block never gets onlined).
> +	 */
> +	memset((void *)start, 0, sizeof(struct page));
> +}
> +
> +static void __meminit vmemmap_use_sub_pmd(unsigned long start, unsigned long end)
> +{
> +	/*
> +	 * We only optimize if the new used range directly follows the
> +	 * previously unused range (esp., when populating consecutive sections).
> +	 */
> +	if (unused_pmd_start == start) {
> +		if (likely(IS_ALIGNED(end, PMD_SIZE)))
> +			unused_pmd_start = 0;
> +		else
> +			unused_pmd_start = end;
> +		return;
> +	}
> +
> +	vmemmap_flush_unused_pmd();
> +	__vmemmap_use_sub_pmd(start);
> +}
> +
> +static void __meminit vmemmap_use_new_sub_pmd(unsigned long start, unsigned long end)
> +{
> +	vmemmap_flush_unused_pmd();
> +
> +	/*
> +	 * Could be our memmap page is filled with PAGE_UNUSED already from a
> +	 * previous remove.
> +	 */
> +	__vmemmap_use_sub_pmd(start);
> +
> +	/*
> +	 * Mark the unused parts of the new memmap range
> +	 */
> +	if (!IS_ALIGNED(start, PMD_SIZE))
> +		memset((void *)start, PAGE_UNUSED,
> +		       start - ALIGN_DOWN(start, PMD_SIZE));
> +	/*
> +	 * We want to avoid memset(PAGE_UNUSED) when populating the vmemmap of
> +	 * consecutive sections. Remember for the last added PMD the last
> +	 * unused range in the populated PMD.
> +	 */
> +	if (!IS_ALIGNED(end, PMD_SIZE))
> +		unused_pmd_start = end;
> +}
> +#endif
>  
>  static void __meminit free_pagetable(struct page *page, int order)
>  {
> @@ -1006,7 +1092,6 @@ remove_pmd_table(pmd_t *pmd_start, unsigned long addr, unsigned long end,
>  	unsigned long next, pages = 0;
>  	pte_t *pte_base;
>  	pmd_t *pmd;
> -	void *page_addr;
>  
>  	pmd = pmd_start + pmd_index(addr);
>  	for (; addr < end; addr = next, pmd++) {
> @@ -1027,12 +1112,11 @@ remove_pmd_table(pmd_t *pmd_start, unsigned long addr, unsigned long end,
>  				spin_unlock(&init_mm.page_table_lock);
>  				pages++;
>  			} else {
> -				/* If here, we are freeing vmemmap pages. */
> -				memset((void *)addr, PAGE_INUSE, next - addr);
> -
> -				page_addr = page_address(pmd_page(*pmd));
> -				if (!memchr_inv(page_addr, PAGE_INUSE,
> -						PMD_SIZE)) {
> +#ifdef CONFIG_SPARSEMEM_VMEMMAP
> +				/*
> +				 * Free the PMD if the whole range is unused.
> +				 */
> +				if (vmemmap_unuse_sub_pmd(addr, next)) {
>  					free_hugepage_table(pmd_page(*pmd),
>  							    altmap);
>  
> @@ -1040,6 +1124,7 @@ remove_pmd_table(pmd_t *pmd_start, unsigned long addr, unsigned long end,
>  					pmd_clear(pmd);
>  					spin_unlock(&init_mm.page_table_lock);
>  				}
> +#endif
>  			}

This doesn't look like the world's longest if() statement, but it might
be nice to use the IS_ENABLED() syntax instead of an #ifdef.  I suspect
the compiler could even make quick work of the static functions that
never get called as a result.
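
Something like this, perhaps (just a sketch; it assumes the helpers stay
visible, e.g. as empty stubs, when CONFIG_SPARSEMEM_VMEMMAP is off):

	if (IS_ENABLED(CONFIG_SPARSEMEM_VMEMMAP) &&
	    vmemmap_unuse_sub_pmd(addr, next)) {
		free_hugepage_table(pmd_page(*pmd), altmap);

		spin_lock(&init_mm.page_table_lock);
		pmd_clear(pmd);
		spin_unlock(&init_mm.page_table_lock);
	}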

>  			continue;
> @@ -1492,11 +1577,16 @@ static int __meminit vmemmap_populate_hugepages(unsigned long start,
>  
>  				addr_end = addr + PMD_SIZE;
>  				p_end = p + PMD_SIZE;
> +
> +				if (!IS_ALIGNED(addr, PMD_SIZE) ||
> +				    !IS_ALIGNED(next, PMD_SIZE))
> +					vmemmap_use_new_sub_pmd(addr, next);
>  				continue;
>  			} else if (altmap)
>  				return -ENOMEM; /* no fallback */
>  		} else if (pmd_large(*pmd)) {
>  			vmemmap_verify((pte_t *)pmd, node, addr, next);
> +			vmemmap_use_sub_pmd(addr, next);
>  			continue;
>  		}
>  		if (vmemmap_populate_basepages(addr, next, node, NULL))
> 

This overall looks like a good thing to do.  The implementation is even
pretty nice and simple. But, it took me an awfully long time to figure
out what was going on.

I wonder if you could take one more pass at these and especially see if
you can better explain the use of 'unused_pmd_start'.



* Re: [PATCH v4 3/3] x86/vmemmap: Handle unpopulated sub-pmd ranges
  2021-03-04 17:02   ` Dave Hansen
@ 2021-03-04 17:08     ` Dave Hansen
  2021-03-05 18:26       ` David Hildenbrand
  2021-03-08 18:43     ` Oscar Salvador
  1 sibling, 1 reply; 21+ messages in thread
From: Dave Hansen @ 2021-03-04 17:08 UTC (permalink / raw)
  To: Oscar Salvador, Andrew Morton
  Cc: David Hildenbrand, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Michal Hocko, linux-mm, linux-kernel

On 3/4/21 9:02 AM, Dave Hansen wrote:
>> +#define PAGE_UNUSED 0xFD
>> +/*
>> + * The unused vmemmap range, which was not yet memset(PAGE_UNUSED) ranges
>> + * from unused_pmd_start to next PMD_SIZE boundary.
>> + */
>> +static unsigned long unused_pmd_start __meminitdata;
> This whole 'unused_pmd_start' thing was unmentioned in the changelog.

One tiny suggestion: *Sometimes* for these optimizations, it's easiest
to write the code up without it in one patch, then add the optimization
in the next patch.

It makes it 100% clear what is part of the "core" algorithm and what is
pure optimization.

I don't know if it will work here, but it might be worth taking a look.



* Re: [PATCH v4 2/3] x86/vmemmap: Drop handling of 1GB vmemmap ranges
  2021-03-01  8:32 ` [PATCH v4 2/3] x86/vmemmap: Drop handling of 1GB vmemmap ranges Oscar Salvador
@ 2021-03-04 18:42   ` Dave Hansen
  2021-03-08 18:48     ` Oscar Salvador
  0 siblings, 1 reply; 21+ messages in thread
From: Dave Hansen @ 2021-03-04 18:42 UTC (permalink / raw)
  To: Oscar Salvador, Andrew Morton
  Cc: David Hildenbrand, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Michal Hocko, linux-mm, linux-kernel

On 3/1/21 12:32 AM, Oscar Salvador wrote:
> We never get to allocate 1GB pages when mapping the vmemmap range.
> Drop the dead code both for the aligned and unaligned cases and leave
> only the direct map handling.

Could you elaborate a bit on why 1GB pages are never used?  Is it just
unlikely to have a 64GB contiguous area of memory that needs 1GB of
contiguous vmemmap?  Or, does the fact that sections are smaller than
64GB keep this from happening?



* Re: [PATCH v4 3/3] x86/vmemmap: Handle unpopulated sub-pmd ranges
  2021-03-04 17:08     ` Dave Hansen
@ 2021-03-05 18:26       ` David Hildenbrand
  0 siblings, 0 replies; 21+ messages in thread
From: David Hildenbrand @ 2021-03-05 18:26 UTC (permalink / raw)
  To: Dave Hansen, Oscar Salvador, Andrew Morton
  Cc: Dave Hansen, Andy Lutomirski, Peter Zijlstra, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, x86, H . Peter Anvin, Michal Hocko,
	linux-mm, linux-kernel

On 04.03.21 18:08, Dave Hansen wrote:
> On 3/4/21 9:02 AM, Dave Hansen wrote:
>>> +#define PAGE_UNUSED 0xFD
>>> +/*
>>> + * The unused vmemmap range, which was not yet memset(PAGE_UNUSED) ranges
>>> + * from unused_pmd_start to next PMD_SIZE boundary.
>>> + */
>>> +static unsigned long unused_pmd_start __meminitdata;
>> This whole 'unused_pmd_start' thing was unmentioned in the changelog.
> 
> One tiny suggestion: *Sometimes* for these optimizations, it's easiest
> to write the code up without it in one patch, then add the optimization
> in the next patch.
> 
> It makes it 100% clear what is part of the "core" algorithm and what is
> pure optimization.
> 
> I don't know if it will work here, but it might be worth taking a look.
> 

For this reason the s390x part by me (see patch description) was two 
separate patches. Maybe it also makes sense to split it up here.

-- 
Thanks,

David / dhildenb




* Re: [PATCH v4 1/3] x86/vmemmap: Drop handling of 4K unaligned vmemmap range
  2021-03-04 15:50   ` Dave Hansen
@ 2021-03-08 18:20     ` Oscar Salvador
  2021-03-08 18:26       ` Dave Hansen
  0 siblings, 1 reply; 21+ messages in thread
From: Oscar Salvador @ 2021-03-08 18:20 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Andrew Morton, David Hildenbrand, Dave Hansen, Andy Lutomirski,
	Peter Zijlstra, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	x86, H . Peter Anvin, Michal Hocko, linux-mm, linux-kernel

On Thu, Mar 04, 2021 at 07:50:10AM -0800, Dave Hansen wrote:
> On 3/1/21 12:32 AM, Oscar Salvador wrote:
> > remove_pte_table() is prepared to handle the case where either the
> > start or the end of the range is not PAGE aligned.
> > This cannot actually happen:
> > 
> > __populate_section_memmap enforces the range to be PMD aligned,
> > so as long as the size of the struct page remains multiple of 8,
> > the vmemmap range will be aligned to PAGE_SIZE.
> > 
> > Drop the dead code and place a VM_BUG_ON in vmemmap_{populate,free}
> > to catch nasty cases.
> 
> I was wondering why the VM_BUG_ON()s went in vmemmap_free() instead of
> closer to the code that you modified in remove_pte_table().  I assume
> this was because vmemmap_free() is the only (indirect) caller of
> remove_pte_table().

Yes, that was pretty much the reason.
It seemed reasonable to me to fence it off at the "gate" rather than
deeper down.

Does it make sense to you? May I keep your Ack?

Thanks Dave!

> 
> Otherwise, this looks fine to me:
> 
> Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
> 

-- 
Oscar Salvador
SUSE L3



* Re: [PATCH v4 1/3] x86/vmemmap: Drop handling of 4K unaligned vmemmap range
  2021-03-08 18:20     ` Oscar Salvador
@ 2021-03-08 18:26       ` Dave Hansen
  0 siblings, 0 replies; 21+ messages in thread
From: Dave Hansen @ 2021-03-08 18:26 UTC (permalink / raw)
  To: Oscar Salvador
  Cc: Andrew Morton, David Hildenbrand, Dave Hansen, Andy Lutomirski,
	Peter Zijlstra, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	x86, H . Peter Anvin, Michal Hocko, linux-mm, linux-kernel

On 3/8/21 10:20 AM, Oscar Salvador wrote:
> On Thu, Mar 04, 2021 at 07:50:10AM -0800, Dave Hansen wrote:
>> On 3/1/21 12:32 AM, Oscar Salvador wrote:
>>> remove_pte_table() is prepared to handle the case where either the
>>> start or the end of the range is not PAGE aligned.
>>> This cannot actually happen:
>>>
>>> __populate_section_memmap enforces the range to be PMD aligned,
>>> so as long as the size of the struct page remains multiple of 8,
>>> the vmemmap range will be aligned to PAGE_SIZE.
>>>
>>> Drop the dead code and place a VM_BUG_ON in vmemmap_{populate,free}
>>> to catch nasty cases.
>> I was wondering why the VM_BUG_ON()s went in vmemmap_free() instead of
>> closer to the code that you modified in remove_pte_table().  I assume
>> this was because vmemmap_free() is the only (indirect) caller of
>> remove_pte_table().
> Yes, that was pretty much the reason.
> It seemed reasonable to me to fence it off at the "gate", and not further
> deep.
> 
> Does it make sense to you? May I keep your Ack?

Yep, makes sense.  If you rev the series, it would be nice to put that
in the changelog.  But, either way, please keep the Ack!



* Re: [PATCH v4 3/3] x86/vmemmap: Handle unpopulated sub-pmd ranges
  2021-03-04 17:02   ` Dave Hansen
  2021-03-04 17:08     ` Dave Hansen
@ 2021-03-08 18:43     ` Oscar Salvador
  2021-03-09  8:25       ` Oscar Salvador
  1 sibling, 1 reply; 21+ messages in thread
From: Oscar Salvador @ 2021-03-08 18:43 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Andrew Morton, David Hildenbrand, Dave Hansen, Andy Lutomirski,
	Peter Zijlstra, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	x86, H . Peter Anvin, Michal Hocko, linux-mm, linux-kernel

On Thu, Mar 04, 2021 at 09:02:36AM -0800, Dave Hansen wrote:
> On 3/1/21 12:32 AM, Oscar Salvador wrote:
> > When the size of a struct page is not multiple of 2MB, sections do
> > not span a PMD anymore and so when populating them some parts of the
> > PMD will remain unused.
> 
> Multiples of 2MB are 2MB, 4MB, 6MB, etc...
> 
> I think you meant here that 2MB must be a multiple of the 'struct page'
> size.  I don't think there are any non-power-of-2 factors of 2MB, so I
> think it's probably simpler and more accurate to say:
> 
> 	When sizeof(struct page) is not a power of 2...

Yes, I think this was a poor choice of words.
It was the other way around: the PMD must be a multiple of the struct page size.
"When sizeof(struct page) is not a power of 2..." definitely sounds better.

> > +static unsigned long unused_pmd_start __meminitdata;
> 
> This whole 'unused_pmd_start' thing was unmentioned in the changelog.

Sorry, I will expand some more.

> I also kept reading this and thinking it was a 'pmd_t *', not a 'struct
> page *'.  The naming is rather unfortunate that way.

Well, it is the pmd range.

> 
> So, is this here so that the memset()s can be avoided?  It's just an
> optimization to say: "This is unused, but instead of marking it with
> PAGE_UNUSED (which would be slow) I keep a pointer to it"?

Yes, it is an optimization that lets us avoid a memset() in case the sections
we are adding are consecutive.

> > +static void __meminit vmemmap_flush_unused_pmd(void)
> > +{
> > +	if (!unused_pmd_start)
> > +		return;
> > +	/*
> > +	 * Clears (unused_pmd_start, PMD_END]
> > +	 */
> > +	memset((void *)unused_pmd_start, PAGE_UNUSED,
> > +	       ALIGN(unused_pmd_start, PMD_SIZE) - unused_pmd_start);
> > +	unused_pmd_start = 0;
> > +}
> 
> Oh, and this means: "stop using the unused_pmd_start optimization.  Just
> memset the thing".

Yes, as stated above, this optimization only takes place as long as the sections
are being added consecutively.
As soon as that is not the case, we memset the last range recorded and reset
unused_pmd_start.
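
Roughly (a made-up trace, assuming each section's memmap does not fill a
whole PMD):

	populate section N     -> memmap ends mid-PMD, unused_pmd_start = end
	populate section N+1   -> starts exactly at unused_pmd_start, so no
	                          memset is needed; just advance/clear it
	populate anything else -> vmemmap_flush_unused_pmd() first memsets the
	                          remembered tail with PAGE_UNUSED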

> > +/* Returns true if the PMD is completely unused and thus it can be freed */
> > +static bool __meminit vmemmap_unuse_sub_pmd(unsigned long addr, unsigned long end)
> > +{
> > +	unsigned long start = ALIGN_DOWN(addr, PMD_SIZE);
> > +
> > +	vmemmap_flush_unused_pmd();
> 
> That probably needs a comment like:
> 
> 	Flush the unused range cache to ensure that the memchr_inv()
> 	will work for the whole range.

Sure, I will add a comment.

> 
> > +	memset((void *)addr, PAGE_UNUSED, end - addr);
> > +
> > +	return !memchr_inv((void *)start, PAGE_UNUSED, PMD_SIZE);
> > +}
> 
> Also, logically, it would make a lot of sense if you can move the actual
> PMD freeing logic in here.  That way, the caller is just saying, "unuse
> this PMD region", and then this takes care of the rest.  As it stands,
> it's a bit weird that the caller takes care of the freeing.

You mean to move the 

 free_hugepage_table(pmd_page(*pmd), altmap);
 spin_lock(&init_mm.page_table_lock);
 pmd_clear(pmd);
 spin_unlock(&init_mm.page_table_lock);

block in there?

Well, from where I see it, it is more like:

 if (is_the_range_unused())
  : if so, free everything

But I agree with you that it might make some sense to move it there.
Since I do not feel strongly about this, I will move it.
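
i.e. roughly something like this (just a sketch of the suggested shape,
not the final code):

	static void __meminit vmemmap_unuse_sub_pmd(pmd_t *pmd, unsigned long addr,
						    unsigned long end,
						    struct vmem_altmap *altmap)
	{
		unsigned long start = ALIGN_DOWN(addr, PMD_SIZE);

		vmemmap_flush_unused_pmd();
		memset((void *)addr, PAGE_UNUSED, end - addr);

		/* Only free the PMD once the whole range is unused. */
		if (memchr_inv((void *)start, PAGE_UNUSED, PMD_SIZE))
			return;

		free_hugepage_table(pmd_page(*pmd), altmap);
		spin_lock(&init_mm.page_table_lock);
		pmd_clear(pmd);
		spin_unlock(&init_mm.page_table_lock);
	}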

> > +#ifdef CONFIG_SPARSEMEM_VMEMMAP
> > +				/*
> > +				 * Free the PMD if the whole range is unused.
> > +				 */
> > +				if (vmemmap_unuse_sub_pmd(addr, next)) {
> >  					free_hugepage_table(pmd_page(*pmd),
> >  							    altmap);
> >  
> > @@ -1040,6 +1124,7 @@ remove_pmd_table(pmd_t *pmd_start, unsigned long addr, unsigned long end,
> >  					pmd_clear(pmd);
> >  					spin_unlock(&init_mm.page_table_lock);
> >  				}
> > +#endif
> >  			}
> 
> This doesn't look like the world's longest if() statement, but it might
> be nice to use the IS_ENABLED() syntax instead of an #ifdef.  I suspect
> the compiler could even make quick work of the static functions that
> never get called as a result.

Sure, I will replace the #ifdef with IS_ENABLED.

> 
> >  			continue;
> > @@ -1492,11 +1577,16 @@ static int __meminit vmemmap_populate_hugepages(unsigned long start,
> >  
> >  				addr_end = addr + PMD_SIZE;
> >  				p_end = p + PMD_SIZE;
> > +
> > +				if (!IS_ALIGNED(addr, PMD_SIZE) ||
> > +				    !IS_ALIGNED(next, PMD_SIZE))
> > +					vmemmap_use_new_sub_pmd(addr, next);
> >  				continue;
> >  			} else if (altmap)
> >  				return -ENOMEM; /* no fallback */
> >  		} else if (pmd_large(*pmd)) {
> >  			vmemmap_verify((pte_t *)pmd, node, addr, next);
> > +			vmemmap_use_sub_pmd(addr, next);
> >  			continue;
> >  		}
> >  		if (vmemmap_populate_basepages(addr, next, node, NULL))
> > 
> 
> This overall looks like a good thing to do.  The implementation is even
> pretty nice and simple. But, it took me an awfully long time to figure
> out what was going on.
> 
> I wonder if you could take one more pass at these and especially see if
> you can better explain the use of 'unused_pmd_start'.

In order to ease the review, I will split the core changes and the optimization,
placing the whole unused_pmd_start optimization in a different patch.

Thanks for the feedback Dave!


-- 
Oscar Salvador
SUSE L3



* Re: [PATCH v4 2/3] x86/vmemmap: Drop handling of 1GB vmemmap ranges
  2021-03-04 18:42   ` Dave Hansen
@ 2021-03-08 18:48     ` Oscar Salvador
  2021-03-08 19:25       ` David Hildenbrand
  0 siblings, 1 reply; 21+ messages in thread
From: Oscar Salvador @ 2021-03-08 18:48 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Andrew Morton, David Hildenbrand, Dave Hansen, Andy Lutomirski,
	Peter Zijlstra, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	x86, H . Peter Anvin, Michal Hocko, linux-mm, linux-kernel

On Thu, Mar 04, 2021 at 10:42:59AM -0800, Dave Hansen wrote:
> On 3/1/21 12:32 AM, Oscar Salvador wrote:
> > We never get to allocate 1GB pages when mapping the vmemmap range.
> > Drop the dead code both for the aligned and unaligned cases and leave
> > only the direct map handling.
> 
> Could you elaborate a bit on why 1GB pages are never used?  It is just
> unlikely to have a 64GB contiguous area of memory that needs 1GB of
> contiguous vmemmap?  Or, does the fact that sections are smaller than
> 64GB keeps this from happening?

AFAIK, the largest page size we populate the vmemmap with is 2MB, and, as
you pointed out, memory sections on x86_64 are 128M, which is way smaller
than what would be required to allocate 1GB of vmemmap pages.
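
For some rough numbers (assuming 4K base pages and a 64-byte struct page):

	1GB of vmemmap / 64 bytes = 16M struct pages  -> describes 64GB of memory
	128MB section             = 32768 struct pages -> only 2MB of vmemmap

so a single hotplugged section never comes close to needing a 1GB mapping.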

Am I missing something?


-- 
Oscar Salvador
SUSE L3



* Re: [PATCH v4 2/3] x86/vmemmap: Drop handling of 1GB vmemmap ranges
  2021-03-08 18:48     ` Oscar Salvador
@ 2021-03-08 19:25       ` David Hildenbrand
  0 siblings, 0 replies; 21+ messages in thread
From: David Hildenbrand @ 2021-03-08 19:25 UTC (permalink / raw)
  To: Oscar Salvador, Dave Hansen
  Cc: Andrew Morton, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin, Michal Hocko, linux-mm, linux-kernel

On 08.03.21 19:48, Oscar Salvador wrote:
> On Thu, Mar 04, 2021 at 10:42:59AM -0800, Dave Hansen wrote:
>> On 3/1/21 12:32 AM, Oscar Salvador wrote:
>>> We never get to allocate 1GB pages when mapping the vmemmap range.
>>> Drop the dead code both for the aligned and unaligned cases and leave
>>> only the direct map handling.
>>
>> Could you elaborate a bit on why 1GB pages are never used?  It is just
>> unlikely to have a 64GB contiguous area of memory that needs 1GB of
>> contiguous vmemmap?  Or, does the fact that sections are smaller than
>> 64GB keeps this from happening?
> 
> AFAIK, the biggest we populate vmemmap pages with is 2MB, plus the fact
> that as you pointed out, memory sections on x86_64 are 128M, which is
> way smaller than what would require to allocate a 1GB for vmemmap pages.
> 
> Am I missing something?

Right now, it is dead code that you are removing.

Just like for 2MB vmemmap pages, we would have to proactively populate 1G
pages when adding individual sections. You can easily waste a lot of memory.

Of course, one could also make a final pass over the tables to see where
it makes sense to form 1GB pages.

But then, we would need quite some logic when removing individual 
sections (e.g., a 128 MB DIMM) - and I remember there are corner cases 
where we might have to remove boot memory ...

Long story short, I don't think 1G vmemmap pages are really worth the 
trouble.

-- 
Thanks,

David / dhildenb




* Re: [PATCH v4 3/3] x86/vmemmap: Handle unpopulated sub-pmd ranges
  2021-03-08 18:43     ` Oscar Salvador
@ 2021-03-09  8:25       ` Oscar Salvador
  2021-03-09 14:50         ` Dave Hansen
  0 siblings, 1 reply; 21+ messages in thread
From: Oscar Salvador @ 2021-03-09  8:25 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Andrew Morton, David Hildenbrand, Dave Hansen, Andy Lutomirski,
	Peter Zijlstra, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	x86, H . Peter Anvin, Michal Hocko, linux-mm, linux-kernel

On Mon, Mar 08, 2021 at 07:43:30PM +0100, Oscar Salvador wrote:
> On Thu, Mar 04, 2021 at 09:02:36AM -0800, Dave Hansen wrote:
> > Also, logically, it would make a lot of sense if you can move the actual
> > PMD freeing logic in here.  That way, the caller is just saying, "unuse
> > this PMD region", and then this takes care of the rest.  As it stands,
> > it's a bit weird that the caller takes care of the freeing.
> 
> You mean to move the 
> 
>  free_hugepage_table(pmd_page(*pmd), altmap);
>  spin_lock(&init_mm.page_table_lock);
>  pmd_clear(pmd);
>  spin_unlock(&init_mm.page_table_lock);
> 
> block in there?
> 
> Well, from where I see it, it is more like:
> 
>  if (is_the_range_unused())
>   : if so, free everything
> 
> But I agree with you what it might make some sense to move it there.
> Since I do not feel strong about this, I will move it.

hi Dave,

So, after splitting this patch and re-shaping it to address all the
feedback, I am still not sure about this one.
Honestly, I think the freeing logic would be better off kept in
remove_pmd_table.

The reason for me is that vmemmap_unuse_sub_pmd only 1) marks the range
to be removed as unused and 2) checks whether after that, the whole
PMD range is unused.

I think the confusion comes from the name.
"vmemmap_pmd_is_unused" might be a better fit?

What do you think? Do you feel strongly about moving the logic in there
regardless of the name?


-- 
Oscar Salvador
SUSE L3



* Re: [PATCH v4 3/3] x86/vmemmap: Handle unpopulated sub-pmd ranges
  2021-03-09  8:25       ` Oscar Salvador
@ 2021-03-09 14:50         ` Dave Hansen
  0 siblings, 0 replies; 21+ messages in thread
From: Dave Hansen @ 2021-03-09 14:50 UTC (permalink / raw)
  To: Oscar Salvador
  Cc: Andrew Morton, David Hildenbrand, Dave Hansen, Andy Lutomirski,
	Peter Zijlstra, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	x86, H . Peter Anvin, Michal Hocko, linux-mm, linux-kernel

On 3/9/21 12:25 AM, Oscar Salvador wrote:
> 
> I think the confusion comes from the name.
> "vmemmap_pmd_is_unused" might be a better fit?
> 
> What do you think? Do you feel strongly about moving the logic in there
> regardless of the name?

No, not really.  The name is probably worth adjusting, but I think the
code can probably stay put.



* Re: [PATCH v4 3/3] x86/vmemmap: Handle unpopulated sub-pmd ranges
  2021-03-01  8:32 ` [PATCH v4 3/3] x86/vmemmap: Handle unpopulated sub-pmd ranges Oscar Salvador
  2021-03-04 17:02   ` Dave Hansen
@ 2021-03-09 19:39   ` Zi Yan
  2021-03-09 21:26     ` Oscar Salvador
  1 sibling, 1 reply; 21+ messages in thread
From: Zi Yan @ 2021-03-09 19:39 UTC (permalink / raw)
  To: Oscar Salvador
  Cc: Andrew Morton, David Hildenbrand, Dave Hansen, Andy Lutomirski,
	Peter Zijlstra, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	x86, H . Peter Anvin, Michal Hocko, linux-mm, linux-kernel


On 1 Mar 2021, at 3:32, Oscar Salvador wrote:

> When the size of a struct page is not multiple of 2MB, sections do
> not span a PMD anymore and so when populating them some parts of the
> PMD will remain unused.
> Because of this, PMDs will be left behind when depopulating sections
> since remove_pmd_table() thinks that those unused parts are still in
> use.
>
> Fix this by marking the unused parts with PAGE_UNUSED, so memchr_inv()
> will do the right thing and will let us free the PMD when the last user
> of it is gone.
>
> This patch is based on a similar patch by David Hildenbrand:
>
> https://lore.kernel.org/linux-mm/20200722094558.9828-9-david@redhat.com/
> https://lore.kernel.org/linux-mm/20200722094558.9828-10-david@redhat.com/
>
> Signed-off-by: Oscar Salvador <osalvador@suse.de>
> Reviewed-by: David Hildenbrand <david@redhat.com>
> ---
>  arch/x86/mm/init_64.c | 106 ++++++++++++++++++++++++++++++++++++++++++++++----
>  1 file changed, 98 insertions(+), 8 deletions(-)
>
> diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
> index 9ecb3c488ac8..7e8de63f02b3 100644
> --- a/arch/x86/mm/init_64.c
> +++ b/arch/x86/mm/init_64.c
> @@ -871,7 +871,93 @@ int arch_add_memory(int nid, u64 start, u64 size,
>  	return add_pages(nid, start_pfn, nr_pages, params);
>  }
>
> -#define PAGE_INUSE 0xFD
> +#ifdef CONFIG_SPARSEMEM_VMEMMAP
> +#define PAGE_UNUSED 0xFD
> +
> +/*
> + * The unused vmemmap range, which was not yet memset(PAGE_UNUSED) ranges
> + * from unused_pmd_start to next PMD_SIZE boundary.
> + */
> +static unsigned long unused_pmd_start __meminitdata;
> +
> +static void __meminit vmemmap_flush_unused_pmd(void)
> +{
> +	if (!unused_pmd_start)
> +		return;
> +	/*
> +	 * Clears (unused_pmd_start, PMD_END]
> +	 */
> +	memset((void *)unused_pmd_start, PAGE_UNUSED,
> +	       ALIGN(unused_pmd_start, PMD_SIZE) - unused_pmd_start);
> +	unused_pmd_start = 0;
> +}
> +
> +/* Returns true if the PMD is completely unused and thus it can be freed */
> +static bool __meminit vmemmap_unuse_sub_pmd(unsigned long addr, unsigned long end)
> +{
> +	unsigned long start = ALIGN_DOWN(addr, PMD_SIZE);
> +
> +	vmemmap_flush_unused_pmd();
> +	memset((void *)addr, PAGE_UNUSED, end - addr);
> +
> +	return !memchr_inv((void *)start, PAGE_UNUSED, PMD_SIZE);
> +}
> +
> +static void __meminit __vmemmap_use_sub_pmd(unsigned long start)
> +{
> +	/*
> +	 * As we expect to add in the same granularity as we remove, it's
> +	 * sufficient to mark only some piece used to block the memmap page from
> +	 * getting removed when removing some other adjacent memmap (just in
> +	 * case the first memmap never gets initialized e.g., because the memory
> +	 * block never gets onlined).
> +	 */
> +	memset((void *)start, 0, sizeof(struct page));
> +}
> +
> +static void __meminit vmemmap_use_sub_pmd(unsigned long start, unsigned long end)
> +{
> +	/*
> +	 * We only optimize if the new used range directly follows the
> +	 * previously unused range (esp., when populating consecutive sections).
> +	 */
> +	if (unused_pmd_start == start) {
> +		if (likely(IS_ALIGNED(end, PMD_SIZE)))
> +			unused_pmd_start = 0;
> +		else
> +			unused_pmd_start = end;
> +		return;
> +	}
> +
> +	vmemmap_flush_unused_pmd();
> +	__vmemmap_use_sub_pmd(start);
> +}
> +
> +static void __meminit vmemmap_use_new_sub_pmd(unsigned long start, unsigned long end)
> +{
> +	vmemmap_flush_unused_pmd();
> +
> +	/*
> +	 * Could be our memmap page is filled with PAGE_UNUSED already from a
> +	 * previous remove.
> +	 */
> +	__vmemmap_use_sub_pmd(start);
> +
> +	/*
> +	 * Mark the unused parts of the new memmap range
> +	 */
> +	if (!IS_ALIGNED(start, PMD_SIZE))
> +		memset((void *)start, PAGE_UNUSED,
> +		       start - ALIGN_DOWN(start, PMD_SIZE));
> +	/*
> +	 * We want to avoid memset(PAGE_UNUSED) when populating the vmemmap of
> +	 * consecutive sections. Remember for the last added PMD the last
> +	 * unused range in the populated PMD.
> +	 */
> +	if (!IS_ALIGNED(end, PMD_SIZE))
> +		unused_pmd_start = end;
> +}
> +#endif
>
>  static void __meminit free_pagetable(struct page *page, int order)
>  {
> @@ -1006,7 +1092,6 @@ remove_pmd_table(pmd_t *pmd_start, unsigned long addr, unsigned long end,
>  	unsigned long next, pages = 0;
>  	pte_t *pte_base;
>  	pmd_t *pmd;
> -	void *page_addr;
>
>  	pmd = pmd_start + pmd_index(addr);
>  	for (; addr < end; addr = next, pmd++) {
> @@ -1027,12 +1112,11 @@ remove_pmd_table(pmd_t *pmd_start, unsigned long addr, unsigned long end,
>  				spin_unlock(&init_mm.page_table_lock);
>  				pages++;
>  			} else {
> -				/* If here, we are freeing vmemmap pages. */
> -				memset((void *)addr, PAGE_INUSE, next - addr);
> -
> -				page_addr = page_address(pmd_page(*pmd));
> -				if (!memchr_inv(page_addr, PAGE_INUSE,
> -						PMD_SIZE)) {
> +#ifdef CONFIG_SPARSEMEM_VMEMMAP
> +				/*
> +				 * Free the PMD if the whole range is unused.
> +				 */
> +				if (vmemmap_unuse_sub_pmd(addr, next)) {
>  					free_hugepage_table(pmd_page(*pmd),
>  							    altmap);
>
> @@ -1040,6 +1124,7 @@ remove_pmd_table(pmd_t *pmd_start, unsigned long addr, unsigned long end,
>  					pmd_clear(pmd);
>  					spin_unlock(&init_mm.page_table_lock);
>  				}
> +#endif
>  			}
>
>  			continue;
> @@ -1492,11 +1577,16 @@ static int __meminit vmemmap_populate_hugepages(unsigned long start,
>
>  				addr_end = addr + PMD_SIZE;
>  				p_end = p + PMD_SIZE;
> +
> +				if (!IS_ALIGNED(addr, PMD_SIZE) ||
> +				    !IS_ALIGNED(next, PMD_SIZE))
> +					vmemmap_use_new_sub_pmd(addr, next);
>  				continue;
>  			} else if (altmap)
>  				return -ENOMEM; /* no fallback */
>  		} else if (pmd_large(*pmd)) {
>  			vmemmap_verify((pte_t *)pmd, node, addr, next);
> +			vmemmap_use_sub_pmd(addr, next);

Hi Oscar,

vmemmap_use_new_sub_pmd() and vmemmap_use_sub_pmd() are not defined when
CONFIG_SPARSEMEM_VMEMMAP is set but CONFIG_MEMORY_HOTPLUG is not, which
leads to compilation errors.
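
If I read the surrounding #ifdefs right, that is because the new helpers
are added next to arch_add_memory(), i.e. inside the CONFIG_MEMORY_HOTPLUG
block, while vmemmap_populate_hugepages() is built whenever
CONFIG_SPARSEMEM_VMEMMAP is set. Roughly (structure only, not real code):

	#ifdef CONFIG_MEMORY_HOTPLUG
	 #ifdef CONFIG_SPARSEMEM_VMEMMAP
	 /* vmemmap_use_sub_pmd() and friends are defined here */
	 #endif
	#endif /* CONFIG_MEMORY_HOTPLUG */

	/* ... but they are called from vmemmap_populate_hugepages(), which
	 * only needs CONFIG_SPARSEMEM_VMEMMAP */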


—
Best Regards,
Yan Zi



* Re: [PATCH v4 3/3] x86/vmemmap: Handle unpopulated sub-pmd ranges
  2021-03-09 19:39   ` Zi Yan
@ 2021-03-09 21:26     ` Oscar Salvador
  0 siblings, 0 replies; 21+ messages in thread
From: Oscar Salvador @ 2021-03-09 21:26 UTC (permalink / raw)
  To: Zi Yan
  Cc: Andrew Morton, David Hildenbrand, Dave Hansen, Andy Lutomirski,
	Peter Zijlstra, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	x86, H . Peter Anvin, Michal Hocko, linux-mm, linux-kernel

On Tue, Mar 09, 2021 at 02:39:02PM -0500, Zi Yan wrote:
> Hi Oscar,
> 
> vmemmap_use_new_sub_pmd() and vmemmap_use_sub_pmd() are not defined when
> CONFIG_SPARSEMEM_VMEMMAP is set but CONFIG_MEMORY_HOTPLUG is not, which
> leads to compilation errors.

Meh, yeah, that's right.

I fixed that up, I will send a v6.

Thanks for noticing!



-- 
Oscar Salvador
SUSE L3


