* [PATCH 0/5] VA allocator fixes
@ 2017-11-06 10:03 Nicholas Piggin
  2017-11-06 10:03 ` [PATCH 1/5] powerpc/64s/hash: Fix 128TB-512TB virtual address boundary case allocation Nicholas Piggin
                   ` (5 more replies)
  0 siblings, 6 replies; 22+ messages in thread
From: Nicholas Piggin @ 2017-11-06 10:03 UTC
  To: linuxppc-dev; +Cc: Nicholas Piggin, Aneesh Kumar K . V, Florian Weimer

Florian found a nasty corner case in the VA allocation logic
for crossing from the 128TB to the 512TB limit on hash, and made a
really superb report of the problem -- traces, reproducer recipes,
analysis, etc. -- which already mostly solved it.

The first patch in the series should solve Florian's particular
case; the next 3 address other issues with addr_limit. The last
patch is technically a cleanup, but I think it's fairly important
for understanding the code, and it also enables some BUG
checks (when addr_limit == 0).

I have not tested these on Florian's exact test case, but
some tests of my own behave better afterwards. Hopefully he has
time to re-test. Some careful review would be welcome too.

Thanks,
Nick

Nicholas Piggin (5):
  powerpc/64s/hash: Fix 128TB-512TB virtual address boundary case
    allocation
  powerpc/64s/hash: Allow MAP_FIXED allocations to cross 128TB boundary
  powerpc/64s/hash: Fix fork() with 512TB process address space
  powerpc/64s/radix: Fix 128TB-512TB virtual address boundary case
    allocation
  powerpc/64s: mm_context.addr_limit is only used on hash

 arch/powerpc/include/asm/book3s/64/mmu-hash.h |  2 +-
 arch/powerpc/include/asm/book3s/64/mmu.h      |  2 +-
 arch/powerpc/include/asm/paca.h               |  2 +-
 arch/powerpc/kernel/asm-offsets.c             |  2 +-
 arch/powerpc/kernel/paca.c                    |  4 ++--
 arch/powerpc/kernel/setup-common.c            |  3 ++-
 arch/powerpc/mm/hugetlbpage-radix.c           | 14 +++++-------
 arch/powerpc/mm/mmap.c                        | 31 ++++++++++----------------
 arch/powerpc/mm/mmu_context_book3s64.c        |  8 +++----
 arch/powerpc/mm/slb_low.S                     |  2 +-
 arch/powerpc/mm/slice.c                       | 32 ++++++++++++++-------------
 11 files changed, 48 insertions(+), 54 deletions(-)

-- 
2.15.0


* [PATCH 1/5] powerpc/64s/hash: Fix 128TB-512TB virtual address boundary case allocation
  2017-11-06 10:03 [PATCH 0/5] VA allocator fixes Nicholas Piggin
@ 2017-11-06 10:03 ` Nicholas Piggin
  2017-11-06 10:38   ` Aneesh Kumar K.V
  2017-11-06 10:03 ` [PATCH 2/5] powerpc/64s/hash: Allow MAP_FIXED allocations to cross 128TB boundary Nicholas Piggin
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 22+ messages in thread
From: Nicholas Piggin @ 2017-11-06 10:03 UTC
  To: linuxppc-dev; +Cc: Nicholas Piggin, Aneesh Kumar K . V, Florian Weimer

When allocating VA space with a hint that crosses 128TB, the SLB addr_limit
variable is not expanded if addr is not > 128TB, but the slice allocation
looks at task_size, which is 512TB. This results in slice_check_fit()
incorrectly succeeding because the slice_count truncates off bit 128 of the
requested mask, so the comparison to the available mask succeeds.
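
As a minimal standalone sketch of the truncation (plain userspace C,
not the kernel code; it ignores low slices and assumes 1TB high slices
as on hash):

#include <stdbool.h>
#include <stdio.h>

#define TB (1UL << 40)
#define SLICE_NUM_HIGH 512			/* 512TB / 1TB slices */
#define GET_HIGH_SLICE_INDEX(limit)	((limit) / TB)

/* compare only the first 'count' slices, as the kernel's
 * slice_check_fit() does via GET_HIGH_SLICE_INDEX(addr_limit) */
static bool check_fit(const bool *req, const bool *avail,
		      unsigned long count)
{
	for (unsigned long i = 0; i < count; i++)
		if (req[i] && !avail[i])
			return false;
	return true;
}

int main(void)
{
	bool req[SLICE_NUM_HIGH] = { false };
	bool avail[SLICE_NUM_HIGH] = { false };
	unsigned long addr_limit = 128 * TB;	/* not expanded */
	int i;

	for (i = 0; i < 128; i++)
		avail[i] = true;		/* slices below 128TB */
	req[127] = true;			/* request crosses ... */
	req[128] = true;			/* ... the 128TB boundary */

	/* count is 128, so req[128] is never examined and the fit
	 * check succeeds even though slice 128 is not available */
	printf("fit: %d\n", check_fit(req, avail,
				      GET_HIGH_SLICE_INDEX(addr_limit)));
	return 0;
}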

Fix this by using mm->context.addr_limit instead of mm->task_size for
testing allocation limits. This causes such allocations to fail.

Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Fixes: f4ea6dcb08 ("powerpc/mm: Enable mappings above 128TB")
Reported-by: Florian Weimer <fweimer@redhat.com>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
 arch/powerpc/mm/slice.c | 11 ++++++-----
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/mm/slice.c b/arch/powerpc/mm/slice.c
index 45f6740dd407..567db541c0a1 100644
--- a/arch/powerpc/mm/slice.c
+++ b/arch/powerpc/mm/slice.c
@@ -96,7 +96,7 @@ static int slice_area_is_free(struct mm_struct *mm, unsigned long addr,
 {
 	struct vm_area_struct *vma;
 
-	if ((mm->task_size - len) < addr)
+	if ((mm->context.addr_limit - len) < addr)
 		return 0;
 	vma = find_vma(mm, addr);
 	return (!vma || (addr + len) <= vm_start_gap(vma));
@@ -133,7 +133,7 @@ static void slice_mask_for_free(struct mm_struct *mm, struct slice_mask *ret)
 		if (!slice_low_has_vma(mm, i))
 			ret->low_slices |= 1u << i;
 
-	if (mm->task_size <= SLICE_LOW_TOP)
+	if (mm->context.addr_limit <= SLICE_LOW_TOP)
 		return;
 
 	for (i = 0; i < GET_HIGH_SLICE_INDEX(mm->context.addr_limit); i++)
@@ -446,19 +446,20 @@ unsigned long slice_get_unmapped_area(unsigned long addr, unsigned long len,
 
 	/* Sanity checks */
 	BUG_ON(mm->task_size == 0);
+	BUG_ON(mm->context.addr_limit == 0);
 	VM_BUG_ON(radix_enabled());
 
 	slice_dbg("slice_get_unmapped_area(mm=%p, psize=%d...\n", mm, psize);
 	slice_dbg(" addr=%lx, len=%lx, flags=%lx, topdown=%d\n",
 		  addr, len, flags, topdown);
 
-	if (len > mm->task_size)
+	if (len > mm->context.addr_limit)
 		return -ENOMEM;
 	if (len & ((1ul << pshift) - 1))
 		return -EINVAL;
 	if (fixed && (addr & ((1ul << pshift) - 1)))
 		return -EINVAL;
-	if (fixed && addr > (mm->task_size - len))
+	if (fixed && addr > (mm->context.addr_limit - len))
 		return -ENOMEM;
 
 	/* If hint, make sure it matches our alignment restrictions */
@@ -466,7 +467,7 @@ unsigned long slice_get_unmapped_area(unsigned long addr, unsigned long len,
 		addr = _ALIGN_UP(addr, 1ul << pshift);
 		slice_dbg(" aligned addr=%lx\n", addr);
 		/* Ignore hint if it's too large or overlaps a VMA */
-		if (addr > mm->task_size - len ||
+		if (addr > mm->context.addr_limit - len ||
 		    !slice_area_is_free(mm, addr, len))
 			addr = 0;
 	}
-- 
2.15.0


* [PATCH 2/5] powerpc/64s/hash: Allow MAP_FIXED allocations to cross 128TB boundary
  2017-11-06 10:03 [PATCH 0/5] VA allocator fixes Nicholas Piggin
  2017-11-06 10:03 ` [PATCH 1/5] powerpc/64s/hash: Fix 128TB-512TB virtual address boundary case allocation Nicholas Piggin
@ 2017-11-06 10:03 ` Nicholas Piggin
  2017-11-06 10:44   ` Aneesh Kumar K.V
  2017-11-06 10:03 ` [PATCH 3/5] powerpc/64s/hash: Fix fork() with 512TB process address space Nicholas Piggin
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 22+ messages in thread
From: Nicholas Piggin @ 2017-11-06 10:03 UTC
  To: linuxppc-dev; +Cc: Nicholas Piggin, Aneesh Kumar K . V, Florian Weimer

While mapping hints with a length that crosses 128TB are disallowed,
MAP_FIXED allocations that cross 128TB are allowed. These are failing
on hash (on radix they succeed). Add an additional case for fixed
mappings to expand the addr_limit when crossing 128TB.
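
For concreteness, a hypothetical userspace sketch of the case this
patch makes succeed (assumes a powerpc64 hash kernel with the 128TB
default map window; addresses are illustrative):

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
	unsigned long boundary = 1UL << 47;	/* 128TB */
	/* fixed mapping straddling the boundary:
	 * addr < 128TB, addr + len > 128TB */
	void *p = mmap((void *)(boundary - 65536), 2 * 65536,
		       PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED,
		       -1, 0);

	printf("mmap: %p\n", p);	/* MAP_FAILED before this patch */
	return p == MAP_FAILED;
}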

Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Fixes: f4ea6dcb08 ("powerpc/mm: Enable mappings above 128TB")
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
 arch/powerpc/mm/slice.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/mm/slice.c b/arch/powerpc/mm/slice.c
index 567db541c0a1..f980397b449d 100644
--- a/arch/powerpc/mm/slice.c
+++ b/arch/powerpc/mm/slice.c
@@ -419,7 +419,8 @@ unsigned long slice_get_unmapped_area(unsigned long addr, unsigned long len,
 	/*
 	 * Check if we need to expland slice area.
 	 */
-	if (unlikely(addr > mm->context.addr_limit &&
+	if (unlikely(((addr > mm->context.addr_limit) ||
+			(fixed && addr + len > mm->context.addr_limit)) &&
 		     mm->context.addr_limit != TASK_SIZE)) {
 		mm->context.addr_limit = TASK_SIZE;
 		on_each_cpu(slice_flush_segments, mm, 1);
-- 
2.15.0


* [PATCH 3/5] powerpc/64s/hash: Fix fork() with 512TB process address space
  2017-11-06 10:03 [PATCH 0/5] VA allocator fixes Nicholas Piggin
  2017-11-06 10:03 ` [PATCH 1/5] powerpc/64s/hash: Fix 128TB-512TB virtual address boundary case allocation Nicholas Piggin
  2017-11-06 10:03 ` [PATCH 2/5] powerpc/64s/hash: Allow MAP_FIXED allocations to cross 128TB boundary Nicholas Piggin
@ 2017-11-06 10:03 ` Nicholas Piggin
  2017-11-06 10:44   ` Aneesh Kumar K.V
  2017-11-06 10:03 ` [PATCH 4/5] powerpc/64s/radix: Fix 128TB-512TB virtual address boundary case allocation Nicholas Piggin
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 22+ messages in thread
From: Nicholas Piggin @ 2017-11-06 10:03 UTC
  To: linuxppc-dev; +Cc: Nicholas Piggin, Aneesh Kumar K . V, Florian Weimer

Hash unconditionally resets the addr_limit to default (128TB) when
the mm context is initialised. If a process has > 128TB mappings when
it forks, the child will not get the 512TB addr_limit, so accesses to
valid > 128TB mappings will fail in the child.

Fix this by only resetting the addr_limit to default if it was 0. Non-zero
indicates it was duplicated from the parent (0 means exec()).
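
A hypothetical sketch of the failure mode (assumes a powerpc64 hash
kernel; the hint address is illustrative):

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
	/* a hint above 128TB expands the parent's addr_limit */
	char *p = mmap((void *)(1UL << 48), 65536,
		       PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		return 1;
	strcpy(p, "parent");
	if (fork() == 0) {
		/* without the fix, the child's addr_limit was reset
		 * to 128TB, so this access faulted */
		printf("child reads: %s\n", p);
		_exit(0);
	}
	wait(NULL);
	return 0;
}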

Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Fixes: f4ea6dcb08 ("powerpc/mm: Enable mappings above 128TB")
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
 arch/powerpc/mm/mmu_context_book3s64.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/mm/mmu_context_book3s64.c b/arch/powerpc/mm/mmu_context_book3s64.c
index 05e15386d4cb..b94fb62e60fd 100644
--- a/arch/powerpc/mm/mmu_context_book3s64.c
+++ b/arch/powerpc/mm/mmu_context_book3s64.c
@@ -93,11 +93,11 @@ static int hash__init_new_context(struct mm_struct *mm)
 		return index;
 
 	/*
-	 * We do switch_slb() early in fork, even before we setup the
-	 * mm->context.addr_limit. Default to max task size so that we copy the
-	 * default values to paca which will help us to handle slb miss early.
+	 * In the case of exec, use the default limit,
+	 * otherwise inherit it from the mm we are duplicating.
 	 */
-	mm->context.addr_limit = DEFAULT_MAP_WINDOW_USER64;
+	if (!mm->context.addr_limit)
+		mm->context.addr_limit = DEFAULT_MAP_WINDOW_USER64;
 
 	/*
 	 * The old code would re-promote on fork, we don't do that when using
-- 
2.15.0


* [PATCH 4/5] powerpc/64s/radix: Fix 128TB-512TB virtual address boundary case allocation
  2017-11-06 10:03 [PATCH 0/5] VA allocator fixes Nicholas Piggin
                   ` (2 preceding siblings ...)
  2017-11-06 10:03 ` [PATCH 3/5] powerpc/64s/hash: Fix fork() with 512TB process address space Nicholas Piggin
@ 2017-11-06 10:03 ` Nicholas Piggin
  2017-11-06 11:14   ` Aneesh Kumar K.V
  2017-11-06 10:03 ` [PATCH 5/5] powerpc/64s: mm_context.addr_limit is only used on hash Nicholas Piggin
  2017-11-06 15:16 ` [PATCH 0/5] VA allocator fixes Florian Weimer
  5 siblings, 1 reply; 22+ messages in thread
From: Nicholas Piggin @ 2017-11-06 10:03 UTC
  To: linuxppc-dev; +Cc: Nicholas Piggin, Aneesh Kumar K . V, Florian Weimer

Radix VA space allocations test addresses against mm->task_size which is
512TB, even in cases where the intention is to limit allocation to below
128TB.

This results in mmap with a hint address below 128TB but address + length
above 128TB succeeding when it should fail (as hash does after the
previous patch).

Set the high address limit to be considered up front, and base subsequent
allocation checks on that consistently.
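
To illustrate the hint case this changes on radix (a hypothetical
sketch, again assuming the 128TB default map window):

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
	/* hint below 128TB, but hint + len crosses the boundary */
	void *hint = (void *)((1UL << 47) - 65536);
	void *p = mmap(hint, 2 * 65536, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	/* before: the hint is honoured and the mapping silently
	 * crosses 128TB; after: the hint is refused and the
	 * mapping falls back below the boundary */
	printf("mmap: %p\n", p);
	return 0;
}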

Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Fixes: f4ea6dcb08 ("powerpc/mm: Enable mappings above 128TB")
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
 arch/powerpc/mm/hugetlbpage-radix.c | 13 +++++++------
 arch/powerpc/mm/mmap.c              | 27 ++++++++++++++-------------
 2 files changed, 21 insertions(+), 19 deletions(-)

diff --git a/arch/powerpc/mm/hugetlbpage-radix.c b/arch/powerpc/mm/hugetlbpage-radix.c
index a12e86395025..9c6a411e9c85 100644
--- a/arch/powerpc/mm/hugetlbpage-radix.c
+++ b/arch/powerpc/mm/hugetlbpage-radix.c
@@ -48,14 +48,18 @@ radix__hugetlb_get_unmapped_area(struct file *file, unsigned long addr,
 	struct mm_struct *mm = current->mm;
 	struct vm_area_struct *vma;
 	struct hstate *h = hstate_file(file);
+	unsigned long high_limit = DEFAULT_MAP_WINDOW;
 	struct vm_unmapped_area_info info;
 
 	if (unlikely(addr > mm->context.addr_limit && addr < TASK_SIZE))
 		mm->context.addr_limit = TASK_SIZE;
 
+	if (addr > high_limit)
+		high_limit = TASK_SIZE;
+
 	if (len & ~huge_page_mask(h))
 		return -EINVAL;
-	if (len > mm->task_size)
+	if (len > high_limit)
 		return -ENOMEM;
 
 	if (flags & MAP_FIXED) {
@@ -67,7 +71,7 @@ radix__hugetlb_get_unmapped_area(struct file *file, unsigned long addr,
 	if (addr) {
 		addr = ALIGN(addr, huge_page_size(h));
 		vma = find_vma(mm, addr);
-		if (mm->task_size - len >= addr &&
+		if (high_limit - len >= addr &&
 		    (!vma || addr + len <= vm_start_gap(vma)))
 			return addr;
 	}
@@ -78,12 +82,9 @@ radix__hugetlb_get_unmapped_area(struct file *file, unsigned long addr,
 	info.flags = VM_UNMAPPED_AREA_TOPDOWN;
 	info.length = len;
 	info.low_limit = PAGE_SIZE;
-	info.high_limit = current->mm->mmap_base;
+	info.high_limit = mm->mmap_base + (high_limit - DEFAULT_MAP_WINDOW);
 	info.align_mask = PAGE_MASK & ~huge_page_mask(h);
 	info.align_offset = 0;
 
-	if (addr > DEFAULT_MAP_WINDOW)
-		info.high_limit += mm->context.addr_limit - DEFAULT_MAP_WINDOW;
-
 	return vm_unmapped_area(&info);
 }
diff --git a/arch/powerpc/mm/mmap.c b/arch/powerpc/mm/mmap.c
index 5d78b193fec4..e6cb3b3f7e93 100644
--- a/arch/powerpc/mm/mmap.c
+++ b/arch/powerpc/mm/mmap.c
@@ -106,13 +106,17 @@ radix__arch_get_unmapped_area(struct file *filp, unsigned long addr,
 {
 	struct mm_struct *mm = current->mm;
 	struct vm_area_struct *vma;
+	unsigned long high_limit = DEFAULT_MAP_WINDOW;
 	struct vm_unmapped_area_info info;
 
 	if (unlikely(addr > mm->context.addr_limit &&
 		     mm->context.addr_limit != TASK_SIZE))
 		mm->context.addr_limit = TASK_SIZE;
 
-	if (len > mm->task_size - mmap_min_addr)
+	if (addr > high_limit)
+		high_limit = TASK_SIZE;
+
+	if (len > high_limit - mmap_min_addr)
 		return -ENOMEM;
 
 	if (flags & MAP_FIXED)
@@ -121,7 +125,7 @@ radix__arch_get_unmapped_area(struct file *filp, unsigned long addr,
 	if (addr) {
 		addr = PAGE_ALIGN(addr);
 		vma = find_vma(mm, addr);
-		if (mm->task_size - len >= addr && addr >= mmap_min_addr &&
+		if (high_limit - len >= addr && addr >= mmap_min_addr &&
 		    (!vma || addr + len <= vm_start_gap(vma)))
 			return addr;
 	}
@@ -129,13 +133,9 @@ radix__arch_get_unmapped_area(struct file *filp, unsigned long addr,
 	info.flags = 0;
 	info.length = len;
 	info.low_limit = mm->mmap_base;
+	info.high_limit = high_limit;
 	info.align_mask = 0;
 
-	if (unlikely(addr > DEFAULT_MAP_WINDOW))
-		info.high_limit = mm->context.addr_limit;
-	else
-		info.high_limit = DEFAULT_MAP_WINDOW;
-
 	return vm_unmapped_area(&info);
 }
 
@@ -149,14 +149,18 @@ radix__arch_get_unmapped_area_topdown(struct file *filp,
 	struct vm_area_struct *vma;
 	struct mm_struct *mm = current->mm;
 	unsigned long addr = addr0;
+	unsigned long high_limit = DEFAULT_MAP_WINDOW;
 	struct vm_unmapped_area_info info;
 
 	if (unlikely(addr > mm->context.addr_limit &&
 		     mm->context.addr_limit != TASK_SIZE))
 		mm->context.addr_limit = TASK_SIZE;
 
+	if (addr > high_limit)
+		high_limit = TASK_SIZE;
+
 	/* requested length too big for entire address space */
-	if (len > mm->task_size - mmap_min_addr)
+	if (len > high_limit - mmap_min_addr)
 		return -ENOMEM;
 
 	if (flags & MAP_FIXED)
@@ -166,7 +170,7 @@ radix__arch_get_unmapped_area_topdown(struct file *filp,
 	if (addr) {
 		addr = PAGE_ALIGN(addr);
 		vma = find_vma(mm, addr);
-		if (mm->task_size - len >= addr && addr >= mmap_min_addr &&
+		if (high_limit - len >= addr && addr >= mmap_min_addr &&
 				(!vma || addr + len <= vm_start_gap(vma)))
 			return addr;
 	}
@@ -174,12 +178,9 @@ radix__arch_get_unmapped_area_topdown(struct file *filp,
 	info.flags = VM_UNMAPPED_AREA_TOPDOWN;
 	info.length = len;
 	info.low_limit = max(PAGE_SIZE, mmap_min_addr);
-	info.high_limit = mm->mmap_base;
+	info.high_limit = mm->mmap_base + (high_limit - DEFAULT_MAP_WINDOW);
 	info.align_mask = 0;
 
-	if (addr > DEFAULT_MAP_WINDOW)
-		info.high_limit += mm->context.addr_limit - DEFAULT_MAP_WINDOW;
-
 	addr = vm_unmapped_area(&info);
 	if (!(addr & ~PAGE_MASK))
 		return addr;
-- 
2.15.0


* [PATCH 5/5] powerpc/64s: mm_context.addr_limit is only used on hash
  2017-11-06 10:03 [PATCH 0/5] VA allocator fixes Nicholas Piggin
                   ` (3 preceding siblings ...)
  2017-11-06 10:03 ` [PATCH 4/5] powerpc/64s/radix: Fix 128TB-512TB virtual address boundary case allocation Nicholas Piggin
@ 2017-11-06 10:03 ` Nicholas Piggin
  2017-11-06 15:16 ` [PATCH 0/5] VA allocator fixes Florian Weimer
  5 siblings, 0 replies; 22+ messages in thread
From: Nicholas Piggin @ 2017-11-06 10:03 UTC
  To: linuxppc-dev; +Cc: Nicholas Piggin, Aneesh Kumar K . V, Florian Weimer

Radix keeps no meaningful state in addr_limit, so remove it from the
radix code and rename it to slb_addr_limit to make it clear it applies
to hash only.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
 arch/powerpc/include/asm/book3s/64/mmu-hash.h |  2 +-
 arch/powerpc/include/asm/book3s/64/mmu.h      |  2 +-
 arch/powerpc/include/asm/paca.h               |  2 +-
 arch/powerpc/kernel/asm-offsets.c             |  2 +-
 arch/powerpc/kernel/paca.c                    |  4 ++--
 arch/powerpc/kernel/setup-common.c            |  3 ++-
 arch/powerpc/mm/hugetlbpage-radix.c           |  3 ---
 arch/powerpc/mm/mmap.c                        |  8 -------
 arch/powerpc/mm/mmu_context_book3s64.c        |  4 ++--
 arch/powerpc/mm/slb_low.S                     |  2 +-
 arch/powerpc/mm/slice.c                       | 34 +++++++++++++--------------
 11 files changed, 28 insertions(+), 38 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/mmu-hash.h b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
index 508275bb05d5..e91e115a816f 100644
--- a/arch/powerpc/include/asm/book3s/64/mmu-hash.h
+++ b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
@@ -606,7 +606,7 @@ extern void slb_set_size(u16 size);
 
 /* 4 bits per slice and we have one slice per 1TB */
 #define SLICE_ARRAY_SIZE	(H_PGTABLE_RANGE >> 41)
-#define TASK_SLICE_ARRAY_SZ(x)	((x)->context.addr_limit >> 41)
+#define TASK_SLICE_ARRAY_SZ(x)	((x)->context.slb_addr_limit >> 41)
 
 #ifndef __ASSEMBLY__
 
diff --git a/arch/powerpc/include/asm/book3s/64/mmu.h b/arch/powerpc/include/asm/book3s/64/mmu.h
index c3b00e8ff791..49a07c5d9e50 100644
--- a/arch/powerpc/include/asm/book3s/64/mmu.h
+++ b/arch/powerpc/include/asm/book3s/64/mmu.h
@@ -92,7 +92,7 @@ typedef struct {
 #ifdef CONFIG_PPC_MM_SLICES
 	u64 low_slices_psize;	/* SLB page size encodings */
 	unsigned char high_slices_psize[SLICE_ARRAY_SIZE];
-	unsigned long addr_limit;
+	unsigned long slb_addr_limit;
 #else
 	u16 sllp;		/* SLB page size encoding */
 #endif
diff --git a/arch/powerpc/include/asm/paca.h b/arch/powerpc/include/asm/paca.h
index 7125efa6a6ae..2ef0c0da4bb7 100644
--- a/arch/powerpc/include/asm/paca.h
+++ b/arch/powerpc/include/asm/paca.h
@@ -143,7 +143,7 @@ struct paca_struct {
 #ifdef CONFIG_PPC_MM_SLICES
 	u64 mm_ctx_low_slices_psize;
 	unsigned char mm_ctx_high_slices_psize[SLICE_ARRAY_SIZE];
-	unsigned long addr_limit;
+	unsigned long mm_ctx_slb_addr_limit;
 #else
 	u16 mm_ctx_user_psize;
 	u16 mm_ctx_sllp;
diff --git a/arch/powerpc/kernel/asm-offsets.c b/arch/powerpc/kernel/asm-offsets.c
index 96c52235ecdc..912880873dfc 100644
--- a/arch/powerpc/kernel/asm-offsets.c
+++ b/arch/powerpc/kernel/asm-offsets.c
@@ -185,7 +185,7 @@ int main(void)
 #ifdef CONFIG_PPC_MM_SLICES
 	OFFSET(PACALOWSLICESPSIZE, paca_struct, mm_ctx_low_slices_psize);
 	OFFSET(PACAHIGHSLICEPSIZE, paca_struct, mm_ctx_high_slices_psize);
-	DEFINE(PACA_ADDR_LIMIT, offsetof(struct paca_struct, addr_limit));
+	OFFSET(PACA_SLB_ADDR_LIMIT, paca_struct, mm_ctx_slb_addr_limit);
 	DEFINE(MMUPSIZEDEFSIZE, sizeof(struct mmu_psize_def));
 #endif /* CONFIG_PPC_MM_SLICES */
 #endif
diff --git a/arch/powerpc/kernel/paca.c b/arch/powerpc/kernel/paca.c
index 2ff2b8a19f71..4c69d335863c 100644
--- a/arch/powerpc/kernel/paca.c
+++ b/arch/powerpc/kernel/paca.c
@@ -262,8 +262,8 @@ void copy_mm_to_paca(struct mm_struct *mm)
 
 	get_paca()->mm_ctx_id = context->id;
 #ifdef CONFIG_PPC_MM_SLICES
-	VM_BUG_ON(!mm->context.addr_limit);
-	get_paca()->addr_limit = mm->context.addr_limit;
+	VM_BUG_ON(!mm->context.slb_addr_limit);
+	get_paca()->mm_ctx_slb_addr_limit = mm->context.slb_addr_limit;
 	get_paca()->mm_ctx_low_slices_psize = context->low_slices_psize;
 	memcpy(&get_paca()->mm_ctx_high_slices_psize,
 	       &context->high_slices_psize, TASK_SLICE_ARRAY_SZ(mm));
diff --git a/arch/powerpc/kernel/setup-common.c b/arch/powerpc/kernel/setup-common.c
index 2e3bc16d02b2..8c4fa6086b39 100644
--- a/arch/powerpc/kernel/setup-common.c
+++ b/arch/powerpc/kernel/setup-common.c
@@ -898,7 +898,8 @@ void __init setup_arch(char **cmdline_p)
 
 #ifdef CONFIG_PPC_MM_SLICES
 #ifdef CONFIG_PPC64
-	init_mm.context.addr_limit = DEFAULT_MAP_WINDOW_USER64;
+	if (!radix_enabled())
+		init_mm.context.slb_addr_limit = DEFAULT_MAP_WINDOW_USER64;
 #else
 #error	"context.addr_limit not initialized."
 #endif
diff --git a/arch/powerpc/mm/hugetlbpage-radix.c b/arch/powerpc/mm/hugetlbpage-radix.c
index 9c6a411e9c85..0f69bdf33367 100644
--- a/arch/powerpc/mm/hugetlbpage-radix.c
+++ b/arch/powerpc/mm/hugetlbpage-radix.c
@@ -51,9 +51,6 @@ radix__hugetlb_get_unmapped_area(struct file *file, unsigned long addr,
 	unsigned long high_limit = DEFAULT_MAP_WINDOW;
 	struct vm_unmapped_area_info info;
 
-	if (unlikely(addr > mm->context.addr_limit && addr < TASK_SIZE))
-		mm->context.addr_limit = TASK_SIZE;
-
 	if (addr > high_limit)
 		high_limit = TASK_SIZE;
 
diff --git a/arch/powerpc/mm/mmap.c b/arch/powerpc/mm/mmap.c
index e6cb3b3f7e93..62c1191670b0 100644
--- a/arch/powerpc/mm/mmap.c
+++ b/arch/powerpc/mm/mmap.c
@@ -109,10 +109,6 @@ radix__arch_get_unmapped_area(struct file *filp, unsigned long addr,
 	unsigned long high_limit = DEFAULT_MAP_WINDOW;
 	struct vm_unmapped_area_info info;
 
-	if (unlikely(addr > mm->context.addr_limit &&
-		     mm->context.addr_limit != TASK_SIZE))
-		mm->context.addr_limit = TASK_SIZE;
-
 	if (addr > high_limit)
 		high_limit = TASK_SIZE;
 
@@ -152,10 +148,6 @@ radix__arch_get_unmapped_area_topdown(struct file *filp,
 	unsigned long high_limit = DEFAULT_MAP_WINDOW;
 	struct vm_unmapped_area_info info;
 
-	if (unlikely(addr > mm->context.addr_limit &&
-		     mm->context.addr_limit != TASK_SIZE))
-		mm->context.addr_limit = TASK_SIZE;
-
 	if (addr > high_limit)
 		high_limit = TASK_SIZE;
 
diff --git a/arch/powerpc/mm/mmu_context_book3s64.c b/arch/powerpc/mm/mmu_context_book3s64.c
index b94fb62e60fd..995c98f3f392 100644
--- a/arch/powerpc/mm/mmu_context_book3s64.c
+++ b/arch/powerpc/mm/mmu_context_book3s64.c
@@ -96,8 +96,8 @@ static int hash__init_new_context(struct mm_struct *mm)
 	 * In the case of exec, use the default limit,
 	 * otherwise inherit it from the mm we are duplicating.
 	 */
-	if (!mm->context.addr_limit)
-		mm->context.addr_limit = DEFAULT_MAP_WINDOW_USER64;
+	if (!mm->context.slb_addr_limit)
+		mm->context.slb_addr_limit = DEFAULT_MAP_WINDOW_USER64;
 
 	/*
 	 * The old code would re-promote on fork, we don't do that when using
diff --git a/arch/powerpc/mm/slb_low.S b/arch/powerpc/mm/slb_low.S
index 906a86fe457b..7046bb389704 100644
--- a/arch/powerpc/mm/slb_low.S
+++ b/arch/powerpc/mm/slb_low.S
@@ -167,7 +167,7 @@ END_MMU_FTR_SECTION_IFCLR(MMU_FTR_1T_SEGMENT)
         /*
          * user space make sure we are within the allowed limit
 	 */
-	ld	r11,PACA_ADDR_LIMIT(r13)
+	ld	r11,PACA_SLB_ADDR_LIMIT(r13)
 	cmpld	r3,r11
 	bge-	8f
 
diff --git a/arch/powerpc/mm/slice.c b/arch/powerpc/mm/slice.c
index f980397b449d..4933d5c71d57 100644
--- a/arch/powerpc/mm/slice.c
+++ b/arch/powerpc/mm/slice.c
@@ -96,7 +96,7 @@ static int slice_area_is_free(struct mm_struct *mm, unsigned long addr,
 {
 	struct vm_area_struct *vma;
 
-	if ((mm->context.addr_limit - len) < addr)
+	if ((mm->context.slb_addr_limit - len) < addr)
 		return 0;
 	vma = find_vma(mm, addr);
 	return (!vma || (addr + len) <= vm_start_gap(vma));
@@ -133,10 +133,10 @@ static void slice_mask_for_free(struct mm_struct *mm, struct slice_mask *ret)
 		if (!slice_low_has_vma(mm, i))
 			ret->low_slices |= 1u << i;
 
-	if (mm->context.addr_limit <= SLICE_LOW_TOP)
+	if (mm->context.slb_addr_limit <= SLICE_LOW_TOP)
 		return;
 
-	for (i = 0; i < GET_HIGH_SLICE_INDEX(mm->context.addr_limit); i++)
+	for (i = 0; i < GET_HIGH_SLICE_INDEX(mm->context.slb_addr_limit); i++)
 		if (!slice_high_has_vma(mm, i))
 			__set_bit(i, ret->high_slices);
 }
@@ -157,7 +157,7 @@ static void slice_mask_for_size(struct mm_struct *mm, int psize, struct slice_ma
 			ret->low_slices |= 1u << i;
 
 	hpsizes = mm->context.high_slices_psize;
-	for (i = 0; i < GET_HIGH_SLICE_INDEX(mm->context.addr_limit); i++) {
+	for (i = 0; i < GET_HIGH_SLICE_INDEX(mm->context.slb_addr_limit); i++) {
 		mask_index = i & 0x1;
 		index = i >> 1;
 		if (((hpsizes[index] >> (mask_index * 4)) & 0xf) == psize)
@@ -169,7 +169,7 @@ static int slice_check_fit(struct mm_struct *mm,
 			   struct slice_mask mask, struct slice_mask available)
 {
 	DECLARE_BITMAP(result, SLICE_NUM_HIGH);
-	unsigned long slice_count = GET_HIGH_SLICE_INDEX(mm->context.addr_limit);
+	unsigned long slice_count = GET_HIGH_SLICE_INDEX(mm->context.slb_addr_limit);
 
 	bitmap_and(result, mask.high_slices,
 		   available.high_slices, slice_count);
@@ -219,7 +219,7 @@ static void slice_convert(struct mm_struct *mm, struct slice_mask mask, int psiz
 	mm->context.low_slices_psize = lpsizes;
 
 	hpsizes = mm->context.high_slices_psize;
-	for (i = 0; i < GET_HIGH_SLICE_INDEX(mm->context.addr_limit); i++) {
+	for (i = 0; i < GET_HIGH_SLICE_INDEX(mm->context.slb_addr_limit); i++) {
 		mask_index = i & 0x1;
 		index = i >> 1;
 		if (test_bit(i, mask.high_slices))
@@ -329,8 +329,8 @@ static unsigned long slice_find_area_topdown(struct mm_struct *mm,
 	 * Only for that request for which high_limit is above
 	 * DEFAULT_MAP_WINDOW we should apply this.
 	 */
-	if (high_limit  > DEFAULT_MAP_WINDOW)
-		addr += mm->context.addr_limit - DEFAULT_MAP_WINDOW;
+	if (high_limit > DEFAULT_MAP_WINDOW)
+		addr += mm->context.slb_addr_limit - DEFAULT_MAP_WINDOW;
 
 	while (addr > PAGE_SIZE) {
 		info.high_limit = addr;
@@ -419,17 +419,17 @@ unsigned long slice_get_unmapped_area(unsigned long addr, unsigned long len,
 	/*
 	 * Check if we need to expland slice area.
 	 */
-	if (unlikely(((addr > mm->context.addr_limit) ||
-			(fixed && addr + len > mm->context.addr_limit)) &&
-		     mm->context.addr_limit != TASK_SIZE)) {
-		mm->context.addr_limit = TASK_SIZE;
+	if (unlikely(((addr > mm->context.slb_addr_limit) ||
+			(fixed && addr + len > mm->context.slb_addr_limit)) &&
+		     mm->context.slb_addr_limit != TASK_SIZE)) {
+		mm->context.slb_addr_limit = TASK_SIZE;
 		on_each_cpu(slice_flush_segments, mm, 1);
 	}
 	/*
 	 * This mmap request can allocate upt to 512TB
 	 */
 	if (addr > DEFAULT_MAP_WINDOW)
-		high_limit = mm->context.addr_limit;
+		high_limit = mm->context.slb_addr_limit;
 	else
 		high_limit = DEFAULT_MAP_WINDOW;
 	/*
@@ -447,20 +447,20 @@ unsigned long slice_get_unmapped_area(unsigned long addr, unsigned long len,
 
 	/* Sanity checks */
 	BUG_ON(mm->task_size == 0);
-	BUG_ON(mm->context.addr_limit == 0);
+	BUG_ON(mm->context.slb_addr_limit == 0);
 	VM_BUG_ON(radix_enabled());
 
 	slice_dbg("slice_get_unmapped_area(mm=%p, psize=%d...\n", mm, psize);
 	slice_dbg(" addr=%lx, len=%lx, flags=%lx, topdown=%d\n",
 		  addr, len, flags, topdown);
 
-	if (len > mm->context.addr_limit)
+	if (len > mm->context.slb_addr_limit)
 		return -ENOMEM;
 	if (len & ((1ul << pshift) - 1))
 		return -EINVAL;
 	if (fixed && (addr & ((1ul << pshift) - 1)))
 		return -EINVAL;
-	if (fixed && addr > (mm->context.addr_limit - len))
+	if (fixed && addr > (mm->context.slb_addr_limit - len))
 		return -ENOMEM;
 
 	/* If hint, make sure it matches our alignment restrictions */
@@ -468,7 +468,7 @@ unsigned long slice_get_unmapped_area(unsigned long addr, unsigned long len,
 		addr = _ALIGN_UP(addr, 1ul << pshift);
 		slice_dbg(" aligned addr=%lx\n", addr);
 		/* Ignore hint if it's too large or overlaps a VMA */
-		if (addr > mm->context.addr_limit - len ||
+		if (addr > mm->context.slb_addr_limit - len ||
 		    !slice_area_is_free(mm, addr, len))
 			addr = 0;
 	}
-- 
2.15.0


* Re: [PATCH 1/5] powerpc/64s/hash: Fix 128TB-512TB virtual address boundary case allocation
  2017-11-06 10:03 ` [PATCH 1/5] powerpc/64s/hash: Fix 128TB-512TB virtual address boundary case allocation Nicholas Piggin
@ 2017-11-06 10:38   ` Aneesh Kumar K.V
  2017-11-06 10:54     ` Nicholas Piggin
  0 siblings, 1 reply; 22+ messages in thread
From: Aneesh Kumar K.V @ 2017-11-06 10:38 UTC
  To: Nicholas Piggin, linuxppc-dev; +Cc: Nicholas Piggin, Florian Weimer

Nicholas Piggin <npiggin@gmail.com> writes:

> When allocating VA space with a hint that crosses 128TB, the SLB addr_limit
> variable is not expanded if addr is not > 128TB, but the slice allocation
> looks at task_size, which is 512TB. This results in slice_check_fit()
> incorrectly succeeding because the slice_count truncates off bit 128 of the
> requested mask, so the comparison to the available mask succeeds.


But then the mask passed to slice_check_fit() is generated using
context.addr_limit as the max value. So how did that return success? i.e.,
we get the request mask via

slice_range_to_mask(addr, len, &mask);

And the potential/possible mask using

slice_mask_for_size(mm, psize, &good_mask);

So how did slice_check_fit() return success with

slice_check_fit(mm, mask, good_mask);


>
> Fix this by using mm->context.addr_limit instead of mm->task_size for
> testing allocation limits. This causes such allocations to fail.
>
> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
> Fixes: f4ea6dcb08 ("powerpc/mm: Enable mappings above 128TB")
> Reported-by: Florian Weimer <fweimer@redhat.com>
> Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
> ---
>  arch/powerpc/mm/slice.c | 11 ++++++-----
>  1 file changed, 6 insertions(+), 5 deletions(-)
>
> diff --git a/arch/powerpc/mm/slice.c b/arch/powerpc/mm/slice.c
> index 45f6740dd407..567db541c0a1 100644
> --- a/arch/powerpc/mm/slice.c
> +++ b/arch/powerpc/mm/slice.c
> @@ -96,7 +96,7 @@ static int slice_area_is_free(struct mm_struct *mm, unsigned long addr,
>  {
>  	struct vm_area_struct *vma;
>
> -	if ((mm->task_size - len) < addr)
> +	if ((mm->context.addr_limit - len) < addr)

I was looking at these as generic boundary checks against the task size;
for the specific range check we should always have created the mask using
context.addr_limit. That should keep the boundary condition check the same
across radix/hash.

>  		return 0;
>  	vma = find_vma(mm, addr);
>  	return (!vma || (addr + len) <= vm_start_gap(vma));
> @@ -133,7 +133,7 @@ static void slice_mask_for_free(struct mm_struct *mm, struct slice_mask *ret)
>  		if (!slice_low_has_vma(mm, i))
>  			ret->low_slices |= 1u << i;
>
> -	if (mm->task_size <= SLICE_LOW_TOP)
> +	if (mm->context.addr_limit <= SLICE_LOW_TOP)
>  		return;
>
>  	for (i = 0; i < GET_HIGH_SLICE_INDEX(mm->context.addr_limit); i++)
> @@ -446,19 +446,20 @@ unsigned long slice_get_unmapped_area(unsigned long addr, unsigned long len,
>
>  	/* Sanity checks */
>  	BUG_ON(mm->task_size == 0);
> +	BUG_ON(mm->context.addr_limit == 0);
>  	VM_BUG_ON(radix_enabled());
>
>  	slice_dbg("slice_get_unmapped_area(mm=%p, psize=%d...\n", mm, psize);
>  	slice_dbg(" addr=%lx, len=%lx, flags=%lx, topdown=%d\n",
>  		  addr, len, flags, topdown);
>
> -	if (len > mm->task_size)
> +	if (len > mm->context.addr_limit)
>  		return -ENOMEM;
>  	if (len & ((1ul << pshift) - 1))
>  		return -EINVAL;
>  	if (fixed && (addr & ((1ul << pshift) - 1)))
>  		return -EINVAL;
> -	if (fixed && addr > (mm->task_size - len))
> +	if (fixed && addr > (mm->context.addr_limit - len))
>  		return -ENOMEM;
>
>  	/* If hint, make sure it matches our alignment restrictions */
> @@ -466,7 +467,7 @@ unsigned long slice_get_unmapped_area(unsigned long addr, unsigned long len,
>  		addr = _ALIGN_UP(addr, 1ul << pshift);
>  		slice_dbg(" aligned addr=%lx\n", addr);
>  		/* Ignore hint if it's too large or overlaps a VMA */
> -		if (addr > mm->task_size - len ||
> +		if (addr > mm->context.addr_limit - len ||
>  		    !slice_area_is_free(mm, addr, len))
>  			addr = 0;
>  	}
> -- 
> 2.15.0


* Re: [PATCH 2/5] powerpc/64s/hash: Allow MAP_FIXED allocations to cross 128TB boundary
  2017-11-06 10:03 ` [PATCH 2/5] powerpc/64s/hash: Allow MAP_FIXED allocations to cross 128TB boundary Nicholas Piggin
@ 2017-11-06 10:44   ` Aneesh Kumar K.V
  2017-11-06 11:55     ` Nicholas Piggin
  0 siblings, 1 reply; 22+ messages in thread
From: Aneesh Kumar K.V @ 2017-11-06 10:44 UTC
  To: Nicholas Piggin, linuxppc-dev, Kirill A. Shutemov
  Cc: Nicholas Piggin, Florian Weimer

Nicholas Piggin <npiggin@gmail.com> writes:

> While mapping hints with a length that cross 128TB are disallowed,
> MAP_FIXED allocations that cross 128TB are allowed. These are failing
> on hash (on radix they succeed). Add an additional case for fixed
> mappings to expand the addr_limit when crossing 128TB.

Shouldn't that be fixed in radix? But I see x86 also doing this?


	if (flags & MAP_FIXED)
		return addr;

Kirill,

Is that expected?


>
> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
> Fixes: f4ea6dcb08 ("powerpc/mm: Enable mappings above 128TB")
> Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
> ---
>  arch/powerpc/mm/slice.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/arch/powerpc/mm/slice.c b/arch/powerpc/mm/slice.c
> index 567db541c0a1..f980397b449d 100644
> --- a/arch/powerpc/mm/slice.c
> +++ b/arch/powerpc/mm/slice.c
> @@ -419,7 +419,8 @@ unsigned long slice_get_unmapped_area(unsigned long addr, unsigned long len,
>  	/*
>  	 * Check if we need to expland slice area.
>  	 */
> -	if (unlikely(addr > mm->context.addr_limit &&
> +	if (unlikely(((addr > mm->context.addr_limit) ||
> +			(fixed && addr + len > mm->context.addr_limit)) &&
>  		     mm->context.addr_limit != TASK_SIZE)) {
>  		mm->context.addr_limit = TASK_SIZE;
>  		on_each_cpu(slice_flush_segments, mm, 1);
> -- 
> 2.15.0


-aneesh


* Re: [PATCH 3/5] powerpc/64s/hash: Fix fork() with 512TB process address space
  2017-11-06 10:03 ` [PATCH 3/5] powerpc/64s/hash: Fix fork() with 512TB process address space Nicholas Piggin
@ 2017-11-06 10:44   ` Aneesh Kumar K.V
  0 siblings, 0 replies; 22+ messages in thread
From: Aneesh Kumar K.V @ 2017-11-06 10:44 UTC
  To: Nicholas Piggin, linuxppc-dev; +Cc: Nicholas Piggin, Florian Weimer

Nicholas Piggin <npiggin@gmail.com> writes:

> Hash unconditionally resets the addr_limit to default (128TB) when
> the mm context is initialised. If a process has > 128TB mappings when
> it forks, the child will not get the 512TB addr_limit, so accesses to
> valid > 128TB mappings will fail in the child.
>
> Fix this by only resetting the addr_limit to default if it was 0. Non-zero
> indicates it was duplicated from the parent (0 means exec()).
>

Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>

> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
> Fixes: f4ea6dcb08 ("powerpc/mm: Enable mappings above 128TB")
> Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
> ---
>  arch/powerpc/mm/mmu_context_book3s64.c | 8 ++++----
>  1 file changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/arch/powerpc/mm/mmu_context_book3s64.c b/arch/powerpc/mm/mmu_context_book3s64.c
> index 05e15386d4cb..b94fb62e60fd 100644
> --- a/arch/powerpc/mm/mmu_context_book3s64.c
> +++ b/arch/powerpc/mm/mmu_context_book3s64.c
> @@ -93,11 +93,11 @@ static int hash__init_new_context(struct mm_struct *mm)
>  		return index;
>
>  	/*
> -	 * We do switch_slb() early in fork, even before we setup the
> -	 * mm->context.addr_limit. Default to max task size so that we copy the
> -	 * default values to paca which will help us to handle slb miss early.
> +	 * In the case of exec, use the default limit,
> +	 * otherwise inherit it from the mm we are duplicating.
>  	 */
> -	mm->context.addr_limit = DEFAULT_MAP_WINDOW_USER64;
> +	if (!mm->context.addr_limit)
> +		mm->context.addr_limit = DEFAULT_MAP_WINDOW_USER64;
>
>  	/*
>  	 * The old code would re-promote on fork, we don't do that when using
> -- 
> 2.15.0


* Re: [PATCH 1/5] powerpc/64s/hash: Fix 128TB-512TB virtual address boundary case allocation
  2017-11-06 10:38   ` Aneesh Kumar K.V
@ 2017-11-06 10:54     ` Nicholas Piggin
  2017-11-06 11:05       ` Aneesh Kumar K.V
  0 siblings, 1 reply; 22+ messages in thread
From: Nicholas Piggin @ 2017-11-06 10:54 UTC
  To: Aneesh Kumar K.V; +Cc: linuxppc-dev, Florian Weimer

On Mon, 06 Nov 2017 16:08:06 +0530
"Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> wrote:

> Nicholas Piggin <npiggin@gmail.com> writes:
> 
> > When allocating VA space with a hint that crosses 128TB, the SLB addr_limit
> > variable is not expanded if addr is not > 128TB, but the slice allocation
> > looks at task_size, which is 512TB. This results in slice_check_fit()
> > incorrectly succeeding because the slice_count truncates off bit 128 of the
> > requested mask, so the comparison to the available mask succeeds.  
> 
> 
> But then the mask passed to slice_check_fit() is generated using
> context.addr_limit as the max value. So how did that return success? i.e.,
> we get the request mask via
> 
> slice_range_to_mask(addr, len, &mask);
> 
> And the potential/possible mask using
> 
> slice_mask_for_size(mm, psize, &good_mask);
> 
> So how did slice_check_fit() return success with
> 
> slice_check_fit(mm, mask, good_mask);

Because the addr_limit check is used to *limit* the comparison.

The available mask had bits up to 127 set, and the requested mask had
bits 127 and 128 set. However, the 128T addr_limit causes only bits
0-127 to be compared.

> > Fix this by using mm->context.addr_limit instead of mm->task_size for
> > testing allocation limits. This causes such allocations to fail.
> >
> > Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
> > Fixes: f4ea6dcb08 ("powerpc/mm: Enable mappings above 128TB")
> > Reported-by: Florian Weimer <fweimer@redhat.com>
> > Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
> > ---
> >  arch/powerpc/mm/slice.c | 11 ++++++-----
> >  1 file changed, 6 insertions(+), 5 deletions(-)
> >
> > diff --git a/arch/powerpc/mm/slice.c b/arch/powerpc/mm/slice.c
> > index 45f6740dd407..567db541c0a1 100644
> > --- a/arch/powerpc/mm/slice.c
> > +++ b/arch/powerpc/mm/slice.c
> > @@ -96,7 +96,7 @@ static int slice_area_is_free(struct mm_struct *mm, unsigned long addr,
> >  {
> >  	struct vm_area_struct *vma;
> >
> > -	if ((mm->task_size - len) < addr)
> > +	if ((mm->context.addr_limit - len) < addr)  
> 
> I was looking at these as generic boundary checks against the task size;
> for the specific range check we should always have created the mask using
> context.addr_limit. That should keep the boundary condition check the same
> across radix/hash.

We need to actually fix the radix case too for other-but-similar reasons,
so fixing it like this does end up with the same tests for both. See
the later radix patch.

Thanks,
Nick


* Re: [PATCH 1/5] powerpc/64s/hash: Fix 128TB-512TB virtual address boundary case allocation
  2017-11-06 10:54     ` Nicholas Piggin
@ 2017-11-06 11:05       ` Aneesh Kumar K.V
  2017-11-06 11:21         ` Nicholas Piggin
  2017-11-07  2:00         ` Aneesh Kumar K.V
  0 siblings, 2 replies; 22+ messages in thread
From: Aneesh Kumar K.V @ 2017-11-06 11:05 UTC
  To: Nicholas Piggin; +Cc: linuxppc-dev, Florian Weimer



On 11/06/2017 04:24 PM, Nicholas Piggin wrote:
> On Mon, 06 Nov 2017 16:08:06 +0530
> "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> wrote:
> 
>> Nicholas Piggin <npiggin@gmail.com> writes:
>>
>>> When allocating VA space with a hint that crosses 128TB, the SLB addr_limit
>>> variable is not expanded if addr is not > 128TB, but the slice allocation
>>> looks at task_size, which is 512TB. This results in slice_check_fit()
>>> incorrectly succeeding because the slice_count truncates off bit 128 of the
>>> requested mask, so the comparison to the available mask succeeds.
>>
>>
>> But then the mask passed to slice_check_fit() is generated using
>> context.addr_limit as the max value. So how did that return success? i.e.,
>> we get the request mask via
>>
>> slice_range_to_mask(addr, len, &mask);
>>
>> And the potential/possible mask using
>>
>> slice_mask_for_size(mm, psize, &good_mask);
>>
>> So how did slice_check_fit() return success with
>>
>> slice_check_fit(mm, mask, good_mask);
> 
> Because the addr_limit check is used to *limit* the comparison.
> 
> The available mask had bits up to 127 set, and the requested mask had
> bits 127 and 128 set. However, the 128T addr_limit causes only bits
> 0-127 to be compared.
>

Should we fix it then via the change below? I haven't tested this yet.
Also, this results in us comparing more bits?

modified   arch/powerpc/mm/slice.c
@@ -169,13 +169,12 @@ static int slice_check_fit(struct mm_struct *mm,
  			   struct slice_mask mask, struct slice_mask available)
  {
  	DECLARE_BITMAP(result, SLICE_NUM_HIGH);
-	unsigned long slice_count = GET_HIGH_SLICE_INDEX(mm->context.addr_limit);

  	bitmap_and(result, mask.high_slices,
-		   available.high_slices, slice_count);
+		   available.high_slices, SLICE_NUM_HIGH);

  	return (mask.low_slices & available.low_slices) == mask.low_slices &&
-		bitmap_equal(result, mask.high_slices, slice_count);
+		bitmap_equal(result, mask.high_slices, SLICE_NUM_HIGH);


-aneesh


* Re: [PATCH 4/5] powerpc/64s/radix: Fix 128TB-512TB virtual address boundary case allocation
  2017-11-06 10:03 ` [PATCH 4/5] powerpc/64s/radix: Fix 128TB-512TB virtual address boundary case allocation Nicholas Piggin
@ 2017-11-06 11:14   ` Aneesh Kumar K.V
  2017-11-06 11:42     ` Nicholas Piggin
  0 siblings, 1 reply; 22+ messages in thread
From: Aneesh Kumar K.V @ 2017-11-06 11:14 UTC
  To: Nicholas Piggin, linuxppc-dev; +Cc: Nicholas Piggin, Florian Weimer

Nicholas Piggin <npiggin@gmail.com> writes:

> Radix VA space allocations test addresses against mm->task_size which is
> 512TB, even in cases where the intention is to limit allocation to below
> 128TB.
>
> This results in mmap with a hint address below 128TB but address + length
> above 128TB succeeding when it should fail (as hash does after the
> previous patch).
>
> Set the high address limit to be considered up front, and base subsequent
> allocation checks on that consistently.

Doesn't setting info.high_limit take care of that? I would expect
vm_unmapped_area to fail based on info.high_limit. Is this with MAP_FIXED?


>
> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
> Fixes: f4ea6dcb08 ("powerpc/mm: Enable mappings above 128TB")
> Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
> ---
>  arch/powerpc/mm/hugetlbpage-radix.c | 13 +++++++------
>  arch/powerpc/mm/mmap.c              | 27 ++++++++++++++-------------
>  2 files changed, 21 insertions(+), 19 deletions(-)
>
> diff --git a/arch/powerpc/mm/hugetlbpage-radix.c b/arch/powerpc/mm/hugetlbpage-radix.c
> index a12e86395025..9c6a411e9c85 100644
> --- a/arch/powerpc/mm/hugetlbpage-radix.c
> +++ b/arch/powerpc/mm/hugetlbpage-radix.c
> @@ -48,14 +48,18 @@ radix__hugetlb_get_unmapped_area(struct file *file, unsigned long addr,
>  	struct mm_struct *mm = current->mm;
>  	struct vm_area_struct *vma;
>  	struct hstate *h = hstate_file(file);
> +	unsigned long high_limit = DEFAULT_MAP_WINDOW;
>  	struct vm_unmapped_area_info info;
>
>  	if (unlikely(addr > mm->context.addr_limit && addr < TASK_SIZE))
>  		mm->context.addr_limit = TASK_SIZE;
>
> +	if (addr > high_limit)
> +		high_limit = TASK_SIZE;
> +
>  	if (len & ~huge_page_mask(h))
>  		return -EINVAL;
> -	if (len > mm->task_size)
> +	if (len > high_limit)
>  		return -ENOMEM;
>
>  	if (flags & MAP_FIXED) {
> @@ -67,7 +71,7 @@ radix__hugetlb_get_unmapped_area(struct file *file, unsigned long addr,
>  	if (addr) {
>  		addr = ALIGN(addr, huge_page_size(h));
>  		vma = find_vma(mm, addr);
> -		if (mm->task_size - len >= addr &&
> +		if (high_limit - len >= addr &&
>  		    (!vma || addr + len <= vm_start_gap(vma)))
>  			return addr;
>  	}
> @@ -78,12 +82,9 @@ radix__hugetlb_get_unmapped_area(struct file *file, unsigned long addr,
>  	info.flags = VM_UNMAPPED_AREA_TOPDOWN;
>  	info.length = len;
>  	info.low_limit = PAGE_SIZE;
> -	info.high_limit = current->mm->mmap_base;
> +	info.high_limit = mm->mmap_base + (high_limit - DEFAULT_MAP_WINDOW);
>  	info.align_mask = PAGE_MASK & ~huge_page_mask(h);
>  	info.align_offset = 0;
>
> -	if (addr > DEFAULT_MAP_WINDOW)
> -		info.high_limit += mm->context.addr_limit - DEFAULT_MAP_WINDOW;
> -
>  	return vm_unmapped_area(&info);
>  }
> diff --git a/arch/powerpc/mm/mmap.c b/arch/powerpc/mm/mmap.c
> index 5d78b193fec4..e6cb3b3f7e93 100644
> --- a/arch/powerpc/mm/mmap.c
> +++ b/arch/powerpc/mm/mmap.c
> @@ -106,13 +106,17 @@ radix__arch_get_unmapped_area(struct file *filp, unsigned long addr,
>  {
>  	struct mm_struct *mm = current->mm;
>  	struct vm_area_struct *vma;
> +	unsigned long high_limit = DEFAULT_MAP_WINDOW;
>  	struct vm_unmapped_area_info info;
>
>  	if (unlikely(addr > mm->context.addr_limit &&
>  		     mm->context.addr_limit != TASK_SIZE))
>  		mm->context.addr_limit = TASK_SIZE;
>
> -	if (len > mm->task_size - mmap_min_addr)
> +	if (addr > high_limit)
> +		high_limit = TASK_SIZE;
> +
> +	if (len > high_limit - mmap_min_addr)
>  		return -ENOMEM;
>
>  	if (flags & MAP_FIXED)
> @@ -121,7 +125,7 @@ radix__arch_get_unmapped_area(struct file *filp, unsigned long addr,
>  	if (addr) {
>  		addr = PAGE_ALIGN(addr);
>  		vma = find_vma(mm, addr);
> -		if (mm->task_size - len >= addr && addr >= mmap_min_addr &&
> +		if (high_limit - len >= addr && addr >= mmap_min_addr &&
>  		    (!vma || addr + len <= vm_start_gap(vma)))
>  			return addr;
>  	}
> @@ -129,13 +133,9 @@ radix__arch_get_unmapped_area(struct file *filp, unsigned long addr,
>  	info.flags = 0;
>  	info.length = len;
>  	info.low_limit = mm->mmap_base;
> +	info.high_limit = high_limit;
>  	info.align_mask = 0;
>
> -	if (unlikely(addr > DEFAULT_MAP_WINDOW))
> -		info.high_limit = mm->context.addr_limit;
> -	else
> -		info.high_limit = DEFAULT_MAP_WINDOW;
> -
>  	return vm_unmapped_area(&info);
>  }
>
> @@ -149,14 +149,18 @@ radix__arch_get_unmapped_area_topdown(struct file *filp,
>  	struct vm_area_struct *vma;
>  	struct mm_struct *mm = current->mm;
>  	unsigned long addr = addr0;
> +	unsigned long high_limit = DEFAULT_MAP_WINDOW;
>  	struct vm_unmapped_area_info info;
>
>  	if (unlikely(addr > mm->context.addr_limit &&
>  		     mm->context.addr_limit != TASK_SIZE))
>  		mm->context.addr_limit = TASK_SIZE;
>
> +	if (addr > high_limit)
> +		high_limit = TASK_SIZE;
> +
>  	/* requested length too big for entire address space */
> -	if (len > mm->task_size - mmap_min_addr)
> +	if (len > high_limit - mmap_min_addr)
>  		return -ENOMEM;
>
>  	if (flags & MAP_FIXED)
> @@ -166,7 +170,7 @@ radix__arch_get_unmapped_area_topdown(struct file *filp,
>  	if (addr) {
>  		addr = PAGE_ALIGN(addr);
>  		vma = find_vma(mm, addr);
> -		if (mm->task_size - len >= addr && addr >= mmap_min_addr &&
> +		if (high_limit - len >= addr && addr >= mmap_min_addr &&
>  				(!vma || addr + len <= vm_start_gap(vma)))
>  			return addr;
>  	}
> @@ -174,12 +178,9 @@ radix__arch_get_unmapped_area_topdown(struct file *filp,
>  	info.flags = VM_UNMAPPED_AREA_TOPDOWN;
>  	info.length = len;
>  	info.low_limit = max(PAGE_SIZE, mmap_min_addr);
> -	info.high_limit = mm->mmap_base;
> +	info.high_limit = mm->mmap_base + (high_limit - DEFAULT_MAP_WINDOW);
>  	info.align_mask = 0;
>
> -	if (addr > DEFAULT_MAP_WINDOW)
> -		info.high_limit += mm->context.addr_limit - DEFAULT_MAP_WINDOW;
> -
>  	addr = vm_unmapped_area(&info);
>  	if (!(addr & ~PAGE_MASK))
>  		return addr;
> -- 
> 2.15.0


* Re: [PATCH 1/5] powerpc/64s/hash: Fix 128TB-512TB virtual address boundary case allocation
  2017-11-06 11:05       ` Aneesh Kumar K.V
@ 2017-11-06 11:21         ` Nicholas Piggin
  2017-11-07  2:00         ` Aneesh Kumar K.V
  1 sibling, 0 replies; 22+ messages in thread
From: Nicholas Piggin @ 2017-11-06 11:21 UTC
  To: Aneesh Kumar K.V; +Cc: linuxppc-dev, Florian Weimer

On Mon, 6 Nov 2017 16:35:43 +0530
"Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> wrote:

> On 11/06/2017 04:24 PM, Nicholas Piggin wrote:
> > On Mon, 06 Nov 2017 16:08:06 +0530
> > "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> wrote:
> >   
> >> Nicholas Piggin <npiggin@gmail.com> writes:
> >>  
> >>> When allocating VA space with a hint that crosses 128TB, the SLB addr_limit
> >>> variable is not expanded if addr is not > 128TB, but the slice allocation
> >>> looks at task_size, which is 512TB. This results in slice_check_fit()
> >>> incorrectly succeeding because the slice_count truncates off bit 128 of the
> >>> requested mask, so the comparison to the available mask succeeds.  
> >>
> >>
> >> But then the mask passed to slice_check_fit() is generated using
> >> context.addr_limit as the max value. So how did that return success? i.e.,
> >> we get the request mask via
> >>
> >> slice_range_to_mask(addr, len, &mask);
> >>
> >> And the potential/possible mask using
> >>
> >> slice_mask_for_size(mm, psize, &good_mask);
> >>
> >> So how did slice_check_fit() return success with
> >>
> >> slice_check_fit(mm, mask, good_mask);  
> > 
> > Because the addr_limit check is used to *limit* the comparison.
> > 
> > The available mask had bits up to 127 set, and the requested mask had
> > bits 127 and 128 set. However, the 128T addr_limit causes only bits
> > 0-127 to be compared.
> >  
> 
> Should we fix it then via the change below? I haven't tested this yet.
> Also, this results in us comparing more bits?

I prefer not to rely on that as the fix, because we should not be calling
into the slice code with an address beyond addr_limit IMO. There are quite
a few other places that use addr_limit. So I prefer my patch.

You could add this as an extra check, but yes, it does result in more of
the bitmap to test. So if anything I would prefer to go the other way and
actually reduce the scope of *other* bitmap operations that are now using
SLICE_NUM_HIGH by similarly using addr_limit (if there are other
performance-critical ones).

We could add some VM_BUG_ON checks to ensure tail bits are zero if
that's a concern.
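
Something like this rough sketch (hypothetical, untested):

	/* assert no slice bits are set beyond the current addr_limit */
	VM_BUG_ON(find_next_bit(mask.high_slices, SLICE_NUM_HIGH,
				GET_HIGH_SLICE_INDEX(mm->context.addr_limit))
		  < SLICE_NUM_HIGH);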

> 
> modified   arch/powerpc/mm/slice.c
> @@ -169,13 +169,12 @@ static int slice_check_fit(struct mm_struct *mm,
>   			   struct slice_mask mask, struct slice_mask available)
>   {
>   	DECLARE_BITMAP(result, SLICE_NUM_HIGH);
> -	unsigned long slice_count = GET_HIGH_SLICE_INDEX(mm->context.addr_limit);
> 
>   	bitmap_and(result, mask.high_slices,
> -		   available.high_slices, slice_count);
> +		   available.high_slices, SLICE_NUM_HIGH);
> 
>   	return (mask.low_slices & available.low_slices) == mask.low_slices &&
> -		bitmap_equal(result, mask.high_slices, slice_count);
> +		bitmap_equal(result, mask.high_slices, SLICE_NUM_HIGH);
> 
> 
> -aneesh
> 


* Re: [PATCH 4/5] powerpc/64s/radix: Fix 128TB-512TB virtual address boundary case allocation
  2017-11-06 11:14   ` Aneesh Kumar K.V
@ 2017-11-06 11:42     ` Nicholas Piggin
  0 siblings, 0 replies; 22+ messages in thread
From: Nicholas Piggin @ 2017-11-06 11:42 UTC
  To: Aneesh Kumar K.V; +Cc: linuxppc-dev, Florian Weimer

On Mon, 06 Nov 2017 16:44:48 +0530
"Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> wrote:

> Nicholas Piggin <npiggin@gmail.com> writes:
> 
> > Radix VA space allocations test addresses against mm->task_size which is
> > 512TB, even in cases where the intention is to limit allocation to below
> > 128TB.
> >
> > This results in mmap with a hint address below 128TB but address + length
> > above 128TB succeeding when it should fail (as hash does after the
> > previous patch).
> >
> > Set the high address limit to be considered up front, and base subsequent
> > allocation checks on that consistently.  
> 
> Doesn't setting info.high_limit take care of that? I would expect
> vm_unmapped_area to fail based on info.high_limit.

No, it is the hint address case. info.high_limit only gets involved if
the hint area was unavailable.

I prefer the behaviour without this fix because I disagree that the explicit
address request should fail, but this is what you asked for.

Actually, now I come to look again, it seems that generic code does *not*
fail in this case either! Any explicit hint will succeed if it partially
or completely crosses 128TB. This is much better behaviour, so I think
powerpc has it wrong.

> Is this with MAP_FIXED?

With MAP_FIXED, it remains as succeeding as expected (like generic code
and hash). I did not change that case.

> 
> 
> >
> > Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
> > Fixes: f4ea6dcb08 ("powerpc/mm: Enable mappings above 128TB")
> > Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
> > ---
> >  arch/powerpc/mm/hugetlbpage-radix.c | 13 +++++++------
> >  arch/powerpc/mm/mmap.c              | 27 ++++++++++++++-------------
> >  2 files changed, 21 insertions(+), 19 deletions(-)
> >
> > diff --git a/arch/powerpc/mm/hugetlbpage-radix.c b/arch/powerpc/mm/hugetlbpage-radix.c
> > index a12e86395025..9c6a411e9c85 100644
> > --- a/arch/powerpc/mm/hugetlbpage-radix.c
> > +++ b/arch/powerpc/mm/hugetlbpage-radix.c
> > @@ -48,14 +48,18 @@ radix__hugetlb_get_unmapped_area(struct file *file, unsigned long addr,
> >  	struct mm_struct *mm = current->mm;
> >  	struct vm_area_struct *vma;
> >  	struct hstate *h = hstate_file(file);
> > +	unsigned long high_limit = DEFAULT_MAP_WINDOW;
> >  	struct vm_unmapped_area_info info;
> >
> >  	if (unlikely(addr > mm->context.addr_limit && addr < TASK_SIZE))
> >  		mm->context.addr_limit = TASK_SIZE;
> >
> > +	if (addr > high_limit)
> > +		high_limit = TASK_SIZE;
> > +
> >  	if (len & ~huge_page_mask(h))
> >  		return -EINVAL;
> > -	if (len > mm->task_size)
> > +	if (len > high_limit)
> >  		return -ENOMEM;
> >
> >  	if (flags & MAP_FIXED) {
> > @@ -67,7 +71,7 @@ radix__hugetlb_get_unmapped_area(struct file *file, unsigned long addr,
> >  	if (addr) {
> >  		addr = ALIGN(addr, huge_page_size(h));
> >  		vma = find_vma(mm, addr);
> > -		if (mm->task_size - len >= addr &&
> > +		if (high_limit - len >= addr &&
> >  		    (!vma || addr + len <= vm_start_gap(vma)))
> >  			return addr;
> >  	}
> > @@ -78,12 +82,9 @@ radix__hugetlb_get_unmapped_area(struct file *file, unsigned long addr,
> >  	info.flags = VM_UNMAPPED_AREA_TOPDOWN;
> >  	info.length = len;
> >  	info.low_limit = PAGE_SIZE;
> > -	info.high_limit = current->mm->mmap_base;
> > +	info.high_limit = mm->mmap_base + (high_limit - DEFAULT_MAP_WINDOW);
> >  	info.align_mask = PAGE_MASK & ~huge_page_mask(h);
> >  	info.align_offset = 0;
> >
> > -	if (addr > DEFAULT_MAP_WINDOW)
> > -		info.high_limit += mm->context.addr_limit - DEFAULT_MAP_WINDOW;
> > -
> >  	return vm_unmapped_area(&info);
> >  }
> > diff --git a/arch/powerpc/mm/mmap.c b/arch/powerpc/mm/mmap.c
> > index 5d78b193fec4..e6cb3b3f7e93 100644
> > --- a/arch/powerpc/mm/mmap.c
> > +++ b/arch/powerpc/mm/mmap.c
> > @@ -106,13 +106,17 @@ radix__arch_get_unmapped_area(struct file *filp, unsigned long addr,
> >  {
> >  	struct mm_struct *mm = current->mm;
> >  	struct vm_area_struct *vma;
> > +	unsigned long high_limit = DEFAULT_MAP_WINDOW;
> >  	struct vm_unmapped_area_info info;
> >
> >  	if (unlikely(addr > mm->context.addr_limit &&
> >  		     mm->context.addr_limit != TASK_SIZE))
> >  		mm->context.addr_limit = TASK_SIZE;
> >
> > -	if (len > mm->task_size - mmap_min_addr)
> > +	if (addr > high_limit)
> > +		high_limit = TASK_SIZE;
> > +
> > +	if (len > high_limit - mmap_min_addr)
> >  		return -ENOMEM;
> >
> >  	if (flags & MAP_FIXED)
> > @@ -121,7 +125,7 @@ radix__arch_get_unmapped_area(struct file *filp, unsigned long addr,
> >  	if (addr) {
> >  		addr = PAGE_ALIGN(addr);
> >  		vma = find_vma(mm, addr);
> > -		if (mm->task_size - len >= addr && addr >= mmap_min_addr &&
> > +		if (high_limit - len >= addr && addr >= mmap_min_addr &&
> >  		    (!vma || addr + len <= vm_start_gap(vma)))
> >  			return addr;
> >  	}
> > @@ -129,13 +133,9 @@ radix__arch_get_unmapped_area(struct file *filp, unsigned long addr,
> >  	info.flags = 0;
> >  	info.length = len;
> >  	info.low_limit = mm->mmap_base;
> > +	info.high_limit = high_limit;
> >  	info.align_mask = 0;
> >
> > -	if (unlikely(addr > DEFAULT_MAP_WINDOW))
> > -		info.high_limit = mm->context.addr_limit;
> > -	else
> > -		info.high_limit = DEFAULT_MAP_WINDOW;
> > -
> >  	return vm_unmapped_area(&info);
> >  }
> >
> > @@ -149,14 +149,18 @@ radix__arch_get_unmapped_area_topdown(struct file *filp,
> >  	struct vm_area_struct *vma;
> >  	struct mm_struct *mm = current->mm;
> >  	unsigned long addr = addr0;
> > +	unsigned long high_limit = DEFAULT_MAP_WINDOW;
> >  	struct vm_unmapped_area_info info;
> >
> >  	if (unlikely(addr > mm->context.addr_limit &&
> >  		     mm->context.addr_limit != TASK_SIZE))
> >  		mm->context.addr_limit = TASK_SIZE;
> >
> > +	if (addr > high_limit)
> > +		high_limit = TASK_SIZE;
> > +
> >  	/* requested length too big for entire address space */
> > -	if (len > mm->task_size - mmap_min_addr)
> > +	if (len > high_limit - mmap_min_addr)
> >  		return -ENOMEM;
> >
> >  	if (flags & MAP_FIXED)
> > @@ -166,7 +170,7 @@ radix__arch_get_unmapped_area_topdown(struct file *filp,
> >  	if (addr) {
> >  		addr = PAGE_ALIGN(addr);
> >  		vma = find_vma(mm, addr);
> > -		if (mm->task_size - len >= addr && addr >= mmap_min_addr &&
> > +		if (high_limit - len >= addr && addr >= mmap_min_addr &&
> >  				(!vma || addr + len <= vm_start_gap(vma)))
> >  			return addr;
> >  	}
> > @@ -174,12 +178,9 @@ radix__arch_get_unmapped_area_topdown(struct file *filp,
> >  	info.flags = VM_UNMAPPED_AREA_TOPDOWN;
> >  	info.length = len;
> >  	info.low_limit = max(PAGE_SIZE, mmap_min_addr);
> > -	info.high_limit = mm->mmap_base;
> > +	info.high_limit = mm->mmap_base + (high_limit - DEFAULT_MAP_WINDOW);
> >  	info.align_mask = 0;
> >
> > -	if (addr > DEFAULT_MAP_WINDOW)
> > -		info.high_limit += mm->context.addr_limit - DEFAULT_MAP_WINDOW;
> > -
> >  	addr = vm_unmapped_area(&info);
> >  	if (!(addr & ~PAGE_MASK))
> >  		return addr;
> > -- 
> > 2.15.0  
> 

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 2/5] powerpc/64s/hash: Allow MAP_FIXED allocations to cross 128TB boundary
  2017-11-06 10:44   ` Aneesh Kumar K.V
@ 2017-11-06 11:55     ` Nicholas Piggin
  2017-11-07  2:28       ` Michael Ellerman
  0 siblings, 1 reply; 22+ messages in thread
From: Nicholas Piggin @ 2017-11-06 11:55 UTC (permalink / raw)
  To: Aneesh Kumar K.V; +Cc: linuxppc-dev, Kirill A. Shutemov, Florian Weimer

On Mon, 06 Nov 2017 16:14:10 +0530
"Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> wrote:

> Nicholas Piggin <npiggin@gmail.com> writes:
> 
> > While mapping hints whose length crosses 128TB are disallowed,
> > MAP_FIXED allocations that cross 128TB are allowed. These are failing
> > on hash (on radix they succeed). Add an additional case for fixed
> > mappings to expand the addr_limit when crossing 128TB.  
> 
> Shouldn't that be fixed in radix? But I see x86 also doing this?
> 
> 
> 	if (flags & MAP_FIXED)
> 		return addr;
> 
> Kirill,
> 
> Is that expected?

I should actually reply to this one because the other did not
have Kirill on cc.

Generic mapping code appears to always succeed when given an
explicit hint request, even if the address is below the boundary
and address + length is above it, even when !MAP_FIXED. This is the
sane behaviour AFAIKS. So we should switch powerpc to match,
shouldn't we?
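
To illustrate (again a sketch of mine, not a posted test case):

#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
	void *hint = (void *)(127UL << 40);	/* below the boundary */
	size_t len = 2UL << 40;			/* hint + len = 129TB */
	void *p = mmap(hint, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	/* generic code honours the hint; hash today disallows it and
	 * falls back to some other address below 128TB */
	printf("hint across 128TB: %p\n", p);
	return 0;
}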

Thanks,
Nick

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/5] VA allocator fixes
  2017-11-06 10:03 [PATCH 0/5] VA allocator fixes Nicholas Piggin
                   ` (4 preceding siblings ...)
  2017-11-06 10:03 ` [PATCH 5/5] powerpc/64s: mm_context.addr_limit is only used on hash Nicholas Piggin
@ 2017-11-06 15:16 ` Florian Weimer
  2017-11-07  0:06   ` Nicholas Piggin
  5 siblings, 1 reply; 22+ messages in thread
From: Florian Weimer @ 2017-11-06 15:16 UTC (permalink / raw)
  To: Nicholas Piggin, linuxppc-dev; +Cc: Aneesh Kumar K . V

On 11/06/2017 11:03 AM, Nicholas Piggin wrote:
> Florian found a nasty corner case with the VA allocation logic
> for crossing from 128TB to 512TB limit on hash, and made a
> really superb report of the problem -- traces, reproducer recipes,
> analysis, etc. which already mostly solved it.
> 
> The first patch in the series should solve Florian's particular
> case, the next 3 are other issues with addr_limit. The last
> patch is technically a cleanup but I think it's fairly important
> in terms of understanding the code and also enabling some BUG
> checks (when addr_limit == 0).
> 
> I have not tested these exactly on Florian's test case, but
> some tests of my own behave better afterwards. Hopefully he has
> time to re-test. Some careful review would be welcome too.

I think I have applied the five patches you posted, but I still get a 
brk value above 128 TiB:

# /lib64/ld64.so.1 ./a.out
initial brk value: 0x7fffde960000
probing at 0x80000001fffc

I assumed you wanted to reject those?
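
(For illustration, a minimal sketch of such a probe; a reconstruction,
not the original a.out, and the brk target is an assumption chosen to
match the probe address above:)

#include <stdio.h>
#include <unistd.h>

int main(void)
{
	/* print the initial break, as in the output above */
	printf("initial brk value: %p\n", sbrk(0));

	/* extend the heap just past 128 TiB and touch the new memory */
	if (brk((void *)0x800000020000UL) == 0) {
		unsigned int *probe = (unsigned int *)0x80000001fffcUL;
		printf("probing at %p\n", (void *)probe);
		*probe = 1;	/* faults here on the affected kernels */
	}
	return 0;
}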

In either case, I recommend tweaking the VM layout so that ld.so does 
not land close to the 128 TiB limit, so that the brk failure or 
returning of 48-bit addresses is avoided.

Thanks,
Florian

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/5] VA allocator fixes
  2017-11-06 15:16 ` [PATCH 0/5] VA allocator fixes Florian Weimer
@ 2017-11-07  0:06   ` Nicholas Piggin
  2017-11-07  1:59     ` Aneesh Kumar K.V
  0 siblings, 1 reply; 22+ messages in thread
From: Nicholas Piggin @ 2017-11-07  0:06 UTC (permalink / raw)
  To: Florian Weimer; +Cc: linuxppc-dev, Aneesh Kumar K . V, Kirill A . Shutemov

On Mon, 6 Nov 2017 16:16:07 +0100
Florian Weimer <fweimer@redhat.com> wrote:

> On 11/06/2017 11:03 AM, Nicholas Piggin wrote:
> > Florian found a nasty corner case with the VA allocation logic
> > for crossing from 128TB to 512TB limit on hash, and made a
> > really superb report of the problem -- traces, reproducer recipes,
> > analysis, etc. which already mostly solved it.
> > 
> > The first patch in the series should solve Florian's particular
> > case, the next 3 are other issues with addr_limit. The last
> > patch is technically a cleanup but I think it's fairly important
> > in terms of understanding the code and also enabling some BUG
> > checks (when addr_limit == 0).
> > 
> > I have not tested these exactly on Florian's test case, but
> > some tests of my own behave better afterwards. Hopefully he has
> > time to re-test. Some careful review would be welcome too.  
> 
> I think I have applied the five patches you posted, but I still get a 
> brk value above 128 TiB:
> 
> # /lib64/ld64.so.1 ./a.out
> initial brk value: 0x7fffde960000
> probing at 0x80000001fffc
> 
> I assumed you wanted to reject those?

It was difficult to understand what the intended semantics are, but I
think brk should succeed (it is implemented with MAP_FIXED). Of course
it should not succeed and then segfault when you try to access it.

> 
> In either case, I recommend tweaking the VM layout so that ld.so does 
> not land close to the 128 TiB limit, so that the brk failure or 
> returning of 48-bit addresses is avoided.

Yeah well that's yet another issue. I was not really involved with the
address space extension work. Aneesh, Kirill, was the intention for the
128T->512T extension logic to be a no-op for all address space allocation
except those with explicit addresses?

Thanks,
Nick

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/5] VA allocator fixes
  2017-11-07  0:06   ` Nicholas Piggin
@ 2017-11-07  1:59     ` Aneesh Kumar K.V
  0 siblings, 0 replies; 22+ messages in thread
From: Aneesh Kumar K.V @ 2017-11-07  1:59 UTC (permalink / raw)
  To: Nicholas Piggin, Florian Weimer; +Cc: linuxppc-dev, Kirill A . Shutemov



On 11/07/2017 05:36 AM, Nicholas Piggin wrote:
> On Mon, 6 Nov 2017 16:16:07 +0100
> Florian Weimer <fweimer@redhat.com> wrote:
> 
>> On 11/06/2017 11:03 AM, Nicholas Piggin wrote:
>>> Florian found a nasty corner case with the VA allocation logic
>>> for crossing from 128TB to 512TB limit on hash, and made a
>>> really superb report of the problem -- traces, reproducer recipes,
>>> analysis, etc. which already mostly solved it.
>>>
>>> The first patch in the series should solve Florian's particular
>>> case, the next 3 are other issues with addr_limit. The last
>>> patch is technically a cleanup but I think it's fairly important
>>> in terms of understanding the code and also enabling some BUG
>>> checks (when addr_limit == 0).
>>>
>>> I have not tested these exactly on Florian's test case, but
>>> some tests of my own behave better afterwards. Hopefully he has
>>> time to re-test. Some careful review would be welcome too.
>>
>> I think I have applied the five patches you posted, but I still get a
>> brk value above 128 TiB:
>>
>> # /lib64/ld64.so.1 ./a.out
>> initial brk value: 0x7fffde960000
>> probing at 0x80000001fffc
>>
>> I assumed you wanted to reject those?
> 
> It was difficult to understand what the intended semantics are, but I
> think brk should succeed (it is implemented with MAP_FIXED). Of course
> it should not succeed and then segfault when you try to access it.
> 
>>
>> In either case, I recommend tweaking the VM layout so that ld.so does
>> not land close to the 128 TiB limit, so that the brk failure or
>> returning of 48-bit addresses is avoided.
> 
> Yeah well that's yet another issue. I was not really involved with the
> address space extension work. Aneesh, Kirill, was the intention for the
> 128T->512T extension logic to be a no-op for all address space allocation
> except those with explicit addresses?
> 

yes.

-aneesh

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 1/5] powerpc/64s/hash: Fix 128TB-512TB virtual address boundary case allocation
  2017-11-06 11:05       ` Aneesh Kumar K.V
  2017-11-06 11:21         ` Nicholas Piggin
@ 2017-11-07  2:00         ` Aneesh Kumar K.V
  2017-11-07  2:03           ` Nicholas Piggin
  1 sibling, 1 reply; 22+ messages in thread
From: Aneesh Kumar K.V @ 2017-11-07  2:00 UTC (permalink / raw)
  To: Florian Weimer; +Cc: Nicholas Piggin, linuxppc-dev



On 11/06/2017 04:35 PM, Aneesh Kumar K.V wrote:
> 
> 
> On 11/06/2017 04:24 PM, Nicholas Piggin wrote:
>> On Mon, 06 Nov 2017 16:08:06 +0530
>> "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> wrote:
>>
>>> Nicholas Piggin <npiggin@gmail.com> writes:
>>>
>>>> When allocating VA space with a hint that crosses 128TB, the SLB 
>>>> addr_limit
>>>> variable is not expanded if addr is not > 128TB, but the slice 
>>>> allocation
>>>> looks at task_size, which is 512TB. This results in slice_check_fit()
>>>> incorrectly succeeding because the slice_count truncates off bit 128 
>>>> of the
>>>> requested mask, so the comparison to the available mask succeeds.
>>>
>>>
>>> But then the mask passed to slice_check_fit() is generated using
>>> context.addr_limit as max value. So how did that return success? ie,
>>> we get the request mask via
>>>
>>> slice_range_to_mask(addr, len, &mask);
>>>
>>> And the potential/possible mask using
>>>
>>> slice_mask_for_size(mm, psize, &good_mask);
>>>
>>> So how did slice_check_fit() return success with
>>>
>>> slice_check_fit(mm, mask, good_mask);
>>
>> Because the addr_limit check is used to *limit* the comparison.
>>
>> The available mask had bits up to 127 set, and the mask had bits 127
>> and 128 set. However the 128T addr_limit causes only bits 0-127 to be
>> compared.
>>
> 
> Should we fix it then via the change below? I haven't tested this yet.
> Also, doesn't this result in us comparing more bits?
> 
> modified   arch/powerpc/mm/slice.c
> @@ -169,13 +169,12 @@ static int slice_check_fit(struct mm_struct *mm,
>  		struct slice_mask mask, struct slice_mask available)
>  {
>  	DECLARE_BITMAP(result, SLICE_NUM_HIGH);
> -	unsigned long slice_count = GET_HIGH_SLICE_INDEX(mm->context.addr_limit);
> 
>  	bitmap_and(result, mask.high_slices,
> -		   available.high_slices, slice_count);
> +		   available.high_slices, SLICE_NUM_HIGH);
> 
>  	return (mask.low_slices & available.low_slices) == mask.low_slices &&
> -	       bitmap_equal(result, mask.high_slices, slice_count);
> +	       bitmap_equal(result, mask.high_slices, SLICE_NUM_HIGH);
> 
> 

Florian, will you be able to test this patch? We may not really want to 
push this, but it will confirm that we end up getting a >128TB address 
because of this.

-aneesh

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 1/5] powerpc/64s/hash: Fix 128TB-512TB virtual address boundary case allocation
  2017-11-07  2:00         ` Aneesh Kumar K.V
@ 2017-11-07  2:03           ` Nicholas Piggin
  0 siblings, 0 replies; 22+ messages in thread
From: Nicholas Piggin @ 2017-11-07  2:03 UTC (permalink / raw)
  To: Aneesh Kumar K.V; +Cc: Florian Weimer, linuxppc-dev

On Tue, 7 Nov 2017 07:30:51 +0530
"Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> wrote:

> On 11/06/2017 04:35 PM, Aneesh Kumar K.V wrote:
> > 
> > 
> > On 11/06/2017 04:24 PM, Nicholas Piggin wrote:  
> >> On Mon, 06 Nov 2017 16:08:06 +0530
> >> "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> wrote:
> >>  
> >>> Nicholas Piggin <npiggin@gmail.com> writes:
> >>>  
> >>>> When allocating VA space with a hint that crosses 128TB, the SLB 
> >>>> addr_limit
> >>>> variable is not expanded if addr is not > 128TB, but the slice 
> >>>> allocation
> >>>> looks at task_size, which is 512TB. This results in slice_check_fit()
> >>>> incorrectly succeeding because the slice_count truncates off bit 128 
> >>>> of the
> >>>> requested mask, so the comparison to the available mask succeeds.  
> >>>
> >>>
> >>> But then the mask passed to slice_check_fit() is generated using
> >>> context.addr_limit as max value. So how did that return success? ie,
> >>> we get the request mask via
> >>>
> >>> slice_range_to_mask(addr, len, &mask);
> >>>
> >>> And the potential/possible mask using
> >>>
> >>> slice_mask_for_size(mm, psize, &good_mask);
> >>>
> >>> So how did slice_check_fit() return success with  
> >>>
> >>> slice_check_fit(mm, mask, good_mask);  
> >>
> >> Because the addr_limit check is used to *limit* the comparison.
> >>
> >> The available mask had bits up to 127 set, and the mask had bits 127
> >> and 128 set. However the 128T addr_limit causes only bits 0-127 to be
> >> compared.
> >>  
> > 
> > Should we fix it then via the change below? I haven't tested this yet.
> > Also, doesn't this result in us comparing more bits?
> > 
> > modified   arch/powerpc/mm/slice.c
> > @@ -169,13 +169,12 @@ static int slice_check_fit(struct mm_struct *mm,
> >  		struct slice_mask mask, struct slice_mask available)
> >  {
> >  	DECLARE_BITMAP(result, SLICE_NUM_HIGH);
> > -	unsigned long slice_count = GET_HIGH_SLICE_INDEX(mm->context.addr_limit);
> > 
> >  	bitmap_and(result, mask.high_slices,
> > -		   available.high_slices, slice_count);
> > +		   available.high_slices, SLICE_NUM_HIGH);
> > 
> >  	return (mask.low_slices & available.low_slices) == mask.low_slices &&
> > -	       bitmap_equal(result, mask.high_slices, slice_count);
> > +	       bitmap_equal(result, mask.high_slices, SLICE_NUM_HIGH);
> > 
> >   
> 
> Florian, will you be able to test this patch? We may not really want to 
> push this, but it will confirm that we end up getting a >128TB address 
> because of this.

Oh we are, I went through and traced it, and this is the reason hash's
get_unmapped_area gives out > 128TB addresses if addr + len crosses
128TB.
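
To make the mechanics concrete, a stand-alone sketch (my illustration;
it assumes 512 1TB high slices with the 128TB boundary at bit 128):

#include <stdio.h>

#define SLICE_NUM_HIGH	512

static int test_bit(const unsigned char *map, int bit)
{
	return map[bit / 8] >> (bit % 8) & 1;
}

static void set_bit(unsigned char *map, int bit)
{
	map[bit / 8] |= 1 << (bit % 8);
}

int main(void)
{
	unsigned char available[SLICE_NUM_HIGH / 8] = { 0 };
	unsigned char mask[SLICE_NUM_HIGH / 8] = { 0 };
	int i, fits = 1;

	for (i = 0; i < 128; i++)	/* available: slices 0..127 */
		set_bit(available, i);
	set_bit(mask, 127);		/* request crosses the boundary, */
	set_bit(mask, 128);		/* asking for slices 127 and 128 */

	/* a 128TB addr_limit makes slice_count 128, so only bits
	 * 0..127 are compared; bit 128 is silently ignored */
	for (i = 0; i < 128; i++)
		if (test_bit(mask, i) && !test_bit(available, i))
			fits = 0;

	printf("truncated check fits: %d\n", fits);	/* prints 1 */
	return 0;
}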

Thanks,
Nick

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 2/5] powerpc/64s/hash: Allow MAP_FIXED allocations to cross 128TB boundary
  2017-11-06 11:55     ` Nicholas Piggin
@ 2017-11-07  2:28       ` Michael Ellerman
  2017-11-07  2:52         ` Nicholas Piggin
  0 siblings, 1 reply; 22+ messages in thread
From: Michael Ellerman @ 2017-11-07  2:28 UTC (permalink / raw)
  To: Nicholas Piggin, Aneesh Kumar K.V
  Cc: Florian Weimer, linuxppc-dev, Kirill A. Shutemov

Nicholas Piggin <npiggin@gmail.com> writes:

> On Mon, 06 Nov 2017 16:14:10 +0530
> "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> wrote:
>
>> Nicholas Piggin <npiggin@gmail.com> writes:
>> 
>> > While mapping hints whose length crosses 128TB are disallowed,
>> > MAP_FIXED allocations that cross 128TB are allowed. These are failing
>> > on hash (on radix they succeed). Add an additional case for fixed
>> > mappings to expand the addr_limit when crossing 128TB.  
>> 
>> Shouldn't that be fixed in radix? But I see x86 also doing this?
>> 
>> 
>> 	if (flags & MAP_FIXED)
>> 		return addr;
>> 
>> Kirill,
>> 
>> Is that expected?
>
> I should actually reply to this one because the other did not
> have Kirill on cc.
>
> Generic mapping code appears to always succeed when given an
> explicit hint request, even if the address is below the boundary
> and address + length is above it, even when !MAP_FIXED. This is the
> sane behaviour AFAIKS.

It's "sane" if you want the 128T boundary to be invisible.

But the whole goddamn point was that we wanted apps to have to opt-in to
using > 128T, and having brk accidentally go over the 128T boundary does
not count as opting in.

So it's a mess as usual.

> So we should switch powerpc to match, shouldn't we?

Yes we should do whatever the other arches do. 

Actually we should do what the man page describes... except it doesn't.

cheers

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 2/5] powerpc/64s/hash: Allow MAP_FIXED allocations to cross 128TB boundary
  2017-11-07  2:28       ` Michael Ellerman
@ 2017-11-07  2:52         ` Nicholas Piggin
  0 siblings, 0 replies; 22+ messages in thread
From: Nicholas Piggin @ 2017-11-07  2:52 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: Aneesh Kumar K.V, Florian Weimer, linuxppc-dev, Kirill A. Shutemov

On Tue, 07 Nov 2017 13:28:08 +1100
Michael Ellerman <mpe@ellerman.id.au> wrote:

> Nicholas Piggin <npiggin@gmail.com> writes:
> 
> > On Mon, 06 Nov 2017 16:14:10 +0530
> > "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> wrote:
> >  
> >> Nicholas Piggin <npiggin@gmail.com> writes:
> >>   
> >> > While mapping hints whose length crosses 128TB are disallowed,
> >> > MAP_FIXED allocations that cross 128TB are allowed. These are failing
> >> > on hash (on radix they succeed). Add an additional case for fixed
> >> > mappings to expand the addr_limit when crossing 128TB.    
> >> 
> >> Shouldn't that be fixed in radix? But I see x86 also doing this?
> >> 
> >> 
> >> 	if (flags & MAP_FIXED)
> >> 		return addr;
> >> 
> >> Kirill,
> >> 
> >> Is that expected?  
> >
> > I should actually reply to this one because the other did not
> > have Kirill on cc.
> >
> > Generic mapping code appears to always succeed when given an
> > explicit hint request, even if the address is below the boundary
> > and address + length is above it, even when !MAP_FIXED. This is the
> > sane behaviour AFAIKS.  
> 
> It's "sane" if you want the 128T boundary to be invisible.

Not invisible: mmap(NULL, length, ...) still returns an address where
addr + length < 128TB

> But the whole goddamn point was that we wanted apps to have to opt-in to
> using > 128T, and having brk accidentally go over the 128T boundary does
> not count as opting in.

If brk() is given an explicit address > 128TB, then that's opting
in, right?

If malloc is doing some allocation scheme which "opts in", such as
using explicit brk or mmap, then there is no way for the kernel to
solve that. Making increasingly convoluted policy to try to allow
the good and reject the bad is not right IMO.
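
To sketch what I mean by opting in (my own example; the 192TB and
130TB targets are arbitrary picks above the window):

#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	/* an mmap hint above the 128TB window is the opt-in that
	 * expands the address limit to 512TB */
	void *p = mmap((void *)(192UL << 40), 1UL << 20,
		       PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	printf("hint opt-in: %p\n", p);

	/* by the same logic, brk() to an explicit >128TB target
	 * should count as opting in too */
	if (brk((void *)(130UL << 40)) == 0)
		printf("brk opt-in, new break: %p\n", sbrk(0));
	return 0;
}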

> So it's a mess as usual.

If we wanted to make this ABI-identical with existing systems, then
it can't work without out-of-band opt-in. Why wasn't a personality
+ sysctl used for this, as we apparently have for a previous address
space layout change?

> 
> > So we should switch powerpc to match, shouldn't we?  
> 
> Yes we should do whatever the other arches do. 
> 
> Actually we should do what the man page describes .. except it doesn't.

^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread

Thread overview: 22+ messages
2017-11-06 10:03 [PATCH 0/5] VA allocator fixes Nicholas Piggin
2017-11-06 10:03 ` [PATCH 1/5] powerpc/64s/hash: Fix 128TB-512TB virtual address boundary case allocation Nicholas Piggin
2017-11-06 10:38   ` Aneesh Kumar K.V
2017-11-06 10:54     ` Nicholas Piggin
2017-11-06 11:05       ` Aneesh Kumar K.V
2017-11-06 11:21         ` Nicholas Piggin
2017-11-07  2:00         ` Aneesh Kumar K.V
2017-11-07  2:03           ` Nicholas Piggin
2017-11-06 10:03 ` [PATCH 2/5] powerpc/64s/hash: Allow MAP_FIXED allocations to cross 128TB boundary Nicholas Piggin
2017-11-06 10:44   ` Aneesh Kumar K.V
2017-11-06 11:55     ` Nicholas Piggin
2017-11-07  2:28       ` Michael Ellerman
2017-11-07  2:52         ` Nicholas Piggin
2017-11-06 10:03 ` [PATCH 3/5] powerpc/64s/hash: Fix fork() with 512TB process address space Nicholas Piggin
2017-11-06 10:44   ` Aneesh Kumar K.V
2017-11-06 10:03 ` [PATCH 4/5] powerpc/64s/radix: Fix 128TB-512TB virtual address boundary case allocation Nicholas Piggin
2017-11-06 11:14   ` Aneesh Kumar K.V
2017-11-06 11:42     ` Nicholas Piggin
2017-11-06 10:03 ` [PATCH 5/5] powerpc/64s: mm_context.addr_limit is only used on hash Nicholas Piggin
2017-11-06 15:16 ` [PATCH 0/5] VA allocator fixes Florian Weimer
2017-11-07  0:06   ` Nicholas Piggin
2017-11-07  1:59     ` Aneesh Kumar K.V
