All of lore.kernel.org
 help / color / mirror / Atom feed
* POWER: Unexpected fault when writing to brk-allocated memory
@ 2017-11-03 17:05 ` Florian Weimer
  0 siblings, 0 replies; 71+ messages in thread
From: Florian Weimer @ 2017-11-03 17:05 UTC (permalink / raw)
  To: linuxppc-dev, linux-mm

We are seeing an issue on ppc64le and ppc64 (and perhaps on some arm 
variant, but I have not seen it on our own builders) where running 
localedef as part of the glibc build crashes with a segmentation fault.

Kernel version is 4.13.9 (Fedora 26 variant).

I have only seen this with an explicit loader invocation, like this:

while I18NPATH=. /lib64/ld64.so.1 /usr/bin/localedef 
--alias-file=../intl/locale.alias --no-archive -i locales/nl_AW -c -f 
charmaps/UTF-8 
--prefix=/builddir/build/BUILDROOT/glibc-2.26-16.fc27.ppc64 nl_AW ; do : 
; done

To be run in the localedata subdirectory of a glibc *source* tree, after 
a build.  You may have to create the 
/builddir/build/BUILDROOT/glibc-2.26-16.fc27.ppc64/usr/lib/locale 
directory.  I have only reproduced this inside a Fedora 27 chroot on a 
Fedora 26 host, but there it does not matter if you run the old (chroot) 
or newly built binary.

I filed this as a glibc bug for tracking:

   https://sourceware.org/bugzilla/show_bug.cgi?id=22390

There's an strace log and a coredump from the crash.

I think the data shows that the address in question should be writable.

The crossed 0x0000800000000000 binary is very suggestive.  I think that 
based on the operation of glibc's malloc, this write would be the first 
time this happens during the lifetime of the process.

Does that ring any bells?  Is there anything I can do to provide more 
data?  The host is an LPAR with a stock Fedora 26 kernel, so I can use 
any diagnostics tool which is provided by Fedora.

I can try to come up with a better reproducer, but that appears to be 
difficult.

Thanks,
Florian

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 71+ messages in thread

* POWER: Unexpected fault when writing to brk-allocated memory
@ 2017-11-03 17:05 ` Florian Weimer
  0 siblings, 0 replies; 71+ messages in thread
From: Florian Weimer @ 2017-11-03 17:05 UTC (permalink / raw)
  To: linuxppc-dev, linux-mm

We are seeing an issue on ppc64le and ppc64 (and perhaps on some arm 
variant, but I have not seen it on our own builders) where running 
localedef as part of the glibc build crashes with a segmentation fault.

Kernel version is 4.13.9 (Fedora 26 variant).

I have only seen this with an explicit loader invocation, like this:

while I18NPATH=. /lib64/ld64.so.1 /usr/bin/localedef 
--alias-file=../intl/locale.alias --no-archive -i locales/nl_AW -c -f 
charmaps/UTF-8 
--prefix=/builddir/build/BUILDROOT/glibc-2.26-16.fc27.ppc64 nl_AW ; do : 
; done

To be run in the localedata subdirectory of a glibc *source* tree, after 
a build.  You may have to create the 
/builddir/build/BUILDROOT/glibc-2.26-16.fc27.ppc64/usr/lib/locale 
directory.  I have only reproduced this inside a Fedora 27 chroot on a 
Fedora 26 host, but there it does not matter if you run the old (chroot) 
or newly built binary.

I filed this as a glibc bug for tracking:

   https://sourceware.org/bugzilla/show_bug.cgi?id=22390

There's an strace log and a coredump from the crash.

I think the data shows that the address in question should be writable.

The crossed 0x0000800000000000 binary is very suggestive.  I think that 
based on the operation of glibc's malloc, this write would be the first 
time this happens during the lifetime of the process.

Does that ring any bells?  Is there anything I can do to provide more 
data?  The host is an LPAR with a stock Fedora 26 kernel, so I can use 
any diagnostics tool which is provided by Fedora.

I can try to come up with a better reproducer, but that appears to be 
difficult.

Thanks,
Florian

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: POWER: Unexpected fault when writing to brk-allocated memory
  2017-11-03 17:05 ` Florian Weimer
@ 2017-11-05 12:18   ` Nicholas Piggin
  -1 siblings, 0 replies; 71+ messages in thread
From: Nicholas Piggin @ 2017-11-05 12:18 UTC (permalink / raw)
  To: Florian Weimer, Aneesh Kumar K . V; +Cc: linuxppc-dev, linux-mm

On Fri, 3 Nov 2017 18:05:20 +0100
Florian Weimer <fweimer@redhat.com> wrote:

> We are seeing an issue on ppc64le and ppc64 (and perhaps on some arm 
> variant, but I have not seen it on our own builders) where running 
> localedef as part of the glibc build crashes with a segmentation fault.
> 
> Kernel version is 4.13.9 (Fedora 26 variant).
> 
> I have only seen this with an explicit loader invocation, like this:
> 
> while I18NPATH=. /lib64/ld64.so.1 /usr/bin/localedef 
> --alias-file=../intl/locale.alias --no-archive -i locales/nl_AW -c -f 
> charmaps/UTF-8 
> --prefix=/builddir/build/BUILDROOT/glibc-2.26-16.fc27.ppc64 nl_AW ; do : 
> ; done
> 
> To be run in the localedata subdirectory of a glibc *source* tree, after 
> a build.  You may have to create the 
> /builddir/build/BUILDROOT/glibc-2.26-16.fc27.ppc64/usr/lib/locale 
> directory.  I have only reproduced this inside a Fedora 27 chroot on a 
> Fedora 26 host, but there it does not matter if you run the old (chroot) 
> or newly built binary.
> 
> I filed this as a glibc bug for tracking:
> 
>    https://sourceware.org/bugzilla/show_bug.cgi?id=22390
> 
> There's an strace log and a coredump from the crash.
> 
> I think the data shows that the address in question should be writable.
> 
> The crossed 0x0000800000000000 binary is very suggestive.  I think that 
> based on the operation of glibc's malloc, this write would be the first 
> time this happens during the lifetime of the process.
> 
> Does that ring any bells?  Is there anything I can do to provide more 
> data?  The host is an LPAR with a stock Fedora 26 kernel, so I can use 
> any diagnostics tool which is provided by Fedora.

There was a recent change to move to 128TB address space by default,
and option for 512TB addresses if explicitly requested.

Your brk request asked for > 128TB which the kernel gave it, but the
address limit in the paca that the SLB miss tests against was not
updated to reflect the switch to 512TB address space.

Why is your brk starting so high? Are you trying to test the > 128TB
case, or maybe something is confused by the 64->128TB change? What's
the strace look like if you run on a distro or <= 4.10 kernel?

Something like the following patch may help if you could test.

Thanks,
Nick

---
 arch/powerpc/mm/hugetlbpage-radix.c    | 18 ++++++++++++++----
 arch/powerpc/mm/mmap.c                 | 34 +++++++++++++++++++++++++---------
 arch/powerpc/mm/mmu_context_book3s64.c | 14 +++++++-------
 arch/powerpc/mm/slice.c                | 28 +++++++++++++++++++---------
 4 files changed, 65 insertions(+), 29 deletions(-)

diff --git a/arch/powerpc/mm/hugetlbpage-radix.c b/arch/powerpc/mm/hugetlbpage-radix.c
index a12e86395025..44e1109765b5 100644
--- a/arch/powerpc/mm/hugetlbpage-radix.c
+++ b/arch/powerpc/mm/hugetlbpage-radix.c
@@ -50,8 +50,16 @@ radix__hugetlb_get_unmapped_area(struct file *file, unsigned long addr,
 	struct hstate *h = hstate_file(file);
 	struct vm_unmapped_area_info info;
 
-	if (unlikely(addr > mm->context.addr_limit && addr < TASK_SIZE))
-		mm->context.addr_limit = TASK_SIZE;
+	/*
+	 * If address is specified explicitly and crosses addr_limit, or if
+	 * address is unspecified but len is greater than addr_limit, then
+	 * expand out to TASK_SIZE.
+	 */
+	if (unlikely(addr + len >= mm->context.addr_limit)) {
+		if ((!addr || addr + len > mm->context.addr_limit) &&
+				mm->context.addr_limit != TASK_SIZE)
+			mm->context.addr_limit = TASK_SIZE;
+	}
 
 	if (len & ~huge_page_mask(h))
 		return -EINVAL;
@@ -82,8 +90,10 @@ radix__hugetlb_get_unmapped_area(struct file *file, unsigned long addr,
 	info.align_mask = PAGE_MASK & ~huge_page_mask(h);
 	info.align_offset = 0;
 
-	if (addr > DEFAULT_MAP_WINDOW)
-		info.high_limit += mm->context.addr_limit - DEFAULT_MAP_WINDOW;
+	if (addr + len >= DEFAULT_MAP_WINDOW) {
+		if (!addr || addr + len > DEFAULT_MAP_WINDOW)
+			info.high_limit += mm->context.addr_limit - DEFAULT_MAP_WINDOW;
+	}
 
 	return vm_unmapped_area(&info);
 }
diff --git a/arch/powerpc/mm/mmap.c b/arch/powerpc/mm/mmap.c
index 5d78b193fec4..a8fe1eaf1d96 100644
--- a/arch/powerpc/mm/mmap.c
+++ b/arch/powerpc/mm/mmap.c
@@ -108,9 +108,16 @@ radix__arch_get_unmapped_area(struct file *filp, unsigned long addr,
 	struct vm_area_struct *vma;
 	struct vm_unmapped_area_info info;
 
-	if (unlikely(addr > mm->context.addr_limit &&
-		     mm->context.addr_limit != TASK_SIZE))
-		mm->context.addr_limit = TASK_SIZE;
+	/*
+	 * If address is specified explicitly and crosses addr_limit, or if
+	 * address is unspecified but len is greater than addr_limit, then
+	 * expand out to TASK_SIZE.
+	 */
+	if (unlikely(addr + len >= mm->context.addr_limit)) {
+		if ((!addr || addr + len > mm->context.addr_limit) &&
+				mm->context.addr_limit != TASK_SIZE)
+			mm->context.addr_limit = TASK_SIZE;
+	}
 
 	if (len > mm->task_size - mmap_min_addr)
 		return -ENOMEM;
@@ -131,7 +138,7 @@ radix__arch_get_unmapped_area(struct file *filp, unsigned long addr,
 	info.low_limit = mm->mmap_base;
 	info.align_mask = 0;
 
-	if (unlikely(addr > DEFAULT_MAP_WINDOW))
+	if (unlikely(addr + len > DEFAULT_MAP_WINDOW))
 		info.high_limit = mm->context.addr_limit;
 	else
 		info.high_limit = DEFAULT_MAP_WINDOW;
@@ -151,9 +158,16 @@ radix__arch_get_unmapped_area_topdown(struct file *filp,
 	unsigned long addr = addr0;
 	struct vm_unmapped_area_info info;
 
-	if (unlikely(addr > mm->context.addr_limit &&
-		     mm->context.addr_limit != TASK_SIZE))
-		mm->context.addr_limit = TASK_SIZE;
+	/*
+	 * If address is specified explicitly and crosses addr_limit, or if
+	 * address is unspecified but len is greater than addr_limit, then
+	 * expand out to TASK_SIZE.
+	 */
+	if (unlikely(addr + len >= mm->context.addr_limit)) {
+		if ((!addr || addr + len > mm->context.addr_limit) &&
+				mm->context.addr_limit != TASK_SIZE)
+			mm->context.addr_limit = TASK_SIZE;
+	}
 
 	/* requested length too big for entire address space */
 	if (len > mm->task_size - mmap_min_addr)
@@ -177,8 +191,10 @@ radix__arch_get_unmapped_area_topdown(struct file *filp,
 	info.high_limit = mm->mmap_base;
 	info.align_mask = 0;
 
-	if (addr > DEFAULT_MAP_WINDOW)
-		info.high_limit += mm->context.addr_limit - DEFAULT_MAP_WINDOW;
+	if (addr + len >= DEFAULT_MAP_WINDOW) {
+		if (!addr || addr + len > DEFAULT_MAP_WINDOW)
+			info.high_limit += mm->context.addr_limit - DEFAULT_MAP_WINDOW;
+	}
 
 	addr = vm_unmapped_area(&info);
 	if (!(addr & ~PAGE_MASK))
diff --git a/arch/powerpc/mm/mmu_context_book3s64.c b/arch/powerpc/mm/mmu_context_book3s64.c
index 05e15386d4cb..1116ea0ddb2e 100644
--- a/arch/powerpc/mm/mmu_context_book3s64.c
+++ b/arch/powerpc/mm/mmu_context_book3s64.c
@@ -92,13 +92,6 @@ static int hash__init_new_context(struct mm_struct *mm)
 	if (index < 0)
 		return index;
 
-	/*
-	 * We do switch_slb() early in fork, even before we setup the
-	 * mm->context.addr_limit. Default to max task size so that we copy the
-	 * default values to paca which will help us to handle slb miss early.
-	 */
-	mm->context.addr_limit = DEFAULT_MAP_WINDOW_USER64;
-
 	/*
 	 * The old code would re-promote on fork, we don't do that when using
 	 * slices as it could cause problem promoting slices that have been
@@ -162,6 +155,13 @@ int init_new_context(struct task_struct *tsk, struct mm_struct *mm)
 	if (index < 0)
 		return index;
 
+	/*
+	 * In the case of exec, use the default limit,
+	 * otherwise inherit it from the mm we are duplicating.
+	 */
+	if (!mm->context.addr_limit)
+		mm->context.addr_limit = DEFAULT_MAP_WINDOW_USER64;
+
 	mm->context.id = index;
 
 #ifdef CONFIG_PPC_64K_PAGES
diff --git a/arch/powerpc/mm/slice.c b/arch/powerpc/mm/slice.c
index 45f6740dd407..aa55523b6759 100644
--- a/arch/powerpc/mm/slice.c
+++ b/arch/powerpc/mm/slice.c
@@ -329,7 +329,7 @@ static unsigned long slice_find_area_topdown(struct mm_struct *mm,
 	 * Only for that request for which high_limit is above
 	 * DEFAULT_MAP_WINDOW we should apply this.
 	 */
-	if (high_limit  > DEFAULT_MAP_WINDOW)
+	if (high_limit > DEFAULT_MAP_WINDOW)
 		addr += mm->context.addr_limit - DEFAULT_MAP_WINDOW;
 
 	while (addr > PAGE_SIZE) {
@@ -418,19 +418,29 @@ unsigned long slice_get_unmapped_area(unsigned long addr, unsigned long len,
 
 	/*
 	 * Check if we need to expland slice area.
+	 *
+	 * If address is specified explicitly and crosses addr_limit, or if
+	 * address is unspecified but len is greater than addr_limit, then
+	 * expand out to TASK_SIZE.
 	 */
-	if (unlikely(addr > mm->context.addr_limit &&
-		     mm->context.addr_limit != TASK_SIZE)) {
-		mm->context.addr_limit = TASK_SIZE;
-		on_each_cpu(slice_flush_segments, mm, 1);
+	if (unlikely(addr + len >= mm->context.addr_limit)) {
+		if ((!addr || addr + len > mm->context.addr_limit) &&
+				mm->context.addr_limit != TASK_SIZE) {
+			mm->context.addr_limit = TASK_SIZE;
+			on_each_cpu(slice_flush_segments, mm, 1);
+		}
 	}
+
 	/*
 	 * This mmap request can allocate upt to 512TB
 	 */
-	if (addr > DEFAULT_MAP_WINDOW)
-		high_limit = mm->context.addr_limit;
-	else
-		high_limit = DEFAULT_MAP_WINDOW;
+	high_limit = DEFAULT_MAP_WINDOW;
+	if (addr + len >= DEFAULT_MAP_WINDOW) {
+		if (!addr || addr + len > DEFAULT_MAP_WINDOW)
+			high_limit = mm->context.addr_limit;
+	}
+
+
 	/*
 	 * init different masks
 	 */
-- 
2.15.0

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 71+ messages in thread

* Re: POWER: Unexpected fault when writing to brk-allocated memory
@ 2017-11-05 12:18   ` Nicholas Piggin
  0 siblings, 0 replies; 71+ messages in thread
From: Nicholas Piggin @ 2017-11-05 12:18 UTC (permalink / raw)
  To: Florian Weimer, Aneesh Kumar K . V; +Cc: linuxppc-dev, linux-mm

On Fri, 3 Nov 2017 18:05:20 +0100
Florian Weimer <fweimer@redhat.com> wrote:

> We are seeing an issue on ppc64le and ppc64 (and perhaps on some arm 
> variant, but I have not seen it on our own builders) where running 
> localedef as part of the glibc build crashes with a segmentation fault.
> 
> Kernel version is 4.13.9 (Fedora 26 variant).
> 
> I have only seen this with an explicit loader invocation, like this:
> 
> while I18NPATH=. /lib64/ld64.so.1 /usr/bin/localedef 
> --alias-file=../intl/locale.alias --no-archive -i locales/nl_AW -c -f 
> charmaps/UTF-8 
> --prefix=/builddir/build/BUILDROOT/glibc-2.26-16.fc27.ppc64 nl_AW ; do : 
> ; done
> 
> To be run in the localedata subdirectory of a glibc *source* tree, after 
> a build.  You may have to create the 
> /builddir/build/BUILDROOT/glibc-2.26-16.fc27.ppc64/usr/lib/locale 
> directory.  I have only reproduced this inside a Fedora 27 chroot on a 
> Fedora 26 host, but there it does not matter if you run the old (chroot) 
> or newly built binary.
> 
> I filed this as a glibc bug for tracking:
> 
>    https://sourceware.org/bugzilla/show_bug.cgi?id=22390
> 
> There's an strace log and a coredump from the crash.
> 
> I think the data shows that the address in question should be writable.
> 
> The crossed 0x0000800000000000 binary is very suggestive.  I think that 
> based on the operation of glibc's malloc, this write would be the first 
> time this happens during the lifetime of the process.
> 
> Does that ring any bells?  Is there anything I can do to provide more 
> data?  The host is an LPAR with a stock Fedora 26 kernel, so I can use 
> any diagnostics tool which is provided by Fedora.

There was a recent change to move to 128TB address space by default,
and option for 512TB addresses if explicitly requested.

Your brk request asked for > 128TB which the kernel gave it, but the
address limit in the paca that the SLB miss tests against was not
updated to reflect the switch to 512TB address space.

Why is your brk starting so high? Are you trying to test the > 128TB
case, or maybe something is confused by the 64->128TB change? What's
the strace look like if you run on a distro or <= 4.10 kernel?

Something like the following patch may help if you could test.

Thanks,
Nick

---
 arch/powerpc/mm/hugetlbpage-radix.c    | 18 ++++++++++++++----
 arch/powerpc/mm/mmap.c                 | 34 +++++++++++++++++++++++++---------
 arch/powerpc/mm/mmu_context_book3s64.c | 14 +++++++-------
 arch/powerpc/mm/slice.c                | 28 +++++++++++++++++++---------
 4 files changed, 65 insertions(+), 29 deletions(-)

diff --git a/arch/powerpc/mm/hugetlbpage-radix.c b/arch/powerpc/mm/hugetlbpage-radix.c
index a12e86395025..44e1109765b5 100644
--- a/arch/powerpc/mm/hugetlbpage-radix.c
+++ b/arch/powerpc/mm/hugetlbpage-radix.c
@@ -50,8 +50,16 @@ radix__hugetlb_get_unmapped_area(struct file *file, unsigned long addr,
 	struct hstate *h = hstate_file(file);
 	struct vm_unmapped_area_info info;
 
-	if (unlikely(addr > mm->context.addr_limit && addr < TASK_SIZE))
-		mm->context.addr_limit = TASK_SIZE;
+	/*
+	 * If address is specified explicitly and crosses addr_limit, or if
+	 * address is unspecified but len is greater than addr_limit, then
+	 * expand out to TASK_SIZE.
+	 */
+	if (unlikely(addr + len >= mm->context.addr_limit)) {
+		if ((!addr || addr + len > mm->context.addr_limit) &&
+				mm->context.addr_limit != TASK_SIZE)
+			mm->context.addr_limit = TASK_SIZE;
+	}
 
 	if (len & ~huge_page_mask(h))
 		return -EINVAL;
@@ -82,8 +90,10 @@ radix__hugetlb_get_unmapped_area(struct file *file, unsigned long addr,
 	info.align_mask = PAGE_MASK & ~huge_page_mask(h);
 	info.align_offset = 0;
 
-	if (addr > DEFAULT_MAP_WINDOW)
-		info.high_limit += mm->context.addr_limit - DEFAULT_MAP_WINDOW;
+	if (addr + len >= DEFAULT_MAP_WINDOW) {
+		if (!addr || addr + len > DEFAULT_MAP_WINDOW)
+			info.high_limit += mm->context.addr_limit - DEFAULT_MAP_WINDOW;
+	}
 
 	return vm_unmapped_area(&info);
 }
diff --git a/arch/powerpc/mm/mmap.c b/arch/powerpc/mm/mmap.c
index 5d78b193fec4..a8fe1eaf1d96 100644
--- a/arch/powerpc/mm/mmap.c
+++ b/arch/powerpc/mm/mmap.c
@@ -108,9 +108,16 @@ radix__arch_get_unmapped_area(struct file *filp, unsigned long addr,
 	struct vm_area_struct *vma;
 	struct vm_unmapped_area_info info;
 
-	if (unlikely(addr > mm->context.addr_limit &&
-		     mm->context.addr_limit != TASK_SIZE))
-		mm->context.addr_limit = TASK_SIZE;
+	/*
+	 * If address is specified explicitly and crosses addr_limit, or if
+	 * address is unspecified but len is greater than addr_limit, then
+	 * expand out to TASK_SIZE.
+	 */
+	if (unlikely(addr + len >= mm->context.addr_limit)) {
+		if ((!addr || addr + len > mm->context.addr_limit) &&
+				mm->context.addr_limit != TASK_SIZE)
+			mm->context.addr_limit = TASK_SIZE;
+	}
 
 	if (len > mm->task_size - mmap_min_addr)
 		return -ENOMEM;
@@ -131,7 +138,7 @@ radix__arch_get_unmapped_area(struct file *filp, unsigned long addr,
 	info.low_limit = mm->mmap_base;
 	info.align_mask = 0;
 
-	if (unlikely(addr > DEFAULT_MAP_WINDOW))
+	if (unlikely(addr + len > DEFAULT_MAP_WINDOW))
 		info.high_limit = mm->context.addr_limit;
 	else
 		info.high_limit = DEFAULT_MAP_WINDOW;
@@ -151,9 +158,16 @@ radix__arch_get_unmapped_area_topdown(struct file *filp,
 	unsigned long addr = addr0;
 	struct vm_unmapped_area_info info;
 
-	if (unlikely(addr > mm->context.addr_limit &&
-		     mm->context.addr_limit != TASK_SIZE))
-		mm->context.addr_limit = TASK_SIZE;
+	/*
+	 * If address is specified explicitly and crosses addr_limit, or if
+	 * address is unspecified but len is greater than addr_limit, then
+	 * expand out to TASK_SIZE.
+	 */
+	if (unlikely(addr + len >= mm->context.addr_limit)) {
+		if ((!addr || addr + len > mm->context.addr_limit) &&
+				mm->context.addr_limit != TASK_SIZE)
+			mm->context.addr_limit = TASK_SIZE;
+	}
 
 	/* requested length too big for entire address space */
 	if (len > mm->task_size - mmap_min_addr)
@@ -177,8 +191,10 @@ radix__arch_get_unmapped_area_topdown(struct file *filp,
 	info.high_limit = mm->mmap_base;
 	info.align_mask = 0;
 
-	if (addr > DEFAULT_MAP_WINDOW)
-		info.high_limit += mm->context.addr_limit - DEFAULT_MAP_WINDOW;
+	if (addr + len >= DEFAULT_MAP_WINDOW) {
+		if (!addr || addr + len > DEFAULT_MAP_WINDOW)
+			info.high_limit += mm->context.addr_limit - DEFAULT_MAP_WINDOW;
+	}
 
 	addr = vm_unmapped_area(&info);
 	if (!(addr & ~PAGE_MASK))
diff --git a/arch/powerpc/mm/mmu_context_book3s64.c b/arch/powerpc/mm/mmu_context_book3s64.c
index 05e15386d4cb..1116ea0ddb2e 100644
--- a/arch/powerpc/mm/mmu_context_book3s64.c
+++ b/arch/powerpc/mm/mmu_context_book3s64.c
@@ -92,13 +92,6 @@ static int hash__init_new_context(struct mm_struct *mm)
 	if (index < 0)
 		return index;
 
-	/*
-	 * We do switch_slb() early in fork, even before we setup the
-	 * mm->context.addr_limit. Default to max task size so that we copy the
-	 * default values to paca which will help us to handle slb miss early.
-	 */
-	mm->context.addr_limit = DEFAULT_MAP_WINDOW_USER64;
-
 	/*
 	 * The old code would re-promote on fork, we don't do that when using
 	 * slices as it could cause problem promoting slices that have been
@@ -162,6 +155,13 @@ int init_new_context(struct task_struct *tsk, struct mm_struct *mm)
 	if (index < 0)
 		return index;
 
+	/*
+	 * In the case of exec, use the default limit,
+	 * otherwise inherit it from the mm we are duplicating.
+	 */
+	if (!mm->context.addr_limit)
+		mm->context.addr_limit = DEFAULT_MAP_WINDOW_USER64;
+
 	mm->context.id = index;
 
 #ifdef CONFIG_PPC_64K_PAGES
diff --git a/arch/powerpc/mm/slice.c b/arch/powerpc/mm/slice.c
index 45f6740dd407..aa55523b6759 100644
--- a/arch/powerpc/mm/slice.c
+++ b/arch/powerpc/mm/slice.c
@@ -329,7 +329,7 @@ static unsigned long slice_find_area_topdown(struct mm_struct *mm,
 	 * Only for that request for which high_limit is above
 	 * DEFAULT_MAP_WINDOW we should apply this.
 	 */
-	if (high_limit  > DEFAULT_MAP_WINDOW)
+	if (high_limit > DEFAULT_MAP_WINDOW)
 		addr += mm->context.addr_limit - DEFAULT_MAP_WINDOW;
 
 	while (addr > PAGE_SIZE) {
@@ -418,19 +418,29 @@ unsigned long slice_get_unmapped_area(unsigned long addr, unsigned long len,
 
 	/*
 	 * Check if we need to expland slice area.
+	 *
+	 * If address is specified explicitly and crosses addr_limit, or if
+	 * address is unspecified but len is greater than addr_limit, then
+	 * expand out to TASK_SIZE.
 	 */
-	if (unlikely(addr > mm->context.addr_limit &&
-		     mm->context.addr_limit != TASK_SIZE)) {
-		mm->context.addr_limit = TASK_SIZE;
-		on_each_cpu(slice_flush_segments, mm, 1);
+	if (unlikely(addr + len >= mm->context.addr_limit)) {
+		if ((!addr || addr + len > mm->context.addr_limit) &&
+				mm->context.addr_limit != TASK_SIZE) {
+			mm->context.addr_limit = TASK_SIZE;
+			on_each_cpu(slice_flush_segments, mm, 1);
+		}
 	}
+
 	/*
 	 * This mmap request can allocate upt to 512TB
 	 */
-	if (addr > DEFAULT_MAP_WINDOW)
-		high_limit = mm->context.addr_limit;
-	else
-		high_limit = DEFAULT_MAP_WINDOW;
+	high_limit = DEFAULT_MAP_WINDOW;
+	if (addr + len >= DEFAULT_MAP_WINDOW) {
+		if (!addr || addr + len > DEFAULT_MAP_WINDOW)
+			high_limit = mm->context.addr_limit;
+	}
+
+
 	/*
 	 * init different masks
 	 */
-- 
2.15.0

^ permalink raw reply related	[flat|nested] 71+ messages in thread

* Re: POWER: Unexpected fault when writing to brk-allocated memory
  2017-11-05 12:18   ` Nicholas Piggin
  (?)
@ 2017-11-05 12:35   ` Florian Weimer
  2017-11-05 12:54       ` Nicholas Piggin
  -1 siblings, 1 reply; 71+ messages in thread
From: Florian Weimer @ 2017-11-05 12:35 UTC (permalink / raw)
  To: Nicholas Piggin, Aneesh Kumar K . V; +Cc: linuxppc-dev, linux-mm

[-- Attachment #1: Type: text/plain, Size: 1336 bytes --]

On 11/05/2017 01:18 PM, Nicholas Piggin wrote:

> There was a recent change to move to 128TB address space by default,
> and option for 512TB addresses if explicitly requested.

Do you have a commit hash for the introduction of 128TB by default?  Thanks.

> Your brk request asked for > 128TB which the kernel gave it, but the
> address limit in the paca that the SLB miss tests against was not
> updated to reflect the switch to 512TB address space.
> 
> Why is your brk starting so high? Are you trying to test the > 128TB
> case, or maybe something is confused by the 64->128TB change? What's
> the strace look like if you run on a distro or <= 4.10 kernel?

I think it is a consequence of running with an explicit loader 
invocation.  With that, the heap is placed above ld.so, which can be 
quite high in the address space.

I'm attaching two runs of cat, one executing directly as /bin/cat, and 
one with /lib64/ld64.so.1 /bin/cat.

Fortunately, this does *not* apply to PIE binaries (also attached). 
However, explicit loader invocations are sometimes used in test suites 
(not just for glibc), and these sporadic test failures are quite annoying.

Do you still need the strace log?  And if yes, of what exactly?

> Something like the following patch may help if you could test.

Okay, this will take some time.

Thanks,
Florian

[-- Attachment #2: pie.txt --]
[-- Type: text/plain, Size: 1197 bytes --]

1231d0000-1231e0000 r-xp 00000000 fd:00 17852425                         /root/a.out
1231e0000-1231f0000 r--p 00000000 fd:00 17852425                         /root/a.out
1231f0000-123200000 rw-p 00010000 fd:00 17852425                         /root/a.out
1000dbc0000-1000dbf0000 rw-p 00000000 00:00 0                            [heap]
7fffa31d0000-7fffa3400000 r-xp 00000000 fd:00 25167936                   /usr/lib64/power8/libc-2.25.so
7fffa3400000-7fffa3410000 ---p 00230000 fd:00 25167936                   /usr/lib64/power8/libc-2.25.so
7fffa3410000-7fffa3420000 r--p 00230000 fd:00 25167936                   /usr/lib64/power8/libc-2.25.so
7fffa3420000-7fffa3430000 rw-p 00240000 fd:00 25167936                   /usr/lib64/power8/libc-2.25.so
7fffa3440000-7fffa3460000 r-xp 00000000 00:00 0                          [vdso]
7fffa3460000-7fffa34a0000 r-xp 00000000 fd:00 8390329                    /usr/lib64/ld-2.25.so
7fffa34a0000-7fffa34b0000 r--p 00030000 fd:00 8390329                    /usr/lib64/ld-2.25.so
7fffa34b0000-7fffa34c0000 rw-p 00040000 fd:00 8390329                    /usr/lib64/ld-2.25.so
7fffe9450000-7fffe9480000 rw-p 00000000 00:00 0                          [stack]

[-- Attachment #3: explicit-ldso.txt --]
[-- Type: text/plain, Size: 2712 bytes --]

7fff7e790000-7fff7e7d0000 rw-p 00000000 00:00 0 
7fff7e7d0000-7fff7e830000 r--p 00000000 fd:00 25167925                   /usr/lib/locale/en_US.utf8/LC_CTYPE
7fff7e830000-7fff7e840000 r--p 00000000 fd:00 25167928                   /usr/lib/locale/en_US.utf8/LC_NUMERIC
7fff7e840000-7fff7e850000 r--p 00000000 fd:00 16798929                   /usr/lib/locale/en_US.utf8/LC_TIME
7fff7e850000-7fff7e980000 r--p 00000000 fd:00 25167924                   /usr/lib/locale/en_US.utf8/LC_COLLATE
7fff7e980000-7fff7e990000 r--p 00000000 fd:00 16798927                   /usr/lib/locale/en_US.utf8/LC_MONETARY
7fff7e990000-7fff7e9a0000 r--p 00000000 fd:00 2511                       /usr/lib/locale/en_US.utf8/LC_MESSAGES/SYS_LC_MESSAGES
7fff7e9a0000-7fff7e9b0000 r--p 00000000 fd:00 16798942                   /usr/lib/locale/en_US.utf8/LC_PAPER
7fff7e9b0000-7fff7e9c0000 r--p 00000000 fd:00 25167927                   /usr/lib/locale/en_US.utf8/LC_NAME
7fff7e9c0000-7fff7e9d0000 r--p 00000000 fd:00 16798924                   /usr/lib/locale/en_US.utf8/LC_ADDRESS
7fff7e9d0000-7fff7e9e0000 r--p 00000000 fd:00 16798928                   /usr/lib/locale/en_US.utf8/LC_TELEPHONE
7fff7e9e0000-7fff7e9f0000 r--p 00000000 fd:00 16798926                   /usr/lib/locale/en_US.utf8/LC_MEASUREMENT
7fff7e9f0000-7fff7ea00000 r--s 00000000 fd:00 8390669                    /usr/lib64/gconv/gconv-modules.cache
7fff7ea00000-7fff7ec30000 r-xp 00000000 fd:00 25167936                   /usr/lib64/power8/libc-2.25.so
7fff7ec30000-7fff7ec40000 ---p 00230000 fd:00 25167936                   /usr/lib64/power8/libc-2.25.so
7fff7ec40000-7fff7ec50000 r--p 00230000 fd:00 25167936                   /usr/lib64/power8/libc-2.25.so
7fff7ec50000-7fff7ec60000 rw-p 00240000 fd:00 25167936                   /usr/lib64/power8/libc-2.25.so
7fff7ec60000-7fff7ec70000 r--p 00000000 fd:00 16798925                   /usr/lib/locale/en_US.utf8/LC_IDENTIFICATION
7fff7ec70000-7fff7ec80000 r-xp 00000000 fd:00 202293                     /usr/bin/cat
7fff7ec80000-7fff7ec90000 r--p 00000000 fd:00 202293                     /usr/bin/cat
7fff7ec90000-7fff7eca0000 rw-p 00010000 fd:00 202293                     /usr/bin/cat
7fff7eca0000-7fff7ecc0000 r-xp 00000000 00:00 0                          [vdso]
7fff7ecc0000-7fff7ed00000 r-xp 00000000 fd:00 8390329                    /usr/lib64/ld-2.25.so
7fff7ed00000-7fff7ed10000 r--p 00030000 fd:00 8390329                    /usr/lib64/ld-2.25.so
7fff7ed10000-7fff7ed20000 rw-p 00040000 fd:00 8390329                    /usr/lib64/ld-2.25.so
7fff9be20000-7fff9be50000 rw-p 00000000 00:00 0                          [heap]
7ffffd470000-7ffffd4a0000 rw-p 00000000 00:00 0                          [stack]

[-- Attachment #4: direct.txt --]
[-- Type: text/plain, Size: 2712 bytes --]

13eea0000-13eeb0000 r-xp 00000000 fd:00 202293                           /usr/bin/cat
13eeb0000-13eec0000 r--p 00000000 fd:00 202293                           /usr/bin/cat
13eec0000-13eed0000 rw-p 00010000 fd:00 202293                           /usr/bin/cat
1003ebe0000-1003ec10000 rw-p 00000000 00:00 0                            [heap]
7fff97490000-7fff974d0000 rw-p 00000000 00:00 0 
7fff974d0000-7fff97530000 r--p 00000000 fd:00 25167925                   /usr/lib/locale/en_US.utf8/LC_CTYPE
7fff97530000-7fff97540000 r--p 00000000 fd:00 25167928                   /usr/lib/locale/en_US.utf8/LC_NUMERIC
7fff97540000-7fff97550000 r--p 00000000 fd:00 16798929                   /usr/lib/locale/en_US.utf8/LC_TIME
7fff97550000-7fff97680000 r--p 00000000 fd:00 25167924                   /usr/lib/locale/en_US.utf8/LC_COLLATE
7fff97680000-7fff97690000 r--p 00000000 fd:00 16798927                   /usr/lib/locale/en_US.utf8/LC_MONETARY
7fff97690000-7fff976a0000 r--p 00000000 fd:00 2511                       /usr/lib/locale/en_US.utf8/LC_MESSAGES/SYS_LC_MESSAGES
7fff976a0000-7fff976b0000 r--p 00000000 fd:00 16798942                   /usr/lib/locale/en_US.utf8/LC_PAPER
7fff976b0000-7fff976c0000 r--p 00000000 fd:00 25167927                   /usr/lib/locale/en_US.utf8/LC_NAME
7fff976c0000-7fff976d0000 r--p 00000000 fd:00 16798924                   /usr/lib/locale/en_US.utf8/LC_ADDRESS
7fff976d0000-7fff976e0000 r--p 00000000 fd:00 16798928                   /usr/lib/locale/en_US.utf8/LC_TELEPHONE
7fff976e0000-7fff976f0000 r--p 00000000 fd:00 16798926                   /usr/lib/locale/en_US.utf8/LC_MEASUREMENT
7fff976f0000-7fff97700000 r--s 00000000 fd:00 8390669                    /usr/lib64/gconv/gconv-modules.cache
7fff97700000-7fff97930000 r-xp 00000000 fd:00 25167936                   /usr/lib64/power8/libc-2.25.so
7fff97930000-7fff97940000 ---p 00230000 fd:00 25167936                   /usr/lib64/power8/libc-2.25.so
7fff97940000-7fff97950000 r--p 00230000 fd:00 25167936                   /usr/lib64/power8/libc-2.25.so
7fff97950000-7fff97960000 rw-p 00240000 fd:00 25167936                   /usr/lib64/power8/libc-2.25.so
7fff97960000-7fff97970000 r--p 00000000 fd:00 16798925                   /usr/lib/locale/en_US.utf8/LC_IDENTIFICATION
7fff97970000-7fff97990000 r-xp 00000000 00:00 0                          [vdso]
7fff97990000-7fff979d0000 r-xp 00000000 fd:00 8390329                    /usr/lib64/ld-2.25.so
7fff979d0000-7fff979e0000 r--p 00030000 fd:00 8390329                    /usr/lib64/ld-2.25.so
7fff979e0000-7fff979f0000 rw-p 00040000 fd:00 8390329                    /usr/lib64/ld-2.25.so
7fffdd110000-7fffdd140000 rw-p 00000000 00:00 0                          [stack]

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: POWER: Unexpected fault when writing to brk-allocated memory
  2017-11-05 12:35   ` Florian Weimer
@ 2017-11-05 12:54       ` Nicholas Piggin
  0 siblings, 0 replies; 71+ messages in thread
From: Nicholas Piggin @ 2017-11-05 12:54 UTC (permalink / raw)
  To: Florian Weimer; +Cc: Aneesh Kumar K . V, linuxppc-dev, linux-mm

On Sun, 5 Nov 2017 13:35:40 +0100
Florian Weimer <fweimer@redhat.com> wrote:

> On 11/05/2017 01:18 PM, Nicholas Piggin wrote:
> 
> > There was a recent change to move to 128TB address space by default,
> > and option for 512TB addresses if explicitly requested.  
> 
> Do you have a commit hash for the introduction of 128TB by default?  Thanks.

I guess this one

f6eedbba7a26 ("powerpc/mm/hash: Increase VA range to 128TB")

> 
> > Your brk request asked for > 128TB which the kernel gave it, but the
> > address limit in the paca that the SLB miss tests against was not
> > updated to reflect the switch to 512TB address space.
> > 
> > Why is your brk starting so high? Are you trying to test the > 128TB
> > case, or maybe something is confused by the 64->128TB change? What's
> > the strace look like if you run on a distro or <= 4.10 kernel?  
> 
> I think it is a consequence of running with an explicit loader 
> invocation.  With that, the heap is placed above ld.so, which can be 
> quite high in the address space.
> 
> I'm attaching two runs of cat, one executing directly as /bin/cat, and 
> one with /lib64/ld64.so.1 /bin/cat.
> 
> Fortunately, this does *not* apply to PIE binaries (also attached). 
> However, explicit loader invocations are sometimes used in test suites 
> (not just for glibc), and these sporadic test failures are quite annoying.
> 
> Do you still need the strace log?  And if yes, of what exactly?

Thanks, that should be quite helpful. I'll spend a bit more time to
study it, I'll let you know if I need any other traces.

> 
> > Something like the following patch may help if you could test.  
> 
> Okay, this will take some time.

It's no rush, there will probably be a revision to come.

Thanks,
Nick

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: POWER: Unexpected fault when writing to brk-allocated memory
@ 2017-11-05 12:54       ` Nicholas Piggin
  0 siblings, 0 replies; 71+ messages in thread
From: Nicholas Piggin @ 2017-11-05 12:54 UTC (permalink / raw)
  To: Florian Weimer; +Cc: Aneesh Kumar K . V, linuxppc-dev, linux-mm

On Sun, 5 Nov 2017 13:35:40 +0100
Florian Weimer <fweimer@redhat.com> wrote:

> On 11/05/2017 01:18 PM, Nicholas Piggin wrote:
> 
> > There was a recent change to move to 128TB address space by default,
> > and option for 512TB addresses if explicitly requested.  
> 
> Do you have a commit hash for the introduction of 128TB by default?  Thanks.

I guess this one

f6eedbba7a26 ("powerpc/mm/hash: Increase VA range to 128TB")

> 
> > Your brk request asked for > 128TB which the kernel gave it, but the
> > address limit in the paca that the SLB miss tests against was not
> > updated to reflect the switch to 512TB address space.
> > 
> > Why is your brk starting so high? Are you trying to test the > 128TB
> > case, or maybe something is confused by the 64->128TB change? What's
> > the strace look like if you run on a distro or <= 4.10 kernel?  
> 
> I think it is a consequence of running with an explicit loader 
> invocation.  With that, the heap is placed above ld.so, which can be 
> quite high in the address space.
> 
> I'm attaching two runs of cat, one executing directly as /bin/cat, and 
> one with /lib64/ld64.so.1 /bin/cat.
> 
> Fortunately, this does *not* apply to PIE binaries (also attached). 
> However, explicit loader invocations are sometimes used in test suites 
> (not just for glibc), and these sporadic test failures are quite annoying.
> 
> Do you still need the strace log?  And if yes, of what exactly?

Thanks, that should be quite helpful. I'll spend a bit more time to
study it, I'll let you know if I need any other traces.

> 
> > Something like the following patch may help if you could test.  
> 
> Okay, this will take some time.

It's no rush, there will probably be a revision to come.

Thanks,
Nick

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: POWER: Unexpected fault when writing to brk-allocated memory
  2017-11-05 12:18   ` Nicholas Piggin
@ 2017-11-05 14:50     ` Florian Weimer
  -1 siblings, 0 replies; 71+ messages in thread
From: Florian Weimer @ 2017-11-05 14:50 UTC (permalink / raw)
  To: Nicholas Piggin, Aneesh Kumar K . V; +Cc: linuxppc-dev, linux-mm

On 11/05/2017 01:18 PM, Nicholas Piggin wrote:
> Something like the following patch may help if you could test.

The patch appears to fix it:

# /lib64/ld64.so.1 ./a.out
initial brk value: 0x7fffe4590000
probing at 0x80000001fffc

I used the follow simplified reproducer:

#include <err.h>
#include <unistd.h>
#include <inttypes.h>
#include <errno.h>
#include <stdio.h>

int
main (void)
{
   errno = 0;
   void *p = sbrk (0);
   if (errno != 0)
     err (1, "sbrk (0)");
   printf ("initial brk value: %p\n", p);
   unsigned long long target = 0x800000020000ULL;
   if ((uintptr_t) p >= target)
     errx (1, "initial brk value is already above target");
   unsigned long long increment = target - (uintptr_t) p;
   errno = 0;
   sbrk (increment);
   if (errno != 0)
     err (1, "sbrk (0x%llx)", increment);
   volatile int *pi = (volatile int *) (target - 4);
   printf ("probing at %p\n", pi);
   *pi = 1;
}


It is still probabilistic because if the increment is too large, the 
second sbrk call will fail with an out of memory error (which is 
expected), so you'll have to run it a couple of times.

If the test fails, the write at the will segfault.

Thanks,
Florian

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: POWER: Unexpected fault when writing to brk-allocated memory
@ 2017-11-05 14:50     ` Florian Weimer
  0 siblings, 0 replies; 71+ messages in thread
From: Florian Weimer @ 2017-11-05 14:50 UTC (permalink / raw)
  To: Nicholas Piggin, Aneesh Kumar K . V; +Cc: linuxppc-dev, linux-mm

On 11/05/2017 01:18 PM, Nicholas Piggin wrote:
> Something like the following patch may help if you could test.

The patch appears to fix it:

# /lib64/ld64.so.1 ./a.out
initial brk value: 0x7fffe4590000
probing at 0x80000001fffc

I used the follow simplified reproducer:

#include <err.h>
#include <unistd.h>
#include <inttypes.h>
#include <errno.h>
#include <stdio.h>

int
main (void)
{
   errno = 0;
   void *p = sbrk (0);
   if (errno != 0)
     err (1, "sbrk (0)");
   printf ("initial brk value: %p\n", p);
   unsigned long long target = 0x800000020000ULL;
   if ((uintptr_t) p >= target)
     errx (1, "initial brk value is already above target");
   unsigned long long increment = target - (uintptr_t) p;
   errno = 0;
   sbrk (increment);
   if (errno != 0)
     err (1, "sbrk (0x%llx)", increment);
   volatile int *pi = (volatile int *) (target - 4);
   printf ("probing at %p\n", pi);
   *pi = 1;
}


It is still probabilistic because if the increment is too large, the 
second sbrk call will fail with an out of memory error (which is 
expected), so you'll have to run it a couple of times.

If the test fails, the write at the will segfault.

Thanks,
Florian

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: POWER: Unexpected fault when writing to brk-allocated memory
  2017-11-05 12:18   ` Nicholas Piggin
@ 2017-11-06  6:18     ` Aneesh Kumar K.V
  -1 siblings, 0 replies; 71+ messages in thread
From: Aneesh Kumar K.V @ 2017-11-06  6:18 UTC (permalink / raw)
  To: Nicholas Piggin, Florian Weimer, Kirill A. Shutemov
  Cc: linuxppc-dev, linux-mm

Nicholas Piggin <npiggin@gmail.com> writes:

> On Fri, 3 Nov 2017 18:05:20 +0100
> Florian Weimer <fweimer@redhat.com> wrote:
>
>> We are seeing an issue on ppc64le and ppc64 (and perhaps on some arm 
>> variant, but I have not seen it on our own builders) where running 
>> localedef as part of the glibc build crashes with a segmentation fault.
>> 
>> Kernel version is 4.13.9 (Fedora 26 variant).
>> 
>> I have only seen this with an explicit loader invocation, like this:
>> 
>> while I18NPATH=. /lib64/ld64.so.1 /usr/bin/localedef 
>> --alias-file=../intl/locale.alias --no-archive -i locales/nl_AW -c -f 
>> charmaps/UTF-8 
>> --prefix=/builddir/build/BUILDROOT/glibc-2.26-16.fc27.ppc64 nl_AW ; do : 
>> ; done
>> 
>> To be run in the localedata subdirectory of a glibc *source* tree, after 
>> a build.  You may have to create the 
>> /builddir/build/BUILDROOT/glibc-2.26-16.fc27.ppc64/usr/lib/locale 
>> directory.  I have only reproduced this inside a Fedora 27 chroot on a 
>> Fedora 26 host, but there it does not matter if you run the old (chroot) 
>> or newly built binary.
>> 
>> I filed this as a glibc bug for tracking:
>> 
>>    https://sourceware.org/bugzilla/show_bug.cgi?id=22390
>> 
>> There's an strace log and a coredump from the crash.
>> 
>> I think the data shows that the address in question should be writable.
>> 
>> The crossed 0x0000800000000000 binary is very suggestive.  I think that 
>> based on the operation of glibc's malloc, this write would be the first 
>> time this happens during the lifetime of the process.
>> 
>> Does that ring any bells?  Is there anything I can do to provide more 
>> data?  The host is an LPAR with a stock Fedora 26 kernel, so I can use 
>> any diagnostics tool which is provided by Fedora.
>
> There was a recent change to move to 128TB address space by default,
> and option for 512TB addresses if explicitly requested.
>
> Your brk request asked for > 128TB which the kernel gave it, but the
> address limit in the paca that the SLB miss tests against was not
> updated to reflect the switch to 512TB address space.

We should not return that address, unless we requested with a hint value
of > 128TB. IIRC we discussed this early during the mmap interface
change and said, we will return an address > 128T only if the hint
address is above 128TB (not hint addr + length). I am not sure why
we are finding us returning and address > 128TB with paca limit set to
128TB?


>
> Why is your brk starting so high? Are you trying to test the > 128TB
> case, or maybe something is confused by the 64->128TB change? What's
> the strace look like if you run on a distro or <= 4.10 kernel?
>
> Something like the following patch may help if you could test.
>
> Thanks,
> Nick
>

-aneesh

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: POWER: Unexpected fault when writing to brk-allocated memory
@ 2017-11-06  6:18     ` Aneesh Kumar K.V
  0 siblings, 0 replies; 71+ messages in thread
From: Aneesh Kumar K.V @ 2017-11-06  6:18 UTC (permalink / raw)
  To: Nicholas Piggin, Florian Weimer, Kirill A. Shutemov
  Cc: linuxppc-dev, linux-mm

Nicholas Piggin <npiggin@gmail.com> writes:

> On Fri, 3 Nov 2017 18:05:20 +0100
> Florian Weimer <fweimer@redhat.com> wrote:
>
>> We are seeing an issue on ppc64le and ppc64 (and perhaps on some arm 
>> variant, but I have not seen it on our own builders) where running 
>> localedef as part of the glibc build crashes with a segmentation fault.
>> 
>> Kernel version is 4.13.9 (Fedora 26 variant).
>> 
>> I have only seen this with an explicit loader invocation, like this:
>> 
>> while I18NPATH=. /lib64/ld64.so.1 /usr/bin/localedef 
>> --alias-file=../intl/locale.alias --no-archive -i locales/nl_AW -c -f 
>> charmaps/UTF-8 
>> --prefix=/builddir/build/BUILDROOT/glibc-2.26-16.fc27.ppc64 nl_AW ; do : 
>> ; done
>> 
>> To be run in the localedata subdirectory of a glibc *source* tree, after 
>> a build.  You may have to create the 
>> /builddir/build/BUILDROOT/glibc-2.26-16.fc27.ppc64/usr/lib/locale 
>> directory.  I have only reproduced this inside a Fedora 27 chroot on a 
>> Fedora 26 host, but there it does not matter if you run the old (chroot) 
>> or newly built binary.
>> 
>> I filed this as a glibc bug for tracking:
>> 
>>    https://sourceware.org/bugzilla/show_bug.cgi?id=22390
>> 
>> There's an strace log and a coredump from the crash.
>> 
>> I think the data shows that the address in question should be writable.
>> 
>> The crossed 0x0000800000000000 binary is very suggestive.  I think that 
>> based on the operation of glibc's malloc, this write would be the first 
>> time this happens during the lifetime of the process.
>> 
>> Does that ring any bells?  Is there anything I can do to provide more 
>> data?  The host is an LPAR with a stock Fedora 26 kernel, so I can use 
>> any diagnostics tool which is provided by Fedora.
>
> There was a recent change to move to 128TB address space by default,
> and option for 512TB addresses if explicitly requested.
>
> Your brk request asked for > 128TB which the kernel gave it, but the
> address limit in the paca that the SLB miss tests against was not
> updated to reflect the switch to 512TB address space.

We should not return that address, unless we requested with a hint value
of > 128TB. IIRC we discussed this early during the mmap interface
change and said, we will return an address > 128T only if the hint
address is above 128TB (not hint addr + length). I am not sure why
we are finding us returning and address > 128TB with paca limit set to
128TB?


>
> Why is your brk starting so high? Are you trying to test the > 128TB
> case, or maybe something is confused by the 64->128TB change? What's
> the strace look like if you run on a distro or <= 4.10 kernel?
>
> Something like the following patch may help if you could test.
>
> Thanks,
> Nick
>

-aneesh

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: POWER: Unexpected fault when writing to brk-allocated memory
  2017-11-06  6:18     ` Aneesh Kumar K.V
@ 2017-11-06  6:47       ` Nicholas Piggin
  -1 siblings, 0 replies; 71+ messages in thread
From: Nicholas Piggin @ 2017-11-06  6:47 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: Florian Weimer, Kirill A. Shutemov, linuxppc-dev, linux-mm

On Mon, 06 Nov 2017 11:48:06 +0530
"Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> wrote:

> Nicholas Piggin <npiggin@gmail.com> writes:
> 
> > On Fri, 3 Nov 2017 18:05:20 +0100
> > Florian Weimer <fweimer@redhat.com> wrote:
> >  
> >> We are seeing an issue on ppc64le and ppc64 (and perhaps on some arm 
> >> variant, but I have not seen it on our own builders) where running 
> >> localedef as part of the glibc build crashes with a segmentation fault.
> >> 
> >> Kernel version is 4.13.9 (Fedora 26 variant).
> >> 
> >> I have only seen this with an explicit loader invocation, like this:
> >> 
> >> while I18NPATH=. /lib64/ld64.so.1 /usr/bin/localedef 
> >> --alias-file=../intl/locale.alias --no-archive -i locales/nl_AW -c -f 
> >> charmaps/UTF-8 
> >> --prefix=/builddir/build/BUILDROOT/glibc-2.26-16.fc27.ppc64 nl_AW ; do : 
> >> ; done
> >> 
> >> To be run in the localedata subdirectory of a glibc *source* tree, after 
> >> a build.  You may have to create the 
> >> /builddir/build/BUILDROOT/glibc-2.26-16.fc27.ppc64/usr/lib/locale 
> >> directory.  I have only reproduced this inside a Fedora 27 chroot on a 
> >> Fedora 26 host, but there it does not matter if you run the old (chroot) 
> >> or newly built binary.
> >> 
> >> I filed this as a glibc bug for tracking:
> >> 
> >>    https://sourceware.org/bugzilla/show_bug.cgi?id=22390
> >> 
> >> There's an strace log and a coredump from the crash.
> >> 
> >> I think the data shows that the address in question should be writable.
> >> 
> >> The crossed 0x0000800000000000 binary is very suggestive.  I think that 
> >> based on the operation of glibc's malloc, this write would be the first 
> >> time this happens during the lifetime of the process.
> >> 
> >> Does that ring any bells?  Is there anything I can do to provide more 
> >> data?  The host is an LPAR with a stock Fedora 26 kernel, so I can use 
> >> any diagnostics tool which is provided by Fedora.  
> >
> > There was a recent change to move to 128TB address space by default,
> > and option for 512TB addresses if explicitly requested.
> >
> > Your brk request asked for > 128TB which the kernel gave it, but the
> > address limit in the paca that the SLB miss tests against was not
> > updated to reflect the switch to 512TB address space.  
> 
> We should not return that address, unless we requested with a hint value
> of > 128TB. IIRC we discussed this early during the mmap interface
> change and said, we will return an address > 128T only if the hint
> address is above 128TB (not hint addr + length).

Yeah, I'm thinking we should change that. Make explicit addr + length
hint return > 128TB. Well, it already seems to for this case, which
is why powerpc breaks.

This restriction was added for reasonably well written apps that just
happened to assume they don't get > 128TB va returned by mmap. An app
which asked for addr < 128TB && addr + len > 128TB and were relying on
that to fail is very different. I don't think we should add a big
unintuitive wart to the interface for such an obscure and broken type
of app.

"You get < 128TB unless explicitly requested."

Simple, reasonable, obvious rule. Avoids breaking apps that store
some bits in the top of pointers (provided that memory allocator
userspace libraries also do the right thing).

Thanks,
Nick

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: POWER: Unexpected fault when writing to brk-allocated memory
@ 2017-11-06  6:47       ` Nicholas Piggin
  0 siblings, 0 replies; 71+ messages in thread
From: Nicholas Piggin @ 2017-11-06  6:47 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: Florian Weimer, Kirill A. Shutemov, linuxppc-dev, linux-mm

On Mon, 06 Nov 2017 11:48:06 +0530
"Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> wrote:

> Nicholas Piggin <npiggin@gmail.com> writes:
> 
> > On Fri, 3 Nov 2017 18:05:20 +0100
> > Florian Weimer <fweimer@redhat.com> wrote:
> >  
> >> We are seeing an issue on ppc64le and ppc64 (and perhaps on some arm 
> >> variant, but I have not seen it on our own builders) where running 
> >> localedef as part of the glibc build crashes with a segmentation fault.
> >> 
> >> Kernel version is 4.13.9 (Fedora 26 variant).
> >> 
> >> I have only seen this with an explicit loader invocation, like this:
> >> 
> >> while I18NPATH=. /lib64/ld64.so.1 /usr/bin/localedef 
> >> --alias-file=../intl/locale.alias --no-archive -i locales/nl_AW -c -f 
> >> charmaps/UTF-8 
> >> --prefix=/builddir/build/BUILDROOT/glibc-2.26-16.fc27.ppc64 nl_AW ; do : 
> >> ; done
> >> 
> >> To be run in the localedata subdirectory of a glibc *source* tree, after 
> >> a build.  You may have to create the 
> >> /builddir/build/BUILDROOT/glibc-2.26-16.fc27.ppc64/usr/lib/locale 
> >> directory.  I have only reproduced this inside a Fedora 27 chroot on a 
> >> Fedora 26 host, but there it does not matter if you run the old (chroot) 
> >> or newly built binary.
> >> 
> >> I filed this as a glibc bug for tracking:
> >> 
> >>    https://sourceware.org/bugzilla/show_bug.cgi?id=22390
> >> 
> >> There's an strace log and a coredump from the crash.
> >> 
> >> I think the data shows that the address in question should be writable.
> >> 
> >> The crossed 0x0000800000000000 binary is very suggestive.  I think that 
> >> based on the operation of glibc's malloc, this write would be the first 
> >> time this happens during the lifetime of the process.
> >> 
> >> Does that ring any bells?  Is there anything I can do to provide more 
> >> data?  The host is an LPAR with a stock Fedora 26 kernel, so I can use 
> >> any diagnostics tool which is provided by Fedora.  
> >
> > There was a recent change to move to 128TB address space by default,
> > and option for 512TB addresses if explicitly requested.
> >
> > Your brk request asked for > 128TB which the kernel gave it, but the
> > address limit in the paca that the SLB miss tests against was not
> > updated to reflect the switch to 512TB address space.  
> 
> We should not return that address, unless we requested with a hint value
> of > 128TB. IIRC we discussed this early during the mmap interface
> change and said, we will return an address > 128T only if the hint
> address is above 128TB (not hint addr + length).

Yeah, I'm thinking we should change that. Make explicit addr + length
hint return > 128TB. Well, it already seems to for this case, which
is why powerpc breaks.

This restriction was added for reasonably well written apps that just
happened to assume they don't get > 128TB va returned by mmap. An app
which asked for addr < 128TB && addr + len > 128TB and were relying on
that to fail is very different. I don't think we should add a big
unintuitive wart to the interface for such an obscure and broken type
of app.

"You get < 128TB unless explicitly requested."

Simple, reasonable, obvious rule. Avoids breaking apps that store
some bits in the top of pointers (provided that memory allocator
userspace libraries also do the right thing).

Thanks,
Nick

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: POWER: Unexpected fault when writing to brk-allocated memory
  2017-11-06  6:18     ` Aneesh Kumar K.V
@ 2017-11-06  8:10       ` Florian Weimer
  -1 siblings, 0 replies; 71+ messages in thread
From: Florian Weimer @ 2017-11-06  8:10 UTC (permalink / raw)
  To: Aneesh Kumar K.V, Nicholas Piggin, Kirill A. Shutemov
  Cc: linuxppc-dev, linux-mm

On 11/06/2017 07:18 AM, Aneesh Kumar K.V wrote:
> We should not return that address, unless we requested with a hint value
> of > 128TB. IIRC we discussed this early during the mmap interface
> change and said, we will return an address > 128T only if the hint
> address is above 128TB (not hint addr + length). I am not sure why
> we are finding us returning and address > 128TB with paca limit set to
> 128TB?

See the memory maps I posted.  I think it was not anticipated that the 
heap could be near the 128 TiB limit because it is placed next to the 
initially mapped object.

I think this could become worse once we have static PIE support because 
static PIE binaries likely have the same memory layout.  (Ordinary PIE 
does not.)

Thanks,
Florian

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: POWER: Unexpected fault when writing to brk-allocated memory
@ 2017-11-06  8:10       ` Florian Weimer
  0 siblings, 0 replies; 71+ messages in thread
From: Florian Weimer @ 2017-11-06  8:10 UTC (permalink / raw)
  To: Aneesh Kumar K.V, Nicholas Piggin, Kirill A. Shutemov
  Cc: linuxppc-dev, linux-mm

On 11/06/2017 07:18 AM, Aneesh Kumar K.V wrote:
> We should not return that address, unless we requested with a hint value
> of > 128TB. IIRC we discussed this early during the mmap interface
> change and said, we will return an address > 128T only if the hint
> address is above 128TB (not hint addr + length). I am not sure why
> we are finding us returning and address > 128TB with paca limit set to
> 128TB?

See the memory maps I posted.  I think it was not anticipated that the 
heap could be near the 128 TiB limit because it is placed next to the 
initially mapped object.

I think this could become worse once we have static PIE support because 
static PIE binaries likely have the same memory layout.  (Ordinary PIE 
does not.)

Thanks,
Florian

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: POWER: Unexpected fault when writing to brk-allocated memory
  2017-11-06  6:47       ` Nicholas Piggin
@ 2017-11-06  8:11         ` Florian Weimer
  -1 siblings, 0 replies; 71+ messages in thread
From: Florian Weimer @ 2017-11-06  8:11 UTC (permalink / raw)
  To: Nicholas Piggin, Aneesh Kumar K.V
  Cc: Kirill A. Shutemov, linuxppc-dev, linux-mm

On 11/06/2017 07:47 AM, Nicholas Piggin wrote:
> "You get < 128TB unless explicitly requested."
> 
> Simple, reasonable, obvious rule. Avoids breaking apps that store
> some bits in the top of pointers (provided that memory allocator
> userspace libraries also do the right thing).

So brk would simplify fail instead of crossing the 128 TiB threshold?

glibc malloc should cope with that and switch to malloc, but this code 
path is obviously less well-tested than the regular way.

Thanks,
Florian

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: POWER: Unexpected fault when writing to brk-allocated memory
@ 2017-11-06  8:11         ` Florian Weimer
  0 siblings, 0 replies; 71+ messages in thread
From: Florian Weimer @ 2017-11-06  8:11 UTC (permalink / raw)
  To: Nicholas Piggin, Aneesh Kumar K.V
  Cc: Kirill A. Shutemov, linuxppc-dev, linux-mm

On 11/06/2017 07:47 AM, Nicholas Piggin wrote:
> "You get < 128TB unless explicitly requested."
> 
> Simple, reasonable, obvious rule. Avoids breaking apps that store
> some bits in the top of pointers (provided that memory allocator
> userspace libraries also do the right thing).

So brk would simplify fail instead of crossing the 128 TiB threshold?

glibc malloc should cope with that and switch to malloc, but this code 
path is obviously less well-tested than the regular way.

Thanks,
Florian

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: POWER: Unexpected fault when writing to brk-allocated memory
  2017-11-06  8:11         ` Florian Weimer
@ 2017-11-06  8:25           ` Nicholas Piggin
  -1 siblings, 0 replies; 71+ messages in thread
From: Nicholas Piggin @ 2017-11-06  8:25 UTC (permalink / raw)
  To: Florian Weimer
  Cc: Aneesh Kumar K.V, Kirill A. Shutemov, linuxppc-dev, linux-mm

On Mon, 6 Nov 2017 09:11:37 +0100
Florian Weimer <fweimer@redhat.com> wrote:

> On 11/06/2017 07:47 AM, Nicholas Piggin wrote:
> > "You get < 128TB unless explicitly requested."
> > 
> > Simple, reasonable, obvious rule. Avoids breaking apps that store
> > some bits in the top of pointers (provided that memory allocator
> > userspace libraries also do the right thing).  
> 
> So brk would simplify fail instead of crossing the 128 TiB threshold?

Yes, that was the intention and that's what x86 seems to do.

> 
> glibc malloc should cope with that and switch to malloc, but this code 
> path is obviously less well-tested than the regular way.

Switch to mmap() I guess you meant?

powerpc has a couple of bugs in corner cases, so those should be fixed
according to intended policy for stable kernels I think.

But I question the policy. Just seems like an ugly and ineffective wart.
Exactly for such cases as this -- behaviour would change from run to run
depending on your address space randomization for example! In case your
brk happens to land nicely on 128TB then the next one would succeed. If
not, then behaviour changes. And you didn't catch or prevent any bugs.

Thanks,
Nick

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: POWER: Unexpected fault when writing to brk-allocated memory
@ 2017-11-06  8:25           ` Nicholas Piggin
  0 siblings, 0 replies; 71+ messages in thread
From: Nicholas Piggin @ 2017-11-06  8:25 UTC (permalink / raw)
  To: Florian Weimer
  Cc: Aneesh Kumar K.V, Kirill A. Shutemov, linuxppc-dev, linux-mm

On Mon, 6 Nov 2017 09:11:37 +0100
Florian Weimer <fweimer@redhat.com> wrote:

> On 11/06/2017 07:47 AM, Nicholas Piggin wrote:
> > "You get < 128TB unless explicitly requested."
> > 
> > Simple, reasonable, obvious rule. Avoids breaking apps that store
> > some bits in the top of pointers (provided that memory allocator
> > userspace libraries also do the right thing).  
> 
> So brk would simplify fail instead of crossing the 128 TiB threshold?

Yes, that was the intention and that's what x86 seems to do.

> 
> glibc malloc should cope with that and switch to malloc, but this code 
> path is obviously less well-tested than the regular way.

Switch to mmap() I guess you meant?

powerpc has a couple of bugs in corner cases, so those should be fixed
according to intended policy for stable kernels I think.

But I question the policy. Just seems like an ugly and ineffective wart.
Exactly for such cases as this -- behaviour would change from run to run
depending on your address space randomization for example! In case your
brk happens to land nicely on 128TB then the next one would succeed. If
not, then behaviour changes. And you didn't catch or prevent any bugs.

Thanks,
Nick

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: POWER: Unexpected fault when writing to brk-allocated memory
  2017-11-06  8:25           ` Nicholas Piggin
@ 2017-11-06  8:30             ` Aneesh Kumar K.V
  -1 siblings, 0 replies; 71+ messages in thread
From: Aneesh Kumar K.V @ 2017-11-06  8:30 UTC (permalink / raw)
  To: Nicholas Piggin, Florian Weimer
  Cc: linux-mm, linuxppc-dev, Kirill A. Shutemov



On 11/06/2017 01:55 PM, Nicholas Piggin wrote:
> On Mon, 6 Nov 2017 09:11:37 +0100
> Florian Weimer <fweimer@redhat.com> wrote:
> 
>> On 11/06/2017 07:47 AM, Nicholas Piggin wrote:
>>> "You get < 128TB unless explicitly requested."
>>>
>>> Simple, reasonable, obvious rule. Avoids breaking apps that store
>>> some bits in the top of pointers (provided that memory allocator
>>> userspace libraries also do the right thing).
>>
>> So brk would simplify fail instead of crossing the 128 TiB threshold?
> 
> Yes, that was the intention and that's what x86 seems to do.
> 
>>
>> glibc malloc should cope with that and switch to malloc, but this code
>> path is obviously less well-tested than the regular way.
> 
> Switch to mmap() I guess you meant?
> 
> powerpc has a couple of bugs in corner cases, so those should be fixed
> according to intended policy for stable kernels I think.
> 
> But I question the policy. Just seems like an ugly and ineffective wart.
> Exactly for such cases as this -- behaviour would change from run to run
> depending on your address space randomization for example! In case your
> brk happens to land nicely on 128TB then the next one would succeed.

Why ? It should not change between run to run. We limit the free
area search range based on hint address. So we should get consistent 
results across run. even if we changed the context.addr_limit.

-aneesh

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: POWER: Unexpected fault when writing to brk-allocated memory
@ 2017-11-06  8:30             ` Aneesh Kumar K.V
  0 siblings, 0 replies; 71+ messages in thread
From: Aneesh Kumar K.V @ 2017-11-06  8:30 UTC (permalink / raw)
  To: Nicholas Piggin, Florian Weimer
  Cc: Kirill A. Shutemov, linuxppc-dev, linux-mm



On 11/06/2017 01:55 PM, Nicholas Piggin wrote:
> On Mon, 6 Nov 2017 09:11:37 +0100
> Florian Weimer <fweimer@redhat.com> wrote:
> 
>> On 11/06/2017 07:47 AM, Nicholas Piggin wrote:
>>> "You get < 128TB unless explicitly requested."
>>>
>>> Simple, reasonable, obvious rule. Avoids breaking apps that store
>>> some bits in the top of pointers (provided that memory allocator
>>> userspace libraries also do the right thing).
>>
>> So brk would simplify fail instead of crossing the 128 TiB threshold?
> 
> Yes, that was the intention and that's what x86 seems to do.
> 
>>
>> glibc malloc should cope with that and switch to malloc, but this code
>> path is obviously less well-tested than the regular way.
> 
> Switch to mmap() I guess you meant?
> 
> powerpc has a couple of bugs in corner cases, so those should be fixed
> according to intended policy for stable kernels I think.
> 
> But I question the policy. Just seems like an ugly and ineffective wart.
> Exactly for such cases as this -- behaviour would change from run to run
> depending on your address space randomization for example! In case your
> brk happens to land nicely on 128TB then the next one would succeed.

Why ? It should not change between run to run. We limit the free
area search range based on hint address. So we should get consistent 
results across run. even if we changed the context.addr_limit.

-aneesh

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: POWER: Unexpected fault when writing to brk-allocated memory
  2017-11-06  8:30             ` Aneesh Kumar K.V
@ 2017-11-06  8:32               ` Florian Weimer
  -1 siblings, 0 replies; 71+ messages in thread
From: Florian Weimer @ 2017-11-06  8:32 UTC (permalink / raw)
  To: Aneesh Kumar K.V, Nicholas Piggin
  Cc: Kirill A. Shutemov, linuxppc-dev, linux-mm

On 11/06/2017 09:30 AM, Aneesh Kumar K.V wrote:
> On 11/06/2017 01:55 PM, Nicholas Piggin wrote:
>> On Mon, 6 Nov 2017 09:11:37 +0100
>> Florian Weimer <fweimer@redhat.com> wrote:
>>
>>> On 11/06/2017 07:47 AM, Nicholas Piggin wrote:
>>>> "You get < 128TB unless explicitly requested."
>>>>
>>>> Simple, reasonable, obvious rule. Avoids breaking apps that store
>>>> some bits in the top of pointers (provided that memory allocator
>>>> userspace libraries also do the right thing).
>>>
>>> So brk would simplify fail instead of crossing the 128 TiB threshold?
>>
>> Yes, that was the intention and that's what x86 seems to do.
>>
>>>
>>> glibc malloc should cope with that and switch to malloc, but this code
>>> path is obviously less well-tested than the regular way.
>>
>> Switch to mmap() I guess you meant?

Yes, sorry.

>> powerpc has a couple of bugs in corner cases, so those should be fixed
>> according to intended policy for stable kernels I think.
>>
>> But I question the policy. Just seems like an ugly and ineffective wart.
>> Exactly for such cases as this -- behaviour would change from run to run
>> depending on your address space randomization for example! In case your
>> brk happens to land nicely on 128TB then the next one would succeed.
> 
> Why ? It should not change between run to run. We limit the free
> area search range based on hint address. So we should get consistent 
> results across run. even if we changed the context.addr_limit.

The size of the gap to the 128 TiB limit varies between runs because of 
ASLR.  So some runs would use brk alone, others would use brk + malloc. 
That's not really desirable IMHO.

Thanks,
Florian

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: POWER: Unexpected fault when writing to brk-allocated memory
@ 2017-11-06  8:32               ` Florian Weimer
  0 siblings, 0 replies; 71+ messages in thread
From: Florian Weimer @ 2017-11-06  8:32 UTC (permalink / raw)
  To: Aneesh Kumar K.V, Nicholas Piggin
  Cc: Kirill A. Shutemov, linuxppc-dev, linux-mm

On 11/06/2017 09:30 AM, Aneesh Kumar K.V wrote:
> On 11/06/2017 01:55 PM, Nicholas Piggin wrote:
>> On Mon, 6 Nov 2017 09:11:37 +0100
>> Florian Weimer <fweimer@redhat.com> wrote:
>>
>>> On 11/06/2017 07:47 AM, Nicholas Piggin wrote:
>>>> "You get < 128TB unless explicitly requested."
>>>>
>>>> Simple, reasonable, obvious rule. Avoids breaking apps that store
>>>> some bits in the top of pointers (provided that memory allocator
>>>> userspace libraries also do the right thing).
>>>
>>> So brk would simplify fail instead of crossing the 128 TiB threshold?
>>
>> Yes, that was the intention and that's what x86 seems to do.
>>
>>>
>>> glibc malloc should cope with that and switch to malloc, but this code
>>> path is obviously less well-tested than the regular way.
>>
>> Switch to mmap() I guess you meant?

Yes, sorry.

>> powerpc has a couple of bugs in corner cases, so those should be fixed
>> according to intended policy for stable kernels I think.
>>
>> But I question the policy. Just seems like an ugly and ineffective wart.
>> Exactly for such cases as this -- behaviour would change from run to run
>> depending on your address space randomization for example! In case your
>> brk happens to land nicely on 128TB then the next one would succeed.
> 
> Why ? It should not change between run to run. We limit the free
> area search range based on hint address. So we should get consistent 
> results across run. even if we changed the context.addr_limit.

The size of the gap to the 128 TiB limit varies between runs because of 
ASLR.  So some runs would use brk alone, others would use brk + malloc. 
That's not really desirable IMHO.

Thanks,
Florian

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: POWER: Unexpected fault when writing to brk-allocated memory
  2017-11-06  8:32               ` Florian Weimer
@ 2017-11-06 10:20                 ` Nicholas Piggin
  -1 siblings, 0 replies; 71+ messages in thread
From: Nicholas Piggin @ 2017-11-06 10:20 UTC (permalink / raw)
  To: Florian Weimer
  Cc: Aneesh Kumar K.V, Kirill A. Shutemov, linuxppc-dev, linux-mm

On Mon, 6 Nov 2017 09:32:25 +0100
Florian Weimer <fweimer@redhat.com> wrote:

> On 11/06/2017 09:30 AM, Aneesh Kumar K.V wrote:
> > On 11/06/2017 01:55 PM, Nicholas Piggin wrote:  
> >> On Mon, 6 Nov 2017 09:11:37 +0100
> >> Florian Weimer <fweimer@redhat.com> wrote:
> >>  
> >>> On 11/06/2017 07:47 AM, Nicholas Piggin wrote:  
> >>>> "You get < 128TB unless explicitly requested."
> >>>>
> >>>> Simple, reasonable, obvious rule. Avoids breaking apps that store
> >>>> some bits in the top of pointers (provided that memory allocator
> >>>> userspace libraries also do the right thing).  
> >>>
> >>> So brk would simplify fail instead of crossing the 128 TiB threshold?  
> >>
> >> Yes, that was the intention and that's what x86 seems to do.
> >>  
> >>>
> >>> glibc malloc should cope with that and switch to malloc, but this code
> >>> path is obviously less well-tested than the regular way.  
> >>
> >> Switch to mmap() I guess you meant?  
> 
> Yes, sorry.
> 
> >> powerpc has a couple of bugs in corner cases, so those should be fixed
> >> according to intended policy for stable kernels I think.
> >>
> >> But I question the policy. Just seems like an ugly and ineffective wart.
> >> Exactly for such cases as this -- behaviour would change from run to run
> >> depending on your address space randomization for example! In case your
> >> brk happens to land nicely on 128TB then the next one would succeed.  
> > 
> > Why ? It should not change between run to run. We limit the free
> > area search range based on hint address. So we should get consistent 
> > results across run. even if we changed the context.addr_limit.  
> 
> The size of the gap to the 128 TiB limit varies between runs because of 
> ASLR.  So some runs would use brk alone, others would use brk + malloc. 
> That's not really desirable IMHO.

Yeah. Actually I looked at the code a bit more, and it seems that the
intention is for MAP_FIXED to do exactly what I wanted. brk() uses
MAP_FIXED under the covers, so this case should be okay I think. I'm
just slightly happier now, but I still think it's not the right thing
to do to fail an explicit request for crossing 128TB with a hint. Same
fundamental criticism still applies -- it does not really solve bugs
and just adds an unintuitive wart to the API, and a random change in
behaviour based on randomization.

Anyway I sent some patches that are split up better and hopefully solve
some bugs for powerpc without changing intended policy. That's left for
another discussion.

Thanks,
Nick

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: POWER: Unexpected fault when writing to brk-allocated memory
@ 2017-11-06 10:20                 ` Nicholas Piggin
  0 siblings, 0 replies; 71+ messages in thread
From: Nicholas Piggin @ 2017-11-06 10:20 UTC (permalink / raw)
  To: Florian Weimer
  Cc: Aneesh Kumar K.V, Kirill A. Shutemov, linuxppc-dev, linux-mm

On Mon, 6 Nov 2017 09:32:25 +0100
Florian Weimer <fweimer@redhat.com> wrote:

> On 11/06/2017 09:30 AM, Aneesh Kumar K.V wrote:
> > On 11/06/2017 01:55 PM, Nicholas Piggin wrote:  
> >> On Mon, 6 Nov 2017 09:11:37 +0100
> >> Florian Weimer <fweimer@redhat.com> wrote:
> >>  
> >>> On 11/06/2017 07:47 AM, Nicholas Piggin wrote:  
> >>>> "You get < 128TB unless explicitly requested."
> >>>>
> >>>> Simple, reasonable, obvious rule. Avoids breaking apps that store
> >>>> some bits in the top of pointers (provided that memory allocator
> >>>> userspace libraries also do the right thing).  
> >>>
> >>> So brk would simplify fail instead of crossing the 128 TiB threshold?  
> >>
> >> Yes, that was the intention and that's what x86 seems to do.
> >>  
> >>>
> >>> glibc malloc should cope with that and switch to malloc, but this code
> >>> path is obviously less well-tested than the regular way.  
> >>
> >> Switch to mmap() I guess you meant?  
> 
> Yes, sorry.
> 
> >> powerpc has a couple of bugs in corner cases, so those should be fixed
> >> according to intended policy for stable kernels I think.
> >>
> >> But I question the policy. Just seems like an ugly and ineffective wart.
> >> Exactly for such cases as this -- behaviour would change from run to run
> >> depending on your address space randomization for example! In case your
> >> brk happens to land nicely on 128TB then the next one would succeed.  
> > 
> > Why ? It should not change between run to run. We limit the free
> > area search range based on hint address. So we should get consistent 
> > results across run. even if we changed the context.addr_limit.  
> 
> The size of the gap to the 128 TiB limit varies between runs because of 
> ASLR.  So some runs would use brk alone, others would use brk + malloc. 
> That's not really desirable IMHO.

Yeah. Actually I looked at the code a bit more, and it seems that the
intention is for MAP_FIXED to do exactly what I wanted. brk() uses
MAP_FIXED under the covers, so this case should be okay I think. I'm
just slightly happier now, but I still think it's not the right thing
to do to fail an explicit request for crossing 128TB with a hint. Same
fundamental criticism still applies -- it does not really solve bugs
and just adds an unintuitive wart to the API, and a random change in
behaviour based on randomization.

Anyway I sent some patches that are split up better and hopefully solve
some bugs for powerpc without changing intended policy. That's left for
another discussion.

Thanks,
Nick

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: POWER: Unexpected fault when writing to brk-allocated memory
  2017-11-06  8:32               ` Florian Weimer
@ 2017-11-07  5:07                 ` Nicholas Piggin
  -1 siblings, 0 replies; 71+ messages in thread
From: Nicholas Piggin @ 2017-11-07  5:07 UTC (permalink / raw)
  To: Florian Weimer
  Cc: Aneesh Kumar K.V, Kirill A. Shutemov, linuxppc-dev, linux-mm,
	Andrew Morton, Andy Lutomirski, Dave Hansen, Linus Torvalds,
	Peter Zijlstra, Thomas Gleixner, linux-arch, Ingo Molnar,
	Linux Kernel Mailing List

C'ing everyone who was on the x86 56-bit user virtual address patch.

I think we need more time to discuss this behaviour, in light of the
regression Florian uncovered. I would propose we turn off the 56-bit
user virtual address support for x86 for 4.14, and powerpc would
follow and turn off its 512T support until we can get a better handle
on the problems. (Actually Florian initially hit a couple of bugs in
powerpc implementation, but pulling that string uncovers a whole lot
of difficulties.)

The bi-modal behavior switched based on a combination of mmap address
hint and MAP_FIXED just sucks. It's segregating our VA space with
some non-standard heuristics, and it doesn't seem to work very well.

What are we trying to do? Allow SAP HANA etc use huge address spaces
by coding to these specific mmap heuristics we're going to add,
rather than solving it properly in a way that requires adding a new
syscall or personality or prctl or sysctl. Okay, but the cost is that
despite best efforts, it still changes ABI behaviour for existing
applications and these heuristics will become baked into the ABI that
we will have to support. Not a good tradeoff IMO.

First of all, using addr and MAP_FIXED to develop our heuristic can
never really give unchanged ABI. It's an in-band signal. brk() is a
good example that steadily keeps incrementing address, so depending
on malloc usage and address space randomization, you will get a brk()
that ends exactly at 128T, then the next one will be >
DEFAULT_MAP_WINDOW, and it will switch you to 56 bit address space.

Second, the kernel can never completely solve the problem this way.
How do we know a malloc library will not ask for > 128TB addresses
and pass them to an unknowing application?

And lastly, there are a fair few bugs and places where description
in changelogs and mailing lists does not match code. You don't want
to know the mess in powerpc, but even x86 has two I can see:
MAP_FIXED succeeds even when crossing 128TB addresses (where changelog
indicated it should not), arch_get_unmapped_area_topdown() with an
address hint is checking against TASK_SIZE rather than the limited
128TB address, so it looks like it won't follow the heuristics.

So unless everyone else thinks I'm crazy and disagrees, I'd ask for
a bit more time to make sure we get this interface right. I would
hope for something like prctl PR_SET_MM which can be used to set
our user virtual address bits on a fine grained basis. Maybe a
sysctl, maybe a personality. Something out-of-band. I don't wan to
get too far into that discussion yet. First we need to agree whether
or not the code in the tree today is a problem.

Thanks,
Nick

On Mon, 6 Nov 2017 09:32:25 +0100
Florian Weimer <fweimer@redhat.com> wrote:

> On 11/06/2017 09:30 AM, Aneesh Kumar K.V wrote:
> > On 11/06/2017 01:55 PM, Nicholas Piggin wrote:  
> >> On Mon, 6 Nov 2017 09:11:37 +0100
> >> Florian Weimer <fweimer@redhat.com> wrote:
> >>  
> >>> On 11/06/2017 07:47 AM, Nicholas Piggin wrote:  
> >>>> "You get < 128TB unless explicitly requested."
> >>>>
> >>>> Simple, reasonable, obvious rule. Avoids breaking apps that store
> >>>> some bits in the top of pointers (provided that memory allocator
> >>>> userspace libraries also do the right thing).  
> >>>
> >>> So brk would simplify fail instead of crossing the 128 TiB threshold?  
> >>
> >> Yes, that was the intention and that's what x86 seems to do.
> >>  
> >>>
> >>> glibc malloc should cope with that and switch to malloc, but this code
> >>> path is obviously less well-tested than the regular way.  
> >>
> >> Switch to mmap() I guess you meant?  
> 
> Yes, sorry.
> 
> >> powerpc has a couple of bugs in corner cases, so those should be fixed
> >> according to intended policy for stable kernels I think.
> >>
> >> But I question the policy. Just seems like an ugly and ineffective wart.
> >> Exactly for such cases as this -- behaviour would change from run to run
> >> depending on your address space randomization for example! In case your
> >> brk happens to land nicely on 128TB then the next one would succeed.  
> > 
> > Why ? It should not change between run to run. We limit the free
> > area search range based on hint address. So we should get consistent 
> > results across run. even if we changed the context.addr_limit.  
> 
> The size of the gap to the 128 TiB limit varies between runs because of 
> ASLR.  So some runs would use brk alone, others would use brk + malloc. 
> That's not really desirable IMHO.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: POWER: Unexpected fault when writing to brk-allocated memory
@ 2017-11-07  5:07                 ` Nicholas Piggin
  0 siblings, 0 replies; 71+ messages in thread
From: Nicholas Piggin @ 2017-11-07  5:07 UTC (permalink / raw)
  To: Florian Weimer
  Cc: Aneesh Kumar K.V, Kirill A. Shutemov, linuxppc-dev, linux-mm,
	Andrew Morton, Andy Lutomirski, Dave Hansen, Linus Torvalds,
	Peter Zijlstra, Thomas Gleixner, linux-arch, Ingo Molnar,
	Linux Kernel Mailing List

C'ing everyone who was on the x86 56-bit user virtual address patch.

I think we need more time to discuss this behaviour, in light of the
regression Florian uncovered. I would propose we turn off the 56-bit
user virtual address support for x86 for 4.14, and powerpc would
follow and turn off its 512T support until we can get a better handle
on the problems. (Actually Florian initially hit a couple of bugs in
powerpc implementation, but pulling that string uncovers a whole lot
of difficulties.)

The bi-modal behavior switched based on a combination of mmap address
hint and MAP_FIXED just sucks. It's segregating our VA space with
some non-standard heuristics, and it doesn't seem to work very well.

What are we trying to do? Allow SAP HANA etc use huge address spaces
by coding to these specific mmap heuristics we're going to add,
rather than solving it properly in a way that requires adding a new
syscall or personality or prctl or sysctl. Okay, but the cost is that
despite best efforts, it still changes ABI behaviour for existing
applications and these heuristics will become baked into the ABI that
we will have to support. Not a good tradeoff IMO.

First of all, using addr and MAP_FIXED to develop our heuristic can
never really give unchanged ABI. It's an in-band signal. brk() is a
good example that steadily keeps incrementing address, so depending
on malloc usage and address space randomization, you will get a brk()
that ends exactly at 128T, then the next one will be >
DEFAULT_MAP_WINDOW, and it will switch you to 56 bit address space.

Second, the kernel can never completely solve the problem this way.
How do we know a malloc library will not ask for > 128TB addresses
and pass them to an unknowing application?

And lastly, there are a fair few bugs and places where description
in changelogs and mailing lists does not match code. You don't want
to know the mess in powerpc, but even x86 has two I can see:
MAP_FIXED succeeds even when crossing 128TB addresses (where changelog
indicated it should not), arch_get_unmapped_area_topdown() with an
address hint is checking against TASK_SIZE rather than the limited
128TB address, so it looks like it won't follow the heuristics.

So unless everyone else thinks I'm crazy and disagrees, I'd ask for
a bit more time to make sure we get this interface right. I would
hope for something like prctl PR_SET_MM which can be used to set
our user virtual address bits on a fine grained basis. Maybe a
sysctl, maybe a personality. Something out-of-band. I don't wan to
get too far into that discussion yet. First we need to agree whether
or not the code in the tree today is a problem.

Thanks,
Nick

On Mon, 6 Nov 2017 09:32:25 +0100
Florian Weimer <fweimer@redhat.com> wrote:

> On 11/06/2017 09:30 AM, Aneesh Kumar K.V wrote:
> > On 11/06/2017 01:55 PM, Nicholas Piggin wrote:  
> >> On Mon, 6 Nov 2017 09:11:37 +0100
> >> Florian Weimer <fweimer@redhat.com> wrote:
> >>  
> >>> On 11/06/2017 07:47 AM, Nicholas Piggin wrote:  
> >>>> "You get < 128TB unless explicitly requested."
> >>>>
> >>>> Simple, reasonable, obvious rule. Avoids breaking apps that store
> >>>> some bits in the top of pointers (provided that memory allocator
> >>>> userspace libraries also do the right thing).  
> >>>
> >>> So brk would simplify fail instead of crossing the 128 TiB threshold?  
> >>
> >> Yes, that was the intention and that's what x86 seems to do.
> >>  
> >>>
> >>> glibc malloc should cope with that and switch to malloc, but this code
> >>> path is obviously less well-tested than the regular way.  
> >>
> >> Switch to mmap() I guess you meant?  
> 
> Yes, sorry.
> 
> >> powerpc has a couple of bugs in corner cases, so those should be fixed
> >> according to intended policy for stable kernels I think.
> >>
> >> But I question the policy. Just seems like an ugly and ineffective wart.
> >> Exactly for such cases as this -- behaviour would change from run to run
> >> depending on your address space randomization for example! In case your
> >> brk happens to land nicely on 128TB then the next one would succeed.  
> > 
> > Why ? It should not change between run to run. We limit the free
> > area search range based on hint address. So we should get consistent 
> > results across run. even if we changed the context.addr_limit.  
> 
> The size of the gap to the 128 TiB limit varies between runs because of 
> ASLR.  So some runs would use brk alone, others would use brk + malloc. 
> That's not really desirable IMHO.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: POWER: Unexpected fault when writing to brk-allocated memory
  2017-11-07  5:07                 ` Nicholas Piggin
@ 2017-11-07  8:15                   ` Florian Weimer
  -1 siblings, 0 replies; 71+ messages in thread
From: Florian Weimer @ 2017-11-07  8:15 UTC (permalink / raw)
  To: Nicholas Piggin
  Cc: Aneesh Kumar K.V, Kirill A. Shutemov, linuxppc-dev, linux-mm,
	Andrew Morton, Andy Lutomirski, Dave Hansen, Linus Torvalds,
	Peter Zijlstra, Thomas Gleixner, linux-arch, Ingo Molnar,
	Linux Kernel Mailing List

On 11/07/2017 06:07 AM, Nicholas Piggin wrote:

> First of all, using addr and MAP_FIXED to develop our heuristic can
> never really give unchanged ABI. It's an in-band signal. brk() is a
> good example that steadily keeps incrementing address, so depending
> on malloc usage and address space randomization, you will get a brk()
> that ends exactly at 128T, then the next one will be >
> DEFAULT_MAP_WINDOW, and it will switch you to 56 bit address space.

Note that this brk phenomenon is only a concern for some currently 
obscure process memory layouts where the heap ends up at the top of the 
address space.  Usually, there is something above it which eliminates 
the possibility that it can cross into the 128 TiB wilderness.  So the 
brk problem only happens on some architectures (e.g., not x86-64), and 
only with strange ways of running programs (explicitly ld.so invocation 
and likely static PIE, too).

> So unless everyone else thinks I'm crazy and disagrees, I'd ask for
> a bit more time to make sure we get this interface right. I would
> hope for something like prctl PR_SET_MM which can be used to set
> our user virtual address bits on a fine grained basis. Maybe a
> sysctl, maybe a personality. Something out-of-band. I don't wan to
> get too far into that discussion yet. First we need to agree whether
> or not the code in the tree today is a problem.

There is certainly more demand for similar functionality, like creating 
mappings below 2 GB/4 GB/32 GB, and probably other bit patterns. 
Hotspot would use this to place the heap with compressed oops, instead 
of manually hunting for a suitable place for the mapping.  (Essentially, 
32-bit pointers on 64-bit architectures for sufficiently small heap 
sizes.)  It would perhaps be possible to use the hints address as a 
source of the bit count, for full flexibility.  And the mapping should 
be placed into the upper half of the selected window if possible.

MAP_FIXED is near-impossible to use correctly.  I hope you don't expect 
applications to do that.  If you want address-based opt in, it should 
work without MAP_FIXED.  Sure, in obscure cases, applications might 
still see out-of-range addresses, but I expected a full opt-out based on 
RLIMIT_AS would be sufficient for them.

Thanks,
Florian

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: POWER: Unexpected fault when writing to brk-allocated memory
@ 2017-11-07  8:15                   ` Florian Weimer
  0 siblings, 0 replies; 71+ messages in thread
From: Florian Weimer @ 2017-11-07  8:15 UTC (permalink / raw)
  To: Nicholas Piggin
  Cc: Aneesh Kumar K.V, Kirill A. Shutemov, linuxppc-dev, linux-mm,
	Andrew Morton, Andy Lutomirski, Dave Hansen, Linus Torvalds,
	Peter Zijlstra, Thomas Gleixner, linux-arch, Ingo Molnar,
	Linux Kernel Mailing List

On 11/07/2017 06:07 AM, Nicholas Piggin wrote:

> First of all, using addr and MAP_FIXED to develop our heuristic can
> never really give unchanged ABI. It's an in-band signal. brk() is a
> good example that steadily keeps incrementing address, so depending
> on malloc usage and address space randomization, you will get a brk()
> that ends exactly at 128T, then the next one will be >
> DEFAULT_MAP_WINDOW, and it will switch you to 56 bit address space.

Note that this brk phenomenon is only a concern for some currently 
obscure process memory layouts where the heap ends up at the top of the 
address space.  Usually, there is something above it which eliminates 
the possibility that it can cross into the 128 TiB wilderness.  So the 
brk problem only happens on some architectures (e.g., not x86-64), and 
only with strange ways of running programs (explicitly ld.so invocation 
and likely static PIE, too).

> So unless everyone else thinks I'm crazy and disagrees, I'd ask for
> a bit more time to make sure we get this interface right. I would
> hope for something like prctl PR_SET_MM which can be used to set
> our user virtual address bits on a fine grained basis. Maybe a
> sysctl, maybe a personality. Something out-of-band. I don't wan to
> get too far into that discussion yet. First we need to agree whether
> or not the code in the tree today is a problem.

There is certainly more demand for similar functionality, like creating 
mappings below 2 GB/4 GB/32 GB, and probably other bit patterns. 
Hotspot would use this to place the heap with compressed oops, instead 
of manually hunting for a suitable place for the mapping.  (Essentially, 
32-bit pointers on 64-bit architectures for sufficiently small heap 
sizes.)  It would perhaps be possible to use the hints address as a 
source of the bit count, for full flexibility.  And the mapping should 
be placed into the upper half of the selected window if possible.

MAP_FIXED is near-impossible to use correctly.  I hope you don't expect 
applications to do that.  If you want address-based opt in, it should 
work without MAP_FIXED.  Sure, in obscure cases, applications might 
still see out-of-range addresses, but I expected a full opt-out based on 
RLIMIT_AS would be sufficient for them.

Thanks,
Florian

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: POWER: Unexpected fault when writing to brk-allocated memory
  2017-11-07  8:15                   ` Florian Weimer
@ 2017-11-07  9:24                     ` Nicholas Piggin
  -1 siblings, 0 replies; 71+ messages in thread
From: Nicholas Piggin @ 2017-11-07  9:24 UTC (permalink / raw)
  To: Florian Weimer
  Cc: Aneesh Kumar K.V, Kirill A. Shutemov, linuxppc-dev, linux-mm,
	Andrew Morton, Andy Lutomirski, Dave Hansen, Linus Torvalds,
	Peter Zijlstra, Thomas Gleixner, linux-arch, Ingo Molnar,
	Linux Kernel Mailing List

On Tue, 7 Nov 2017 09:15:21 +0100
Florian Weimer <fweimer@redhat.com> wrote:

> On 11/07/2017 06:07 AM, Nicholas Piggin wrote:
> 
> > First of all, using addr and MAP_FIXED to develop our heuristic can
> > never really give unchanged ABI. It's an in-band signal. brk() is a
> > good example that steadily keeps incrementing address, so depending
> > on malloc usage and address space randomization, you will get a brk()
> > that ends exactly at 128T, then the next one will be >
> > DEFAULT_MAP_WINDOW, and it will switch you to 56 bit address space.  
> 
> Note that this brk phenomenon is only a concern for some currently 
> obscure process memory layouts where the heap ends up at the top of the 
> address space.  Usually, there is something above it which eliminates 
> the possibility that it can cross into the 128 TiB wilderness.  So the 
> brk problem only happens on some architectures (e.g., not x86-64), and 
> only with strange ways of running programs (explicitly ld.so invocation 
> and likely static PIE, too).

That's true, but there was an ABI change and the result is that it
changed behaviour.

And actually if it were not for a powerpc bug that caused a segfault,
the allocation would have worked, and you probably would never have
filed the bug. However the program would have been given a pointer >
128TB, which is what we were trying to avoid -- if your app also put
metadata in the top of the pointers, it would have broken.

So, are any obscure apps that could break? I don't know. Maybe it's
quite unlikely. Is that enough to go ahead with changing behaviour?

I'm not going to put up a big fight about this if others feel it's
not a problem. I raise it because in debugging this and seeing the
changed behaviour, the differences between implementations (x86,
powerpc hash, and powerpc radix, all have different edge case
behaviour), and that none of them seem to conform exactly to the
heuristics described in the changelogs, and there seems to be no man
page updates that I can see, it raises some red flags.

So I'm calling for one last opportunity to 1) agree on the desired
behaviour, and 2) ensure implementations are conforming.

> > So unless everyone else thinks I'm crazy and disagrees, I'd ask for
> > a bit more time to make sure we get this interface right. I would
> > hope for something like prctl PR_SET_MM which can be used to set
> > our user virtual address bits on a fine grained basis. Maybe a
> > sysctl, maybe a personality. Something out-of-band. I don't wan to
> > get too far into that discussion yet. First we need to agree whether
> > or not the code in the tree today is a problem.  
> 
> There is certainly more demand for similar functionality, like creating 
> mappings below 2 GB/4 GB/32 GB, and probably other bit patterns. 
> Hotspot would use this to place the heap with compressed oops, instead 
> of manually hunting for a suitable place for the mapping.  (Essentially, 
> 32-bit pointers on 64-bit architectures for sufficiently small heap 
> sizes.)  It would perhaps be possible to use the hints address as a 
> source of the bit count, for full flexibility.  And the mapping should 
> be placed into the upper half of the selected window if possible.
> 
> MAP_FIXED is near-impossible to use correctly.  I hope you don't expect 
> applications to do that.  If you want address-based opt in, it should 
> work without MAP_FIXED.  Sure, in obscure cases, applications might 
> still see out-of-range addresses, but I expected a full opt-out based on 
> RLIMIT_AS would be sufficient for them.

All good points, and no I don't think MAP_FIXED is a substitute.

I don't want to get too far ahead of ourselves, but I don't see why
some new interfaces with reasonably flexible and extensible ways to
specify VA behaviour is a bad idea. We already went through similar
with the ADDR_COMPAT_LAYOUT, and some various address space
randomization options, and now this, and they're all different...
As you noted, for many years now with compressed pointers and data
in pointers, address space bits have *mattered* to applications and
we can't change that.

Surely we're going to have to solve it *properly* sooner or later,
why not now? We could do both existing heuristic *and* make nicer
interfaces, but the >128TB support seems like a good place to do it
because very few apps will have to change code, and we can leave
behaviour exactly unchanged for 99.99999% that will never need so
much memory.

Thanks,
Nick

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: POWER: Unexpected fault when writing to brk-allocated memory
@ 2017-11-07  9:24                     ` Nicholas Piggin
  0 siblings, 0 replies; 71+ messages in thread
From: Nicholas Piggin @ 2017-11-07  9:24 UTC (permalink / raw)
  To: Florian Weimer
  Cc: Aneesh Kumar K.V, Kirill A. Shutemov, linuxppc-dev, linux-mm,
	Andrew Morton, Andy Lutomirski, Dave Hansen, Linus Torvalds,
	Peter Zijlstra, Thomas Gleixner, linux-arch, Ingo Molnar,
	Linux Kernel Mailing List

On Tue, 7 Nov 2017 09:15:21 +0100
Florian Weimer <fweimer@redhat.com> wrote:

> On 11/07/2017 06:07 AM, Nicholas Piggin wrote:
> 
> > First of all, using addr and MAP_FIXED to develop our heuristic can
> > never really give unchanged ABI. It's an in-band signal. brk() is a
> > good example that steadily keeps incrementing address, so depending
> > on malloc usage and address space randomization, you will get a brk()
> > that ends exactly at 128T, then the next one will be >
> > DEFAULT_MAP_WINDOW, and it will switch you to 56 bit address space.  
> 
> Note that this brk phenomenon is only a concern for some currently 
> obscure process memory layouts where the heap ends up at the top of the 
> address space.  Usually, there is something above it which eliminates 
> the possibility that it can cross into the 128 TiB wilderness.  So the 
> brk problem only happens on some architectures (e.g., not x86-64), and 
> only with strange ways of running programs (explicitly ld.so invocation 
> and likely static PIE, too).

That's true, but there was an ABI change and the result is that it
changed behaviour.

And actually if it were not for a powerpc bug that caused a segfault,
the allocation would have worked, and you probably would never have
filed the bug. However the program would have been given a pointer >
128TB, which is what we were trying to avoid -- if your app also put
metadata in the top of the pointers, it would have broken.

So, are any obscure apps that could break? I don't know. Maybe it's
quite unlikely. Is that enough to go ahead with changing behaviour?

I'm not going to put up a big fight about this if others feel it's
not a problem. I raise it because in debugging this and seeing the
changed behaviour, the differences between implementations (x86,
powerpc hash, and powerpc radix, all have different edge case
behaviour), and that none of them seem to conform exactly to the
heuristics described in the changelogs, and there seems to be no man
page updates that I can see, it raises some red flags.

So I'm calling for one last opportunity to 1) agree on the desired
behaviour, and 2) ensure implementations are conforming.

> > So unless everyone else thinks I'm crazy and disagrees, I'd ask for
> > a bit more time to make sure we get this interface right. I would
> > hope for something like prctl PR_SET_MM which can be used to set
> > our user virtual address bits on a fine grained basis. Maybe a
> > sysctl, maybe a personality. Something out-of-band. I don't wan to
> > get too far into that discussion yet. First we need to agree whether
> > or not the code in the tree today is a problem.  
> 
> There is certainly more demand for similar functionality, like creating 
> mappings below 2 GB/4 GB/32 GB, and probably other bit patterns. 
> Hotspot would use this to place the heap with compressed oops, instead 
> of manually hunting for a suitable place for the mapping.  (Essentially, 
> 32-bit pointers on 64-bit architectures for sufficiently small heap 
> sizes.)  It would perhaps be possible to use the hints address as a 
> source of the bit count, for full flexibility.  And the mapping should 
> be placed into the upper half of the selected window if possible.
> 
> MAP_FIXED is near-impossible to use correctly.  I hope you don't expect 
> applications to do that.  If you want address-based opt in, it should 
> work without MAP_FIXED.  Sure, in obscure cases, applications might 
> still see out-of-range addresses, but I expected a full opt-out based on 
> RLIMIT_AS would be sufficient for them.

All good points, and no I don't think MAP_FIXED is a substitute.

I don't want to get too far ahead of ourselves, but I don't see why
some new interfaces with reasonably flexible and extensible ways to
specify VA behaviour is a bad idea. We already went through similar
with the ADDR_COMPAT_LAYOUT, and some various address space
randomization options, and now this, and they're all different...
As you noted, for many years now with compressed pointers and data
in pointers, address space bits have *mattered* to applications and
we can't change that.

Surely we're going to have to solve it *properly* sooner or later,
why not now? We could do both existing heuristic *and* make nicer
interfaces, but the >128TB support seems like a good place to do it
because very few apps will have to change code, and we can leave
behaviour exactly unchanged for 99.99999% that will never need so
much memory.

Thanks,
Nick

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: POWER: Unexpected fault when writing to brk-allocated memory
  2017-11-07  5:07                 ` Nicholas Piggin
@ 2017-11-07 11:15                   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 71+ messages in thread
From: Kirill A. Shutemov @ 2017-11-07 11:15 UTC (permalink / raw)
  To: Nicholas Piggin
  Cc: Florian Weimer, Aneesh Kumar K.V, Kirill A. Shutemov,
	linuxppc-dev, linux-mm, Andrew Morton, Andy Lutomirski,
	Dave Hansen, Linus Torvalds, Peter Zijlstra, Thomas Gleixner,
	linux-arch, Ingo Molnar, Linux Kernel Mailing List

On Tue, Nov 07, 2017 at 04:07:05PM +1100, Nicholas Piggin wrote:
> C'ing everyone who was on the x86 56-bit user virtual address patch.
> 
> I think we need more time to discuss this behaviour, in light of the
> regression Florian uncovered. I would propose we turn off the 56-bit
> user virtual address support for x86 for 4.14, and powerpc would
> follow and turn off its 512T support until we can get a better handle
> on the problems. (Actually Florian initially hit a couple of bugs in
> powerpc implementation, but pulling that string uncovers a whole lot
> of difficulties.)
> 
> The bi-modal behavior switched based on a combination of mmap address
> hint and MAP_FIXED just sucks. It's segregating our VA space with
> some non-standard heuristics, and it doesn't seem to work very well.
> 
> What are we trying to do? Allow SAP HANA etc use huge address spaces
> by coding to these specific mmap heuristics we're going to add,
> rather than solving it properly in a way that requires adding a new
> syscall or personality or prctl or sysctl. Okay, but the cost is that
> despite best efforts, it still changes ABI behaviour for existing
> applications and these heuristics will become baked into the ABI that
> we will have to support. Not a good tradeoff IMO.
> 
> First of all, using addr and MAP_FIXED to develop our heuristic can
> never really give unchanged ABI. It's an in-band signal. brk() is a
> good example that steadily keeps incrementing address, so depending
> on malloc usage and address space randomization, you will get a brk()
> that ends exactly at 128T, then the next one will be >
> DEFAULT_MAP_WINDOW, and it will switch you to 56 bit address space.

No, it won't. You will hit stack first.

> Second, the kernel can never completely solve the problem this way.
> How do we know a malloc library will not ask for > 128TB addresses
> and pass them to an unknowing application?

The idea is that an application can provide hint (mallopt() ?) to malloc
implementation that it's ready to full address space. In this case, malloc
can use mmap((void *) -1,...) for its allocations and get full address
space this way.

> And lastly, there are a fair few bugs and places where description
> in changelogs and mailing lists does not match code. You don't want
> to know the mess in powerpc, but even x86 has two I can see:
> MAP_FIXED succeeds even when crossing 128TB addresses (where changelog
> indicated it should not),

Hm. I don't see where the changelog indicated that MAP_FIXED across 128TB
shouldn't work. My intention was that it should, although I haven't stated
it in the changelog.

The idea was we shouldn't allow to slip above 47-bits by accidentally.

Correctly functioning program would never request addr+len above 47-bit
with MAP_FIXED, unless it's ready to handle such addresses. Otherwise the
request would simply fail on machine that doesn't support large VA.

In contrast, addr+len above 47-bit without MAP_FIXED will not fail on
machine that doesn't support large VA, kernel will find another place
under 47-bit. And I can imagine a reasonable application that does
something like this.

So we cannot rely that application is ready to handle large
addresses if we see addr+len without MAP_FIXED.

> arch_get_unmapped_area_topdown() with an address hint is checking
> against TASK_SIZE rather than the limited 128TB address, so it looks
> like it won't follow the heuristics.

You are right. This is broken. If user would request mapping above vdso,
but below DEFAULT_MAP_WINDOW it will succeed.

I'll send patch to fix this. But it doesn't look as a show-stopper to me.

Re-checking things for this reply I found actual bug, see:

http://lkml.kernel.org/r/20171107103804.47341-1-kirill.shutemov@linux.intel.com

> So unless everyone else thinks I'm crazy and disagrees, I'd ask for
> a bit more time to make sure we get this interface right. I would
> hope for something like prctl PR_SET_MM which can be used to set
> our user virtual address bits on a fine grained basis. Maybe a
> sysctl, maybe a personality. Something out-of-band. I don't wan to
> get too far into that discussion yet. First we need to agree whether
> or not the code in the tree today is a problem.

Well, we've discussed before all options you are proposing.
Linus wanted a minimalistic interface, so we took this path for now.
We can always add more ways to get access to full address space later.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: POWER: Unexpected fault when writing to brk-allocated memory
@ 2017-11-07 11:15                   ` Kirill A. Shutemov
  0 siblings, 0 replies; 71+ messages in thread
From: Kirill A. Shutemov @ 2017-11-07 11:15 UTC (permalink / raw)
  To: Nicholas Piggin
  Cc: Florian Weimer, Aneesh Kumar K.V, Kirill A. Shutemov,
	linuxppc-dev, linux-mm, Andrew Morton, Andy Lutomirski,
	Dave Hansen, Linus Torvalds, Peter Zijlstra, Thomas Gleixner,
	linux-arch, Ingo Molnar, Linux Kernel Mailing List

On Tue, Nov 07, 2017 at 04:07:05PM +1100, Nicholas Piggin wrote:
> C'ing everyone who was on the x86 56-bit user virtual address patch.
> 
> I think we need more time to discuss this behaviour, in light of the
> regression Florian uncovered. I would propose we turn off the 56-bit
> user virtual address support for x86 for 4.14, and powerpc would
> follow and turn off its 512T support until we can get a better handle
> on the problems. (Actually Florian initially hit a couple of bugs in
> powerpc implementation, but pulling that string uncovers a whole lot
> of difficulties.)
> 
> The bi-modal behavior switched based on a combination of mmap address
> hint and MAP_FIXED just sucks. It's segregating our VA space with
> some non-standard heuristics, and it doesn't seem to work very well.
> 
> What are we trying to do? Allow SAP HANA etc use huge address spaces
> by coding to these specific mmap heuristics we're going to add,
> rather than solving it properly in a way that requires adding a new
> syscall or personality or prctl or sysctl. Okay, but the cost is that
> despite best efforts, it still changes ABI behaviour for existing
> applications and these heuristics will become baked into the ABI that
> we will have to support. Not a good tradeoff IMO.
> 
> First of all, using addr and MAP_FIXED to develop our heuristic can
> never really give unchanged ABI. It's an in-band signal. brk() is a
> good example that steadily keeps incrementing address, so depending
> on malloc usage and address space randomization, you will get a brk()
> that ends exactly at 128T, then the next one will be >
> DEFAULT_MAP_WINDOW, and it will switch you to 56 bit address space.

No, it won't. You will hit stack first.

> Second, the kernel can never completely solve the problem this way.
> How do we know a malloc library will not ask for > 128TB addresses
> and pass them to an unknowing application?

The idea is that an application can provide hint (mallopt() ?) to malloc
implementation that it's ready to full address space. In this case, malloc
can use mmap((void *) -1,...) for its allocations and get full address
space this way.

> And lastly, there are a fair few bugs and places where description
> in changelogs and mailing lists does not match code. You don't want
> to know the mess in powerpc, but even x86 has two I can see:
> MAP_FIXED succeeds even when crossing 128TB addresses (where changelog
> indicated it should not),

Hm. I don't see where the changelog indicated that MAP_FIXED across 128TB
shouldn't work. My intention was that it should, although I haven't stated
it in the changelog.

The idea was we shouldn't allow to slip above 47-bits by accidentally.

Correctly functioning program would never request addr+len above 47-bit
with MAP_FIXED, unless it's ready to handle such addresses. Otherwise the
request would simply fail on machine that doesn't support large VA.

In contrast, addr+len above 47-bit without MAP_FIXED will not fail on
machine that doesn't support large VA, kernel will find another place
under 47-bit. And I can imagine a reasonable application that does
something like this.

So we cannot rely that application is ready to handle large
addresses if we see addr+len without MAP_FIXED.

> arch_get_unmapped_area_topdown() with an address hint is checking
> against TASK_SIZE rather than the limited 128TB address, so it looks
> like it won't follow the heuristics.

You are right. This is broken. If user would request mapping above vdso,
but below DEFAULT_MAP_WINDOW it will succeed.

I'll send patch to fix this. But it doesn't look as a show-stopper to me.

Re-checking things for this reply I found actual bug, see:

http://lkml.kernel.org/r/20171107103804.47341-1-kirill.shutemov@linux.intel.com

> So unless everyone else thinks I'm crazy and disagrees, I'd ask for
> a bit more time to make sure we get this interface right. I would
> hope for something like prctl PR_SET_MM which can be used to set
> our user virtual address bits on a fine grained basis. Maybe a
> sysctl, maybe a personality. Something out-of-band. I don't wan to
> get too far into that discussion yet. First we need to agree whether
> or not the code in the tree today is a problem.

Well, we've discussed before all options you are proposing.
Linus wanted a minimalistic interface, so we took this path for now.
We can always add more ways to get access to full address space later.

-- 
 Kirill A. Shutemov

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: POWER: Unexpected fault when writing to brk-allocated memory
  2017-11-07  8:15                   ` Florian Weimer
@ 2017-11-07 11:16                     ` Kirill A. Shutemov
  -1 siblings, 0 replies; 71+ messages in thread
From: Kirill A. Shutemov @ 2017-11-07 11:16 UTC (permalink / raw)
  To: Florian Weimer
  Cc: Nicholas Piggin, Aneesh Kumar K.V, Kirill A. Shutemov,
	linuxppc-dev, linux-mm, Andrew Morton, Andy Lutomirski,
	Dave Hansen, Linus Torvalds, Peter Zijlstra, Thomas Gleixner,
	linux-arch, Ingo Molnar, Linux Kernel Mailing List

On Tue, Nov 07, 2017 at 09:15:21AM +0100, Florian Weimer wrote:
> MAP_FIXED is near-impossible to use correctly.  I hope you don't expect
> applications to do that.  If you want address-based opt in, it should work
> without MAP_FIXED.  Sure, in obscure cases, applications might still see
> out-of-range addresses, but I expected a full opt-out based on RLIMIT_AS
> would be sufficient for them.

Just use mmap(-1), without MAP_FIXED to get full address space.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: POWER: Unexpected fault when writing to brk-allocated memory
@ 2017-11-07 11:16                     ` Kirill A. Shutemov
  0 siblings, 0 replies; 71+ messages in thread
From: Kirill A. Shutemov @ 2017-11-07 11:16 UTC (permalink / raw)
  To: Florian Weimer
  Cc: Nicholas Piggin, Aneesh Kumar K.V, Kirill A. Shutemov,
	linuxppc-dev, linux-mm, Andrew Morton, Andy Lutomirski,
	Dave Hansen, Linus Torvalds, Peter Zijlstra, Thomas Gleixner,
	linux-arch, Ingo Molnar, Linux Kernel Mailing List

On Tue, Nov 07, 2017 at 09:15:21AM +0100, Florian Weimer wrote:
> MAP_FIXED is near-impossible to use correctly.  I hope you don't expect
> applications to do that.  If you want address-based opt in, it should work
> without MAP_FIXED.  Sure, in obscure cases, applications might still see
> out-of-range addresses, but I expected a full opt-out based on RLIMIT_AS
> would be sufficient for them.

Just use mmap(-1), without MAP_FIXED to get full address space.

-- 
 Kirill A. Shutemov

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: POWER: Unexpected fault when writing to brk-allocated memory
  2017-11-07 11:15                   ` Kirill A. Shutemov
@ 2017-11-07 11:26                     ` Florian Weimer
  -1 siblings, 0 replies; 71+ messages in thread
From: Florian Weimer @ 2017-11-07 11:26 UTC (permalink / raw)
  To: Kirill A. Shutemov, Nicholas Piggin
  Cc: Aneesh Kumar K.V, Kirill A. Shutemov, linuxppc-dev, linux-mm,
	Andrew Morton, Andy Lutomirski, Dave Hansen, Linus Torvalds,
	Peter Zijlstra, Thomas Gleixner, linux-arch, Ingo Molnar,
	Linux Kernel Mailing List

On 11/07/2017 12:15 PM, Kirill A. Shutemov wrote:

>> First of all, using addr and MAP_FIXED to develop our heuristic can
>> never really give unchanged ABI. It's an in-band signal. brk() is a
>> good example that steadily keeps incrementing address, so depending
>> on malloc usage and address space randomization, you will get a brk()
>> that ends exactly at 128T, then the next one will be >
>> DEFAULT_MAP_WINDOW, and it will switch you to 56 bit address space.
> 
> No, it won't. You will hit stack first.

That's not actually true on POWER in some cases.  See the process maps I 
posted here:

   <https://marc.info/?l=linuxppc-embedded&m=150988538106263&w=2>

Thanks,
Florian

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: POWER: Unexpected fault when writing to brk-allocated memory
@ 2017-11-07 11:26                     ` Florian Weimer
  0 siblings, 0 replies; 71+ messages in thread
From: Florian Weimer @ 2017-11-07 11:26 UTC (permalink / raw)
  To: Kirill A. Shutemov, Nicholas Piggin
  Cc: Aneesh Kumar K.V, Kirill A. Shutemov, linuxppc-dev, linux-mm,
	Andrew Morton, Andy Lutomirski, Dave Hansen, Linus Torvalds,
	Peter Zijlstra, Thomas Gleixner, linux-arch, Ingo Molnar,
	Linux Kernel Mailing List

On 11/07/2017 12:15 PM, Kirill A. Shutemov wrote:

>> First of all, using addr and MAP_FIXED to develop our heuristic can
>> never really give unchanged ABI. It's an in-band signal. brk() is a
>> good example that steadily keeps incrementing address, so depending
>> on malloc usage and address space randomization, you will get a brk()
>> that ends exactly at 128T, then the next one will be >
>> DEFAULT_MAP_WINDOW, and it will switch you to 56 bit address space.
> 
> No, it won't. You will hit stack first.

That's not actually true on POWER in some cases.  See the process maps I 
posted here:

   <https://marc.info/?l=linuxppc-embedded&m=150988538106263&w=2>

Thanks,
Florian

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: POWER: Unexpected fault when writing to brk-allocated memory
  2017-11-07 11:26                     ` Florian Weimer
@ 2017-11-07 11:44                       ` Kirill A. Shutemov
  -1 siblings, 0 replies; 71+ messages in thread
From: Kirill A. Shutemov @ 2017-11-07 11:44 UTC (permalink / raw)
  To: Florian Weimer
  Cc: Nicholas Piggin, Aneesh Kumar K.V, Kirill A. Shutemov,
	linuxppc-dev, linux-mm, Andrew Morton, Andy Lutomirski,
	Dave Hansen, Linus Torvalds, Peter Zijlstra, Thomas Gleixner,
	linux-arch, Ingo Molnar, Linux Kernel Mailing List

On Tue, Nov 07, 2017 at 12:26:12PM +0100, Florian Weimer wrote:
> On 11/07/2017 12:15 PM, Kirill A. Shutemov wrote:
> 
> > > First of all, using addr and MAP_FIXED to develop our heuristic can
> > > never really give unchanged ABI. It's an in-band signal. brk() is a
> > > good example that steadily keeps incrementing address, so depending
> > > on malloc usage and address space randomization, you will get a brk()
> > > that ends exactly at 128T, then the next one will be >
> > > DEFAULT_MAP_WINDOW, and it will switch you to 56 bit address space.
> > 
> > No, it won't. You will hit stack first.
> 
> That's not actually true on POWER in some cases.  See the process maps I
> posted here:
> 
>   <https://marc.info/?l=linuxppc-embedded&m=150988538106263&w=2>

Hm? I see that in all three cases the [stack] is the last mapping.
Do I miss something?

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: POWER: Unexpected fault when writing to brk-allocated memory
@ 2017-11-07 11:44                       ` Kirill A. Shutemov
  0 siblings, 0 replies; 71+ messages in thread
From: Kirill A. Shutemov @ 2017-11-07 11:44 UTC (permalink / raw)
  To: Florian Weimer
  Cc: Nicholas Piggin, Aneesh Kumar K.V, Kirill A. Shutemov,
	linuxppc-dev, linux-mm, Andrew Morton, Andy Lutomirski,
	Dave Hansen, Linus Torvalds, Peter Zijlstra, Thomas Gleixner,
	linux-arch, Ingo Molnar, Linux Kernel Mailing List

On Tue, Nov 07, 2017 at 12:26:12PM +0100, Florian Weimer wrote:
> On 11/07/2017 12:15 PM, Kirill A. Shutemov wrote:
> 
> > > First of all, using addr and MAP_FIXED to develop our heuristic can
> > > never really give unchanged ABI. It's an in-band signal. brk() is a
> > > good example that steadily keeps incrementing address, so depending
> > > on malloc usage and address space randomization, you will get a brk()
> > > that ends exactly at 128T, then the next one will be >
> > > DEFAULT_MAP_WINDOW, and it will switch you to 56 bit address space.
> > 
> > No, it won't. You will hit stack first.
> 
> That's not actually true on POWER in some cases.  See the process maps I
> posted here:
> 
>   <https://marc.info/?l=linuxppc-embedded&m=150988538106263&w=2>

Hm? I see that in all three cases the [stack] is the last mapping.
Do I miss something?

-- 
 Kirill A. Shutemov

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: POWER: Unexpected fault when writing to brk-allocated memory
  2017-11-07 11:15                   ` Kirill A. Shutemov
@ 2017-11-07 11:56                     ` Nicholas Piggin
  -1 siblings, 0 replies; 71+ messages in thread
From: Nicholas Piggin @ 2017-11-07 11:56 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Florian Weimer, Aneesh Kumar K.V, Kirill A. Shutemov,
	linuxppc-dev, linux-mm, Andrew Morton, Andy Lutomirski,
	Dave Hansen, Linus Torvalds, Peter Zijlstra, Thomas Gleixner,
	linux-arch, Ingo Molnar, Linux Kernel Mailing List

On Tue, 7 Nov 2017 14:15:43 +0300
"Kirill A. Shutemov" <kirill@shutemov.name> wrote:

> On Tue, Nov 07, 2017 at 04:07:05PM +1100, Nicholas Piggin wrote:
> > C'ing everyone who was on the x86 56-bit user virtual address patch.
> > 
> > I think we need more time to discuss this behaviour, in light of the
> > regression Florian uncovered. I would propose we turn off the 56-bit
> > user virtual address support for x86 for 4.14, and powerpc would
> > follow and turn off its 512T support until we can get a better handle
> > on the problems. (Actually Florian initially hit a couple of bugs in
> > powerpc implementation, but pulling that string uncovers a whole lot
> > of difficulties.)
> > 
> > The bi-modal behavior switched based on a combination of mmap address
> > hint and MAP_FIXED just sucks. It's segregating our VA space with
> > some non-standard heuristics, and it doesn't seem to work very well.
> > 
> > What are we trying to do? Allow SAP HANA etc use huge address spaces
> > by coding to these specific mmap heuristics we're going to add,
> > rather than solving it properly in a way that requires adding a new
> > syscall or personality or prctl or sysctl. Okay, but the cost is that
> > despite best efforts, it still changes ABI behaviour for existing
> > applications and these heuristics will become baked into the ABI that
> > we will have to support. Not a good tradeoff IMO.
> > 
> > First of all, using addr and MAP_FIXED to develop our heuristic can
> > never really give unchanged ABI. It's an in-band signal. brk() is a
> > good example that steadily keeps incrementing address, so depending
> > on malloc usage and address space randomization, you will get a brk()
> > that ends exactly at 128T, then the next one will be >
> > DEFAULT_MAP_WINDOW, and it will switch you to 56 bit address space.  
> 
> No, it won't. You will hit stack first.

I guess so. Florian's bug didn't crash there for some reason, okay
but I suppose my point about brk is not exactly where the standard
heap is, but the pattern of allocations. An allocator that uses
mmap for managing its address space might do the same thing, e.g.,
incrementally expand existing mmaps as necessary.

> > Second, the kernel can never completely solve the problem this way.
> > How do we know a malloc library will not ask for > 128TB addresses
> > and pass them to an unknowing application?  
> 
> The idea is that an application can provide hint (mallopt() ?) to malloc
> implementation that it's ready to full address space. In this case, malloc
> can use mmap((void *) -1,...) for its allocations and get full address
> space this way.

Point is, there's nothing stopping an allocator library or runtime
from asking for mmap anywhere and returning it to userspace.

Do > 128TB pointers matter so much that we should add this heuristic
to prevent breakage, but little enough that we can accept some rare
cases getting through? Genuine question.

> 
> > And lastly, there are a fair few bugs and places where description
> > in changelogs and mailing lists does not match code. You don't want
> > to know the mess in powerpc, but even x86 has two I can see:
> > MAP_FIXED succeeds even when crossing 128TB addresses (where changelog
> > indicated it should not),  
> 
> Hm. I don't see where the changelog indicated that MAP_FIXED across 128TB
> shouldn't work. My intention was that it should, although I haven't stated
> it in the changelog.

    To mitigate this, we are not going to allocate virtual address space
    above 47-bit by default.
    
    But userspace can ask for allocation from full address space by
    specifying hint address (with or without MAP_FIXED) above 47-bits.

Yet we got 48 bit address with 47 bit address (with MAP_FIXED).

> 
> The idea was we shouldn't allow to slip above 47-bits by accidentally.
> 
> Correctly functioning program would never request addr+len above 47-bit
> with MAP_FIXED, unless it's ready to handle such addresses. Otherwise the
> request would simply fail on machine that doesn't support large VA.
> 
> In contrast, addr+len above 47-bit without MAP_FIXED will not fail on
> machine that doesn't support large VA, kernel will find another place
> under 47-bit. And I can imagine a reasonable application that does
> something like this.
> 
> So we cannot rely that application is ready to handle large
> addresses if we see addr+len without MAP_FIXED.

By the same logic, a request for addr > 128TB without MAP_FIXED will
not fail, therefore we can't rely on that either.

Or an app that links to a library that attempts MAP_FIXED allocation
of addr + len above 128TB might use high bits of pointer returned by
that library because those are never satisfied today and the library
would fall back.

> 
> > arch_get_unmapped_area_topdown() with an address hint is checking
> > against TASK_SIZE rather than the limited 128TB address, so it looks
> > like it won't follow the heuristics.  
> 
> You are right. This is broken. If user would request mapping above vdso,
> but below DEFAULT_MAP_WINDOW it will succeed.
> 
> I'll send patch to fix this. But it doesn't look as a show-stopper to me.
> 
> Re-checking things for this reply I found actual bug, see:
> 
> http://lkml.kernel.org/r/20171107103804.47341-1-kirill.shutemov@linux.intel.com

It's not a show stopper per se, and of course I don't expect code to be
bug free, it's just concerning that we haven't had any kind of regression
testing or observations that have checked for this.

> 
> > So unless everyone else thinks I'm crazy and disagrees, I'd ask for
> > a bit more time to make sure we get this interface right. I would
> > hope for something like prctl PR_SET_MM which can be used to set
> > our user virtual address bits on a fine grained basis. Maybe a
> > sysctl, maybe a personality. Something out-of-band. I don't wan to
> > get too far into that discussion yet. First we need to agree whether
> > or not the code in the tree today is a problem.  
> 
> Well, we've discussed before all options you are proposing.
> Linus wanted a minimalistic interface, so we took this path for now.
> We can always add more ways to get access to full address space later.
> 

Right, I'm just raising it again with additional justifications. I don't
think it's the right way to go or has to be rushed now.

Thanks,
Nick

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: POWER: Unexpected fault when writing to brk-allocated memory
@ 2017-11-07 11:56                     ` Nicholas Piggin
  0 siblings, 0 replies; 71+ messages in thread
From: Nicholas Piggin @ 2017-11-07 11:56 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Florian Weimer, Aneesh Kumar K.V, Kirill A. Shutemov,
	linuxppc-dev, linux-mm, Andrew Morton, Andy Lutomirski,
	Dave Hansen, Linus Torvalds, Peter Zijlstra, Thomas Gleixner,
	linux-arch, Ingo Molnar, Linux Kernel Mailing List

On Tue, 7 Nov 2017 14:15:43 +0300
"Kirill A. Shutemov" <kirill@shutemov.name> wrote:

> On Tue, Nov 07, 2017 at 04:07:05PM +1100, Nicholas Piggin wrote:
> > C'ing everyone who was on the x86 56-bit user virtual address patch.
> > 
> > I think we need more time to discuss this behaviour, in light of the
> > regression Florian uncovered. I would propose we turn off the 56-bit
> > user virtual address support for x86 for 4.14, and powerpc would
> > follow and turn off its 512T support until we can get a better handle
> > on the problems. (Actually Florian initially hit a couple of bugs in
> > powerpc implementation, but pulling that string uncovers a whole lot
> > of difficulties.)
> > 
> > The bi-modal behavior switched based on a combination of mmap address
> > hint and MAP_FIXED just sucks. It's segregating our VA space with
> > some non-standard heuristics, and it doesn't seem to work very well.
> > 
> > What are we trying to do? Allow SAP HANA etc use huge address spaces
> > by coding to these specific mmap heuristics we're going to add,
> > rather than solving it properly in a way that requires adding a new
> > syscall or personality or prctl or sysctl. Okay, but the cost is that
> > despite best efforts, it still changes ABI behaviour for existing
> > applications and these heuristics will become baked into the ABI that
> > we will have to support. Not a good tradeoff IMO.
> > 
> > First of all, using addr and MAP_FIXED to develop our heuristic can
> > never really give unchanged ABI. It's an in-band signal. brk() is a
> > good example that steadily keeps incrementing address, so depending
> > on malloc usage and address space randomization, you will get a brk()
> > that ends exactly at 128T, then the next one will be >
> > DEFAULT_MAP_WINDOW, and it will switch you to 56 bit address space.  
> 
> No, it won't. You will hit stack first.

I guess so. Florian's bug didn't crash there for some reason, okay
but I suppose my point about brk is not exactly where the standard
heap is, but the pattern of allocations. An allocator that uses
mmap for managing its address space might do the same thing, e.g.,
incrementally expand existing mmaps as necessary.

> > Second, the kernel can never completely solve the problem this way.
> > How do we know a malloc library will not ask for > 128TB addresses
> > and pass them to an unknowing application?  
> 
> The idea is that an application can provide hint (mallopt() ?) to malloc
> implementation that it's ready to full address space. In this case, malloc
> can use mmap((void *) -1,...) for its allocations and get full address
> space this way.

Point is, there's nothing stopping an allocator library or runtime
from asking for mmap anywhere and returning it to userspace.

Do > 128TB pointers matter so much that we should add this heuristic
to prevent breakage, but little enough that we can accept some rare
cases getting through? Genuine question.

> 
> > And lastly, there are a fair few bugs and places where description
> > in changelogs and mailing lists does not match code. You don't want
> > to know the mess in powerpc, but even x86 has two I can see:
> > MAP_FIXED succeeds even when crossing 128TB addresses (where changelog
> > indicated it should not),  
> 
> Hm. I don't see where the changelog indicated that MAP_FIXED across 128TB
> shouldn't work. My intention was that it should, although I haven't stated
> it in the changelog.

    To mitigate this, we are not going to allocate virtual address space
    above 47-bit by default.
    
    But userspace can ask for allocation from full address space by
    specifying hint address (with or without MAP_FIXED) above 47-bits.

Yet we got 48 bit address with 47 bit address (with MAP_FIXED).

> 
> The idea was we shouldn't allow to slip above 47-bits by accidentally.
> 
> Correctly functioning program would never request addr+len above 47-bit
> with MAP_FIXED, unless it's ready to handle such addresses. Otherwise the
> request would simply fail on machine that doesn't support large VA.
> 
> In contrast, addr+len above 47-bit without MAP_FIXED will not fail on
> machine that doesn't support large VA, kernel will find another place
> under 47-bit. And I can imagine a reasonable application that does
> something like this.
> 
> So we cannot rely that application is ready to handle large
> addresses if we see addr+len without MAP_FIXED.

By the same logic, a request for addr > 128TB without MAP_FIXED will
not fail, therefore we can't rely on that either.

Or an app that links to a library that attempts MAP_FIXED allocation
of addr + len above 128TB might use high bits of pointer returned by
that library because those are never satisfied today and the library
would fall back.

> 
> > arch_get_unmapped_area_topdown() with an address hint is checking
> > against TASK_SIZE rather than the limited 128TB address, so it looks
> > like it won't follow the heuristics.  
> 
> You are right. This is broken. If user would request mapping above vdso,
> but below DEFAULT_MAP_WINDOW it will succeed.
> 
> I'll send patch to fix this. But it doesn't look as a show-stopper to me.
> 
> Re-checking things for this reply I found actual bug, see:
> 
> http://lkml.kernel.org/r/20171107103804.47341-1-kirill.shutemov@linux.intel.com

It's not a show stopper per se, and of course I don't expect code to be
bug free, it's just concerning that we haven't had any kind of regression
testing or observations that have checked for this.

> 
> > So unless everyone else thinks I'm crazy and disagrees, I'd ask for
> > a bit more time to make sure we get this interface right. I would
> > hope for something like prctl PR_SET_MM which can be used to set
> > our user virtual address bits on a fine grained basis. Maybe a
> > sysctl, maybe a personality. Something out-of-band. I don't wan to
> > get too far into that discussion yet. First we need to agree whether
> > or not the code in the tree today is a problem.  
> 
> Well, we've discussed before all options you are proposing.
> Linus wanted a minimalistic interface, so we took this path for now.
> We can always add more ways to get access to full address space later.
> 

Right, I'm just raising it again with additional justifications. I don't
think it's the right way to go or has to be rushed now.

Thanks,
Nick

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: POWER: Unexpected fault when writing to brk-allocated memory
  2017-11-07 11:56                     ` Nicholas Piggin
@ 2017-11-07 12:28                       ` Kirill A. Shutemov
  -1 siblings, 0 replies; 71+ messages in thread
From: Kirill A. Shutemov @ 2017-11-07 12:28 UTC (permalink / raw)
  To: Nicholas Piggin
  Cc: Florian Weimer, Aneesh Kumar K.V, Kirill A. Shutemov,
	linuxppc-dev, linux-mm, Andrew Morton, Andy Lutomirski,
	Dave Hansen, Linus Torvalds, Peter Zijlstra, Thomas Gleixner,
	linux-arch, Ingo Molnar, Linux Kernel Mailing List

On Tue, Nov 07, 2017 at 10:56:36PM +1100, Nicholas Piggin wrote:
> > No, it won't. You will hit stack first.
> 
> I guess so. Florian's bug didn't crash there for some reason, okay
> but I suppose my point about brk is not exactly where the standard
> heap is, but the pattern of allocations. An allocator that uses
> mmap for managing its address space might do the same thing, e.g.,
> incrementally expand existing mmaps as necessary.

With MAP_FIXED? I don't think so.

> > > Second, the kernel can never completely solve the problem this way.
> > > How do we know a malloc library will not ask for > 128TB addresses
> > > and pass them to an unknowing application?  
> > 
> > The idea is that an application can provide hint (mallopt() ?) to malloc
> > implementation that it's ready to full address space. In this case, malloc
> > can use mmap((void *) -1,...) for its allocations and get full address
> > space this way.
> 
> Point is, there's nothing stopping an allocator library or runtime
> from asking for mmap anywhere and returning it to userspace.

Right. Nobody would stop it from doing stupid things. There are many
things that a library may do that application would not be happy about.

> Do > 128TB pointers matter so much that we should add this heuristic
> to prevent breakage, but little enough that we can accept some rare
> cases getting through? Genuine question.

At the end of the day what matters is if heuristic helps prevent breakage
of existing userspace and doesn't stay in the way of legitimate use of
full address space.

So far, it looks okay to me.

> > The idea was we shouldn't allow to slip above 47-bits by accidentally.
> > 
> > Correctly functioning program would never request addr+len above 47-bit
> > with MAP_FIXED, unless it's ready to handle such addresses. Otherwise the
> > request would simply fail on machine that doesn't support large VA.
> > 
> > In contrast, addr+len above 47-bit without MAP_FIXED will not fail on
> > machine that doesn't support large VA, kernel will find another place
> > under 47-bit. And I can imagine a reasonable application that does
> > something like this.
> > 
> > So we cannot rely that application is ready to handle large
> > addresses if we see addr+len without MAP_FIXED.
> 
> By the same logic, a request for addr > 128TB without MAP_FIXED will
> not fail, therefore we can't rely on that either.
> 
> Or an app that links to a library that attempts MAP_FIXED allocation
> of addr + len above 128TB might use high bits of pointer returned by
> that library because those are never satisfied today and the library
> would fall back.

If you want to point that it's ABI break, yes it is.

But we allow ABI break as long as nobody notices. I think it's reasonable
to expect that nobody relies on such corner cases.

If we would find any piece of software affect by the change we would need
to reconsider.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: POWER: Unexpected fault when writing to brk-allocated memory
@ 2017-11-07 12:28                       ` Kirill A. Shutemov
  0 siblings, 0 replies; 71+ messages in thread
From: Kirill A. Shutemov @ 2017-11-07 12:28 UTC (permalink / raw)
  To: Nicholas Piggin
  Cc: Florian Weimer, Aneesh Kumar K.V, Kirill A. Shutemov,
	linuxppc-dev, linux-mm, Andrew Morton, Andy Lutomirski,
	Dave Hansen, Linus Torvalds, Peter Zijlstra, Thomas Gleixner,
	linux-arch, Ingo Molnar, Linux Kernel Mailing List

On Tue, Nov 07, 2017 at 10:56:36PM +1100, Nicholas Piggin wrote:
> > No, it won't. You will hit stack first.
> 
> I guess so. Florian's bug didn't crash there for some reason, okay
> but I suppose my point about brk is not exactly where the standard
> heap is, but the pattern of allocations. An allocator that uses
> mmap for managing its address space might do the same thing, e.g.,
> incrementally expand existing mmaps as necessary.

With MAP_FIXED? I don't think so.

> > > Second, the kernel can never completely solve the problem this way.
> > > How do we know a malloc library will not ask for > 128TB addresses
> > > and pass them to an unknowing application?  
> > 
> > The idea is that an application can provide hint (mallopt() ?) to malloc
> > implementation that it's ready to full address space. In this case, malloc
> > can use mmap((void *) -1,...) for its allocations and get full address
> > space this way.
> 
> Point is, there's nothing stopping an allocator library or runtime
> from asking for mmap anywhere and returning it to userspace.

Right. Nobody would stop it from doing stupid things. There are many
things that a library may do that application would not be happy about.

> Do > 128TB pointers matter so much that we should add this heuristic
> to prevent breakage, but little enough that we can accept some rare
> cases getting through? Genuine question.

At the end of the day what matters is if heuristic helps prevent breakage
of existing userspace and doesn't stay in the way of legitimate use of
full address space.

So far, it looks okay to me.

> > The idea was we shouldn't allow to slip above 47-bits by accidentally.
> > 
> > Correctly functioning program would never request addr+len above 47-bit
> > with MAP_FIXED, unless it's ready to handle such addresses. Otherwise the
> > request would simply fail on machine that doesn't support large VA.
> > 
> > In contrast, addr+len above 47-bit without MAP_FIXED will not fail on
> > machine that doesn't support large VA, kernel will find another place
> > under 47-bit. And I can imagine a reasonable application that does
> > something like this.
> > 
> > So we cannot rely that application is ready to handle large
> > addresses if we see addr+len without MAP_FIXED.
> 
> By the same logic, a request for addr > 128TB without MAP_FIXED will
> not fail, therefore we can't rely on that either.
> 
> Or an app that links to a library that attempts MAP_FIXED allocation
> of addr + len above 128TB might use high bits of pointer returned by
> that library because those are never satisfied today and the library
> would fall back.

If you want to point that it's ABI break, yes it is.

But we allow ABI break as long as nobody notices. I think it's reasonable
to expect that nobody relies on such corner cases.

If we would find any piece of software affect by the change we would need
to reconsider.

-- 
 Kirill A. Shutemov

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: POWER: Unexpected fault when writing to brk-allocated memory
  2017-11-07 11:44                       ` Kirill A. Shutemov
  (?)
@ 2017-11-07 13:05                       ` Florian Weimer
  2017-11-07 13:16                           ` Kirill A. Shutemov
  -1 siblings, 1 reply; 71+ messages in thread
From: Florian Weimer @ 2017-11-07 13:05 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Nicholas Piggin, Aneesh Kumar K.V, Kirill A. Shutemov,
	linuxppc-dev, linux-mm, Andrew Morton, Andy Lutomirski,
	Dave Hansen, Linus Torvalds, Peter Zijlstra, Thomas Gleixner,
	linux-arch, Ingo Molnar, Linux Kernel Mailing List

[-- Attachment #1: Type: text/plain, Size: 1083 bytes --]

On 11/07/2017 12:44 PM, Kirill A. Shutemov wrote:
> On Tue, Nov 07, 2017 at 12:26:12PM +0100, Florian Weimer wrote:
>> On 11/07/2017 12:15 PM, Kirill A. Shutemov wrote:
>>
>>>> First of all, using addr and MAP_FIXED to develop our heuristic can
>>>> never really give unchanged ABI. It's an in-band signal. brk() is a
>>>> good example that steadily keeps incrementing address, so depending
>>>> on malloc usage and address space randomization, you will get a brk()
>>>> that ends exactly at 128T, then the next one will be >
>>>> DEFAULT_MAP_WINDOW, and it will switch you to 56 bit address space.
>>>
>>> No, it won't. You will hit stack first.
>>
>> That's not actually true on POWER in some cases.  See the process maps I
>> posted here:
>>
>>    <https://marc.info/?l=linuxppc-embedded&m=150988538106263&w=2>
> 
> Hm? I see that in all three cases the [stack] is the last mapping.
> Do I miss something?

Hah, I had not noticed.  Occasionally, the order of heap and stack is 
reversed.  This happens in approximately 15% of the runs.

See the attached example.

Thanks,
Florian

[-- Attachment #2: maps.txt --]
[-- Type: text/plain, Size: 2684 bytes --]

7fffacc50000-7fffacc90000 rw-p 00000000 00:00 0 
7fffacc90000-7fffaccf0000 r--p 00000000 fd:00 25167925                   /usr/lib/locale/en_US.utf8/LC_CTYPE
7fffaccf0000-7fffacd00000 r--p 00000000 fd:00 25167928                   /usr/lib/locale/en_US.utf8/LC_NUMERIC
7fffacd00000-7fffacd10000 r--p 00000000 fd:00 16798929                   /usr/lib/locale/en_US.utf8/LC_TIME
7fffacd10000-7ffface40000 r--p 00000000 fd:00 25167924                   /usr/lib/locale/en_US.utf8/LC_COLLATE
7ffface40000-7ffface50000 r--p 00000000 fd:00 16798927                   /usr/lib/locale/en_US.utf8/LC_MONETARY
7ffface50000-7ffface60000 r--p 00000000 fd:00 2511                       /usr/lib/locale/en_US.utf8/LC_MESSAGES/SYS_LC_MESSAGES
7ffface60000-7ffface70000 r--p 00000000 fd:00 16798942                   /usr/lib/locale/en_US.utf8/LC_PAPER
7ffface70000-7ffface80000 r--p 00000000 fd:00 25167927                   /usr/lib/locale/en_US.utf8/LC_NAME
7ffface80000-7ffface90000 r--p 00000000 fd:00 16798924                   /usr/lib/locale/en_US.utf8/LC_ADDRESS
7ffface90000-7fffacea0000 r--p 00000000 fd:00 16798928                   /usr/lib/locale/en_US.utf8/LC_TELEPHONE
7fffacea0000-7fffaceb0000 r--p 00000000 fd:00 16798926                   /usr/lib/locale/en_US.utf8/LC_MEASUREMENT
7fffaceb0000-7fffacec0000 r--s 00000000 fd:00 8390657                    /usr/lib64/gconv/gconv-modules.cache
7fffacec0000-7fffad0d0000 r-xp 00000000 fd:00 8390335                    /usr/lib64/libc-2.25.so
7fffad0d0000-7fffad0e0000 ---p 00210000 fd:00 8390335                    /usr/lib64/libc-2.25.so
7fffad0e0000-7fffad0f0000 r--p 00210000 fd:00 8390335                    /usr/lib64/libc-2.25.so
7fffad0f0000-7fffad100000 rw-p 00220000 fd:00 8390335                    /usr/lib64/libc-2.25.so
7fffad100000-7fffad110000 r--p 00000000 fd:00 16798925                   /usr/lib/locale/en_US.utf8/LC_IDENTIFICATION
7fffad110000-7fffad120000 r-xp 00000000 fd:00 63543                      /usr/bin/cat
7fffad120000-7fffad130000 r--p 00000000 fd:00 63543                      /usr/bin/cat
7fffad130000-7fffad140000 rw-p 00010000 fd:00 63543                      /usr/bin/cat
7fffad140000-7fffad160000 r-xp 00000000 00:00 0                          [vdso]
7fffad160000-7fffad1a0000 r-xp 00000000 fd:00 8390328                    /usr/lib64/ld-2.25.so
7fffad1a0000-7fffad1b0000 r--p 00030000 fd:00 8390328                    /usr/lib64/ld-2.25.so
7fffad1b0000-7fffad1c0000 rw-p 00040000 fd:00 8390328                    /usr/lib64/ld-2.25.so
7fffc2cf0000-7fffc2d20000 rw-p 00000000 00:00 0                          [stack]
7fffc8c10000-7fffc8c40000 rw-p 00000000 00:00 0                          [heap]

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: POWER: Unexpected fault when writing to brk-allocated memory
  2017-11-07 13:05                       ` Florian Weimer
@ 2017-11-07 13:16                           ` Kirill A. Shutemov
  0 siblings, 0 replies; 71+ messages in thread
From: Kirill A. Shutemov @ 2017-11-07 13:16 UTC (permalink / raw)
  To: Florian Weimer
  Cc: Nicholas Piggin, Aneesh Kumar K.V, Kirill A. Shutemov,
	linuxppc-dev, linux-mm, Andrew Morton, Andy Lutomirski,
	Dave Hansen, Linus Torvalds, Peter Zijlstra, Thomas Gleixner,
	linux-arch, Ingo Molnar, Linux Kernel Mailing List

On Tue, Nov 07, 2017 at 02:05:42PM +0100, Florian Weimer wrote:
> On 11/07/2017 12:44 PM, Kirill A. Shutemov wrote:
> > On Tue, Nov 07, 2017 at 12:26:12PM +0100, Florian Weimer wrote:
> > > On 11/07/2017 12:15 PM, Kirill A. Shutemov wrote:
> > > 
> > > > > First of all, using addr and MAP_FIXED to develop our heuristic can
> > > > > never really give unchanged ABI. It's an in-band signal. brk() is a
> > > > > good example that steadily keeps incrementing address, so depending
> > > > > on malloc usage and address space randomization, you will get a brk()
> > > > > that ends exactly at 128T, then the next one will be >
> > > > > DEFAULT_MAP_WINDOW, and it will switch you to 56 bit address space.
> > > > 
> > > > No, it won't. You will hit stack first.
> > > 
> > > That's not actually true on POWER in some cases.  See the process maps I
> > > posted here:
> > > 
> > >    <https://marc.info/?l=linuxppc-embedded&m=150988538106263&w=2>
> > 
> > Hm? I see that in all three cases the [stack] is the last mapping.
> > Do I miss something?
> 
> Hah, I had not noticed.  Occasionally, the order of heap and stack is
> reversed.  This happens in approximately 15% of the runs.

Heh. I guess ASLR on Power is too fancy :)

That's strange layout. It doesn't give that much (relatively speaking)
virtual address space for both stack and heap to grow.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: POWER: Unexpected fault when writing to brk-allocated memory
@ 2017-11-07 13:16                           ` Kirill A. Shutemov
  0 siblings, 0 replies; 71+ messages in thread
From: Kirill A. Shutemov @ 2017-11-07 13:16 UTC (permalink / raw)
  To: Florian Weimer
  Cc: Nicholas Piggin, Aneesh Kumar K.V, Kirill A. Shutemov,
	linuxppc-dev, linux-mm, Andrew Morton, Andy Lutomirski,
	Dave Hansen, Linus Torvalds, Peter Zijlstra, Thomas Gleixner,
	linux-arch, Ingo Molnar, Linux Kernel Mailing List

On Tue, Nov 07, 2017 at 02:05:42PM +0100, Florian Weimer wrote:
> On 11/07/2017 12:44 PM, Kirill A. Shutemov wrote:
> > On Tue, Nov 07, 2017 at 12:26:12PM +0100, Florian Weimer wrote:
> > > On 11/07/2017 12:15 PM, Kirill A. Shutemov wrote:
> > > 
> > > > > First of all, using addr and MAP_FIXED to develop our heuristic can
> > > > > never really give unchanged ABI. It's an in-band signal. brk() is a
> > > > > good example that steadily keeps incrementing address, so depending
> > > > > on malloc usage and address space randomization, you will get a brk()
> > > > > that ends exactly at 128T, then the next one will be >
> > > > > DEFAULT_MAP_WINDOW, and it will switch you to 56 bit address space.
> > > > 
> > > > No, it won't. You will hit stack first.
> > > 
> > > That's not actually true on POWER in some cases.  See the process maps I
> > > posted here:
> > > 
> > >    <https://marc.info/?l=linuxppc-embedded&m=150988538106263&w=2>
> > 
> > Hm? I see that in all three cases the [stack] is the last mapping.
> > Do I miss something?
> 
> Hah, I had not noticed.  Occasionally, the order of heap and stack is
> reversed.  This happens in approximately 15% of the runs.

Heh. I guess ASLR on Power is too fancy :)

That's strange layout. It doesn't give that much (relatively speaking)
virtual address space for both stack and heap to grow.

-- 
 Kirill A. Shutemov

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: POWER: Unexpected fault when writing to brk-allocated memory
  2017-11-07 12:28                       ` Kirill A. Shutemov
@ 2017-11-07 13:33                         ` Nicholas Piggin
  -1 siblings, 0 replies; 71+ messages in thread
From: Nicholas Piggin @ 2017-11-07 13:33 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Florian Weimer, Aneesh Kumar K.V, Kirill A. Shutemov,
	linuxppc-dev, linux-mm, Andrew Morton, Andy Lutomirski,
	Dave Hansen, Linus Torvalds, Peter Zijlstra, Thomas Gleixner,
	linux-arch, Ingo Molnar, Linux Kernel Mailing List

On Tue, 7 Nov 2017 15:28:25 +0300
"Kirill A. Shutemov" <kirill@shutemov.name> wrote:

> On Tue, Nov 07, 2017 at 10:56:36PM +1100, Nicholas Piggin wrote:
> > > No, it won't. You will hit stack first.  
> > 
> > I guess so. Florian's bug didn't crash there for some reason, okay
> > but I suppose my point about brk is not exactly where the standard
> > heap is, but the pattern of allocations. An allocator that uses
> > mmap for managing its address space might do the same thing, e.g.,
> > incrementally expand existing mmaps as necessary.  
> 
> With MAP_FIXED? I don't think so.

brk() based allocator effectively usees MAP_FIXED. If you know where
your addresses are, using MAP_FIXED can be used.

But okay let's ignore MAP_FIXED as a corner case. Then what happens
with !MAP_FIXED when an allocation ends exactly at 128TB and then next
one begins at 128TB? Won't that expand the address space? Should we
ignore that corner case too?

> 
> > > > Second, the kernel can never completely solve the problem this way.
> > > > How do we know a malloc library will not ask for > 128TB addresses
> > > > and pass them to an unknowing application?    
> > > 
> > > The idea is that an application can provide hint (mallopt() ?) to malloc
> > > implementation that it's ready to full address space. In this case, malloc
> > > can use mmap((void *) -1,...) for its allocations and get full address
> > > space this way.  
> > 
> > Point is, there's nothing stopping an allocator library or runtime
> > from asking for mmap anywhere and returning it to userspace.  
> 
> Right. Nobody would stop it from doing stupid things. There are many
> things that a library may do that application would not be happy about.

Indeed.

> 
> > Do > 128TB pointers matter so much that we should add this heuristic
> > to prevent breakage, but little enough that we can accept some rare
> > cases getting through? Genuine question.  
> 
> At the end of the day what matters is if heuristic helps prevent breakage
> of existing userspace and doesn't stay in the way of legitimate use of
> full address space.
> 
> So far, it looks okay to me.

Well that wasn't really the point of my question. Yes of course that
is important. But the question is how are these heuristics chosen and
evaluated? Why is this a good change to make?

We've decided there is some benefit from preventing 128TB pointers, but
also not enough that we have to completely prevent them accidentally
being returned. Yet it's important enough to make mmap behaviour diverge
in about 5 ways around 128TB depending on what combination of address
and length and MAP_FIXED you specify, and introducing this new way to
use the interface to get an expanded address space?

And we're doing this because we don't want to add a prctl or personality
or whatever for SAP HANA, because it was decided that would be too complex?

> 
> > > The idea was we shouldn't allow to slip above 47-bits by accidentally.
> > > 
> > > Correctly functioning program would never request addr+len above 47-bit
> > > with MAP_FIXED, unless it's ready to handle such addresses. Otherwise the
> > > request would simply fail on machine that doesn't support large VA.
> > > 
> > > In contrast, addr+len above 47-bit without MAP_FIXED will not fail on
> > > machine that doesn't support large VA, kernel will find another place
> > > under 47-bit. And I can imagine a reasonable application that does
> > > something like this.
> > > 
> > > So we cannot rely that application is ready to handle large
> > > addresses if we see addr+len without MAP_FIXED.  
> > 
> > By the same logic, a request for addr > 128TB without MAP_FIXED will
> > not fail, therefore we can't rely on that either.
> > 
> > Or an app that links to a library that attempts MAP_FIXED allocation
> > of addr + len above 128TB might use high bits of pointer returned by
> > that library because those are never satisfied today and the library
> > would fall back.  
> 
> If you want to point that it's ABI break, yes it is.
> 
> But we allow ABI break as long as nobody notices. I think it's reasonable
> to expect that nobody relies on such corner cases.

I accept the point, but I do worry mmap syscall is very fundamental,
and that it is used in a lot of ways that are difficult to foresee,
so there's a lot of room for unintended consequences. Plus the state
of the code (that's not to pick on x86, powerpc is worse) is worrying.

> 
> If we would find any piece of software affect by the change we would need
> to reconsider.
> 

Problem is that if there was an issue caused by this, it's unlikely
to be found until a long time later. By that time, hopefully there
aren't too many other apps that are now rely on the behaviour if it
has to be changed.

I just don't see why that's a risk worth taking at all. I can't see how
the cost benefit is there. I have to confess I could have been more
involved in discussion, but is any other interface honestly too complex
to add?

If it is decided to keep these kind of heuristics, can we get just a
small but reasonably precise description of each change to the
interface and ways for using the new functionality, such that would be
suitable for the man page? I couldn't fix powerpc because nothing
matches and even Aneesh and you differ on some details (MAP_FIXED
behaviour).

Thanks,
Nick

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: POWER: Unexpected fault when writing to brk-allocated memory
@ 2017-11-07 13:33                         ` Nicholas Piggin
  0 siblings, 0 replies; 71+ messages in thread
From: Nicholas Piggin @ 2017-11-07 13:33 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Florian Weimer, Aneesh Kumar K.V, Kirill A. Shutemov,
	linuxppc-dev, linux-mm, Andrew Morton, Andy Lutomirski,
	Dave Hansen, Linus Torvalds, Peter Zijlstra, Thomas Gleixner,
	linux-arch, Ingo Molnar, Linux Kernel Mailing List

On Tue, 7 Nov 2017 15:28:25 +0300
"Kirill A. Shutemov" <kirill@shutemov.name> wrote:

> On Tue, Nov 07, 2017 at 10:56:36PM +1100, Nicholas Piggin wrote:
> > > No, it won't. You will hit stack first.  
> > 
> > I guess so. Florian's bug didn't crash there for some reason, okay
> > but I suppose my point about brk is not exactly where the standard
> > heap is, but the pattern of allocations. An allocator that uses
> > mmap for managing its address space might do the same thing, e.g.,
> > incrementally expand existing mmaps as necessary.  
> 
> With MAP_FIXED? I don't think so.

brk() based allocator effectively usees MAP_FIXED. If you know where
your addresses are, using MAP_FIXED can be used.

But okay let's ignore MAP_FIXED as a corner case. Then what happens
with !MAP_FIXED when an allocation ends exactly at 128TB and then next
one begins at 128TB? Won't that expand the address space? Should we
ignore that corner case too?

> 
> > > > Second, the kernel can never completely solve the problem this way.
> > > > How do we know a malloc library will not ask for > 128TB addresses
> > > > and pass them to an unknowing application?    
> > > 
> > > The idea is that an application can provide hint (mallopt() ?) to malloc
> > > implementation that it's ready to full address space. In this case, malloc
> > > can use mmap((void *) -1,...) for its allocations and get full address
> > > space this way.  
> > 
> > Point is, there's nothing stopping an allocator library or runtime
> > from asking for mmap anywhere and returning it to userspace.  
> 
> Right. Nobody would stop it from doing stupid things. There are many
> things that a library may do that application would not be happy about.

Indeed.

> 
> > Do > 128TB pointers matter so much that we should add this heuristic
> > to prevent breakage, but little enough that we can accept some rare
> > cases getting through? Genuine question.  
> 
> At the end of the day what matters is if heuristic helps prevent breakage
> of existing userspace and doesn't stay in the way of legitimate use of
> full address space.
> 
> So far, it looks okay to me.

Well that wasn't really the point of my question. Yes of course that
is important. But the question is how are these heuristics chosen and
evaluated? Why is this a good change to make?

We've decided there is some benefit from preventing 128TB pointers, but
also not enough that we have to completely prevent them accidentally
being returned. Yet it's important enough to make mmap behaviour diverge
in about 5 ways around 128TB depending on what combination of address
and length and MAP_FIXED you specify, and introducing this new way to
use the interface to get an expanded address space?

And we're doing this because we don't want to add a prctl or personality
or whatever for SAP HANA, because it was decided that would be too complex?

> 
> > > The idea was we shouldn't allow to slip above 47-bits by accidentally.
> > > 
> > > Correctly functioning program would never request addr+len above 47-bit
> > > with MAP_FIXED, unless it's ready to handle such addresses. Otherwise the
> > > request would simply fail on machine that doesn't support large VA.
> > > 
> > > In contrast, addr+len above 47-bit without MAP_FIXED will not fail on
> > > machine that doesn't support large VA, kernel will find another place
> > > under 47-bit. And I can imagine a reasonable application that does
> > > something like this.
> > > 
> > > So we cannot rely that application is ready to handle large
> > > addresses if we see addr+len without MAP_FIXED.  
> > 
> > By the same logic, a request for addr > 128TB without MAP_FIXED will
> > not fail, therefore we can't rely on that either.
> > 
> > Or an app that links to a library that attempts MAP_FIXED allocation
> > of addr + len above 128TB might use high bits of pointer returned by
> > that library because those are never satisfied today and the library
> > would fall back.  
> 
> If you want to point that it's ABI break, yes it is.
> 
> But we allow ABI break as long as nobody notices. I think it's reasonable
> to expect that nobody relies on such corner cases.

I accept the point, but I do worry mmap syscall is very fundamental,
and that it is used in a lot of ways that are difficult to foresee,
so there's a lot of room for unintended consequences. Plus the state
of the code (that's not to pick on x86, powerpc is worse) is worrying.

> 
> If we would find any piece of software affect by the change we would need
> to reconsider.
> 

Problem is that if there was an issue caused by this, it's unlikely
to be found until a long time later. By that time, hopefully there
aren't too many other apps that are now rely on the behaviour if it
has to be changed.

I just don't see why that's a risk worth taking at all. I can't see how
the cost benefit is there. I have to confess I could have been more
involved in discussion, but is any other interface honestly too complex
to add?

If it is decided to keep these kind of heuristics, can we get just a
small but reasonably precise description of each change to the
interface and ways for using the new functionality, such that would be
suitable for the man page? I couldn't fix powerpc because nothing
matches and even Aneesh and you differ on some details (MAP_FIXED
behaviour).

Thanks,
Nick

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: POWER: Unexpected fault when writing to brk-allocated memory
  2017-11-07 13:33                         ` Nicholas Piggin
@ 2017-11-07 13:45                           ` Aneesh Kumar K.V
  -1 siblings, 0 replies; 71+ messages in thread
From: Aneesh Kumar K.V @ 2017-11-07 13:45 UTC (permalink / raw)
  To: Nicholas Piggin, Kirill A. Shutemov
  Cc: Florian Weimer, Kirill A. Shutemov, linuxppc-dev, linux-mm,
	Andrew Morton, Andy Lutomirski, Dave Hansen, Linus Torvalds,
	Peter Zijlstra, Thomas Gleixner, linux-arch, Ingo Molnar,
	Linux Kernel Mailing List


> 
> If it is decided to keep these kind of heuristics, can we get just a
> small but reasonably precise description of each change to the
> interface and ways for using the new functionality, such that would be
> suitable for the man page? I couldn't fix powerpc because nothing
> matches and even Aneesh and you differ on some details (MAP_FIXED
> behaviour).


I would consider MAP_FIXED as my mistake. We never discussed this 
explicitly and I kind of assumed it to behave the same way. ie, we 
search in lower address space (128TB) if the hint addr is below 128TB.

IIUC we agree on the below.

1) MAP_FIXED allow the addr to be used, even if hint addr is below 128TB 
but hint_addr + len is > 128TB.

2) For everything else we search in < 128TB space if hint addr is below 
128TB

3) We don't switch to large address space if hint_addr + len > 128TB. 
The decision to switch to large address space is primarily based on hint 
addr

Is there any other rule we need to outline? Or is any of the above not 
correct?

-aneesh

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: POWER: Unexpected fault when writing to brk-allocated memory
@ 2017-11-07 13:45                           ` Aneesh Kumar K.V
  0 siblings, 0 replies; 71+ messages in thread
From: Aneesh Kumar K.V @ 2017-11-07 13:45 UTC (permalink / raw)
  To: Nicholas Piggin, Kirill A. Shutemov
  Cc: Florian Weimer, Kirill A. Shutemov, linuxppc-dev, linux-mm,
	Andrew Morton, Andy Lutomirski, Dave Hansen, Linus Torvalds,
	Peter Zijlstra, Thomas Gleixner, linux-arch, Ingo Molnar,
	Linux Kernel Mailing List


> 
> If it is decided to keep these kind of heuristics, can we get just a
> small but reasonably precise description of each change to the
> interface and ways for using the new functionality, such that would be
> suitable for the man page? I couldn't fix powerpc because nothing
> matches and even Aneesh and you differ on some details (MAP_FIXED
> behaviour).


I would consider MAP_FIXED as my mistake. We never discussed this 
explicitly and I kind of assumed it to behave the same way. ie, we 
search in lower address space (128TB) if the hint addr is below 128TB.

IIUC we agree on the below.

1) MAP_FIXED allow the addr to be used, even if hint addr is below 128TB 
but hint_addr + len is > 128TB.

2) For everything else we search in < 128TB space if hint addr is below 
128TB

3) We don't switch to large address space if hint_addr + len > 128TB. 
The decision to switch to large address space is primarily based on hint 
addr

Is there any other rule we need to outline? Or is any of the above not 
correct?

-aneesh

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: POWER: Unexpected fault when writing to brk-allocated memory
  2017-11-07 13:45                           ` Aneesh Kumar K.V
@ 2017-11-07 14:01                             ` Kirill A. Shutemov
  -1 siblings, 0 replies; 71+ messages in thread
From: Kirill A. Shutemov @ 2017-11-07 14:01 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: Nicholas Piggin, Florian Weimer, Kirill A. Shutemov,
	linuxppc-dev, linux-mm, Andrew Morton, Andy Lutomirski,
	Dave Hansen, Linus Torvalds, Peter Zijlstra, Thomas Gleixner,
	linux-arch, Ingo Molnar, Linux Kernel Mailing List

On Tue, Nov 07, 2017 at 07:15:58PM +0530, Aneesh Kumar K.V wrote:
> 
> > 
> > If it is decided to keep these kind of heuristics, can we get just a
> > small but reasonably precise description of each change to the
> > interface and ways for using the new functionality, such that would be
> > suitable for the man page? I couldn't fix powerpc because nothing
> > matches and even Aneesh and you differ on some details (MAP_FIXED
> > behaviour).
> 
> 
> I would consider MAP_FIXED as my mistake. We never discussed this explicitly
> and I kind of assumed it to behave the same way. ie, we search in lower
> address space (128TB) if the hint addr is below 128TB.
> 
> IIUC we agree on the below.
> 
> 1) MAP_FIXED allow the addr to be used, even if hint addr is below 128TB but
> hint_addr + len is > 128TB.
> 
> 2) For everything else we search in < 128TB space if hint addr is below
> 128TB
> 
> 3) We don't switch to large address space if hint_addr + len > 128TB. The
> decision to switch to large address space is primarily based on hint addr
> 
> Is there any other rule we need to outline? Or is any of the above not
> correct?

That's correct.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: POWER: Unexpected fault when writing to brk-allocated memory
@ 2017-11-07 14:01                             ` Kirill A. Shutemov
  0 siblings, 0 replies; 71+ messages in thread
From: Kirill A. Shutemov @ 2017-11-07 14:01 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: Nicholas Piggin, Florian Weimer, Kirill A. Shutemov,
	linuxppc-dev, linux-mm, Andrew Morton, Andy Lutomirski,
	Dave Hansen, Linus Torvalds, Peter Zijlstra, Thomas Gleixner,
	linux-arch, Ingo Molnar, Linux Kernel Mailing List

On Tue, Nov 07, 2017 at 07:15:58PM +0530, Aneesh Kumar K.V wrote:
> 
> > 
> > If it is decided to keep these kind of heuristics, can we get just a
> > small but reasonably precise description of each change to the
> > interface and ways for using the new functionality, such that would be
> > suitable for the man page? I couldn't fix powerpc because nothing
> > matches and even Aneesh and you differ on some details (MAP_FIXED
> > behaviour).
> 
> 
> I would consider MAP_FIXED as my mistake. We never discussed this explicitly
> and I kind of assumed it to behave the same way. ie, we search in lower
> address space (128TB) if the hint addr is below 128TB.
> 
> IIUC we agree on the below.
> 
> 1) MAP_FIXED allow the addr to be used, even if hint addr is below 128TB but
> hint_addr + len is > 128TB.
> 
> 2) For everything else we search in < 128TB space if hint addr is below
> 128TB
> 
> 3) We don't switch to large address space if hint_addr + len > 128TB. The
> decision to switch to large address space is primarily based on hint addr
> 
> Is there any other rule we need to outline? Or is any of the above not
> correct?

That's correct.

-- 
 Kirill A. Shutemov

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: POWER: Unexpected fault when writing to brk-allocated memory
  2017-11-07 13:45                           ` Aneesh Kumar K.V
@ 2017-11-08  4:56                             ` Michael Ellerman
  -1 siblings, 0 replies; 71+ messages in thread
From: Michael Ellerman @ 2017-11-08  4:56 UTC (permalink / raw)
  To: Aneesh Kumar K.V, Nicholas Piggin, Kirill A. Shutemov
  Cc: Florian Weimer, linux-arch, Dave Hansen, Peter Zijlstra,
	Linus Torvalds, Ingo Molnar, Linux Kernel Mailing List,
	Andy Lutomirski, linux-mm, Andrew Morton, linuxppc-dev,
	Thomas Gleixner, Kirill A. Shutemov

"Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> writes:

>> 
>> If it is decided to keep these kind of heuristics, can we get just a
>> small but reasonably precise description of each change to the
>> interface and ways for using the new functionality, such that would be
>> suitable for the man page? I couldn't fix powerpc because nothing
>> matches and even Aneesh and you differ on some details (MAP_FIXED
>> behaviour).
>
>
> I would consider MAP_FIXED as my mistake. We never discussed this 
> explicitly and I kind of assumed it to behave the same way. ie, we 
> search in lower address space (128TB) if the hint addr is below 128TB.
>
> IIUC we agree on the below.
>
> 1) MAP_FIXED allow the addr to be used, even if hint addr is below 128TB 
> but hint_addr + len is > 128TB.

So:
  mmap(0x7ffffffff000, 0x2000, ..., MAP_FIXED ...) = 0x7ffffffff000

> 2) For everything else we search in < 128TB space if hint addr is below 
> 128TB

  mmap((x < 128T), 0x1000, ...) = (y < 128T)
  ...
  mmap(0x7ffffffff000, 0x1000, ...) = 0x7ffffffff000
  mmap(0x800000000000, 0x1000, ...) = 0x800000000000
  ...
  mmap((x >= 128T), 0x1000, ...) = (y >= 128T)

> 3) We don't switch to large address space if hint_addr + len > 128TB. 
> The decision to switch to large address space is primarily based on hint 
> addr

But does the mmap succeed in that case or not?

ie:  mmap(0x7ffffffff000, 0x2000, ...) = ?

cheers

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: POWER: Unexpected fault when writing to brk-allocated memory
@ 2017-11-08  4:56                             ` Michael Ellerman
  0 siblings, 0 replies; 71+ messages in thread
From: Michael Ellerman @ 2017-11-08  4:56 UTC (permalink / raw)
  To: Aneesh Kumar K.V, Nicholas Piggin, Kirill A. Shutemov
  Cc: Florian Weimer, linux-arch, Dave Hansen, Peter Zijlstra,
	Linus Torvalds, Ingo Molnar, Linux Kernel Mailing List,
	Andy Lutomirski, linux-mm, Andrew Morton, linuxppc-dev,
	Thomas Gleixner, Kirill A. Shutemov

"Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> writes:

>> 
>> If it is decided to keep these kind of heuristics, can we get just a
>> small but reasonably precise description of each change to the
>> interface and ways for using the new functionality, such that would be
>> suitable for the man page? I couldn't fix powerpc because nothing
>> matches and even Aneesh and you differ on some details (MAP_FIXED
>> behaviour).
>
>
> I would consider MAP_FIXED as my mistake. We never discussed this 
> explicitly and I kind of assumed it to behave the same way. ie, we 
> search in lower address space (128TB) if the hint addr is below 128TB.
>
> IIUC we agree on the below.
>
> 1) MAP_FIXED allow the addr to be used, even if hint addr is below 128TB 
> but hint_addr + len is > 128TB.

So:
  mmap(0x7ffffffff000, 0x2000, ..., MAP_FIXED ...) = 0x7ffffffff000

> 2) For everything else we search in < 128TB space if hint addr is below 
> 128TB

  mmap((x < 128T), 0x1000, ...) = (y < 128T)
  ...
  mmap(0x7ffffffff000, 0x1000, ...) = 0x7ffffffff000
  mmap(0x800000000000, 0x1000, ...) = 0x800000000000
  ...
  mmap((x >= 128T), 0x1000, ...) = (y >= 128T)

> 3) We don't switch to large address space if hint_addr + len > 128TB. 
> The decision to switch to large address space is primarily based on hint 
> addr

But does the mmap succeed in that case or not?

ie:  mmap(0x7ffffffff000, 0x2000, ...) = ?

cheers

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: POWER: Unexpected fault when writing to brk-allocated memory
  2017-11-07 13:16                           ` Kirill A. Shutemov
  (?)
@ 2017-11-08  6:08                             ` Michael Ellerman
  -1 siblings, 0 replies; 71+ messages in thread
From: Michael Ellerman @ 2017-11-08  6:08 UTC (permalink / raw)
  To: Kirill A. Shutemov, Florian Weimer, Kees Cook
  Cc: linux-arch, Dave Hansen, Peter Zijlstra, Linus Torvalds,
	Ingo Molnar, Linux Kernel Mailing List, Nicholas Piggin,
	Andy Lutomirski, linux-mm, Aneesh Kumar K.V, Andrew Morton,
	linuxppc-dev, Thomas Gleixner, Kirill A. Shutemov

"Kirill A. Shutemov" <kirill@shutemov.name> writes:

> On Tue, Nov 07, 2017 at 02:05:42PM +0100, Florian Weimer wrote:
>> On 11/07/2017 12:44 PM, Kirill A. Shutemov wrote:
>> > On Tue, Nov 07, 2017 at 12:26:12PM +0100, Florian Weimer wrote:
>> > > On 11/07/2017 12:15 PM, Kirill A. Shutemov wrote:
>> > > 
>> > > > > First of all, using addr and MAP_FIXED to develop our heuristic can
>> > > > > never really give unchanged ABI. It's an in-band signal. brk() is a
>> > > > > good example that steadily keeps incrementing address, so depending
>> > > > > on malloc usage and address space randomization, you will get a brk()
>> > > > > that ends exactly at 128T, then the next one will be >
>> > > > > DEFAULT_MAP_WINDOW, and it will switch you to 56 bit address space.
>> > > > 
>> > > > No, it won't. You will hit stack first.
>> > > 
>> > > That's not actually true on POWER in some cases.  See the process maps I
>> > > posted here:
>> > > 
>> > >    <https://marc.info/?l=linuxppc-embedded&m=150988538106263&w=2>
>> > 
>> > Hm? I see that in all three cases the [stack] is the last mapping.
>> > Do I miss something?
>> 
>> Hah, I had not noticed.  Occasionally, the order of heap and stack is
>> reversed.  This happens in approximately 15% of the runs.
>
> Heh. I guess ASLR on Power is too fancy :)

Fancy implies we're doing it on purpose :P

> That's strange layout. It doesn't give that much (relatively speaking)
> virtual address space for both stack and heap to grow.

I'm pretty sure it only happens when you're running an ELF interpreter
directly, because of Kees patch which changed the logic to load ELF
interpreters in the mmap region, vs PIE binaries which go to
ELF_ET_DYN_BASE. (eab09532d400 ("binfmt_elf: use ELF_ET_DYN_BASE only
for PIE"))

It only happens with ASLR enabled. Presumably it's because our brk_rnd()
is overly aggressive in this case, it randomises up to 1GB, and the heap
jumps over the stack.

cheers

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: POWER: Unexpected fault when writing to brk-allocated memory
@ 2017-11-08  6:08                             ` Michael Ellerman
  0 siblings, 0 replies; 71+ messages in thread
From: Michael Ellerman @ 2017-11-08  6:08 UTC (permalink / raw)
  To: Kirill A. Shutemov, Florian Weimer, Kees Cook
  Cc: linux-arch, Dave Hansen, Peter Zijlstra, Linus Torvalds,
	Ingo Molnar, Linux Kernel Mailing List, Nicholas Piggin,
	Andy Lutomirski, linux-mm, Aneesh Kumar K.V, Andrew Morton,
	linuxppc-dev, Thomas Gleixner, Kirill A. Shutemov

"Kirill A. Shutemov" <kirill@shutemov.name> writes:

> On Tue, Nov 07, 2017 at 02:05:42PM +0100, Florian Weimer wrote:
>> On 11/07/2017 12:44 PM, Kirill A. Shutemov wrote:
>> > On Tue, Nov 07, 2017 at 12:26:12PM +0100, Florian Weimer wrote:
>> > > On 11/07/2017 12:15 PM, Kirill A. Shutemov wrote:
>> > > 
>> > > > > First of all, using addr and MAP_FIXED to develop our heuristic can
>> > > > > never really give unchanged ABI. It's an in-band signal. brk() is a
>> > > > > good example that steadily keeps incrementing address, so depending
>> > > > > on malloc usage and address space randomization, you will get a brk()
>> > > > > that ends exactly at 128T, then the next one will be >
>> > > > > DEFAULT_MAP_WINDOW, and it will switch you to 56 bit address space.
>> > > > 
>> > > > No, it won't. You will hit stack first.
>> > > 
>> > > That's not actually true on POWER in some cases.  See the process maps I
>> > > posted here:
>> > > 
>> > >    <https://marc.info/?l=linuxppc-embedded&m=150988538106263&w=2>
>> > 
>> > Hm? I see that in all three cases the [stack] is the last mapping.
>> > Do I miss something?
>> 
>> Hah, I had not noticed.  Occasionally, the order of heap and stack is
>> reversed.  This happens in approximately 15% of the runs.
>
> Heh. I guess ASLR on Power is too fancy :)

Fancy implies we're doing it on purpose :P

> That's strange layout. It doesn't give that much (relatively speaking)
> virtual address space for both stack and heap to grow.

I'm pretty sure it only happens when you're running an ELF interpreter
directly, because of Kees patch which changed the logic to load ELF
interpreters in the mmap region, vs PIE binaries which go to
ELF_ET_DYN_BASE. (eab09532d400 ("binfmt_elf: use ELF_ET_DYN_BASE only
for PIE"))

It only happens with ASLR enabled. Presumably it's because our brk_rnd()
is overly aggressive in this case, it randomises up to 1GB, and the heap
jumps over the stack.

cheers

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: POWER: Unexpected fault when writing to brk-allocated memory
@ 2017-11-08  6:08                             ` Michael Ellerman
  0 siblings, 0 replies; 71+ messages in thread
From: Michael Ellerman @ 2017-11-08  6:08 UTC (permalink / raw)
  To: Kirill A. Shutemov, Florian Weimer, Kees Cook
  Cc: linux-arch, Dave Hansen, Peter Zijlstra, Linus Torvalds,
	Ingo Molnar, Linux Kernel Mailing List, Nicholas Piggin,
	Andy Lutomirski, linux-mm, Aneesh Kumar K.V, Andrew Morton,
	linuxppc-dev, Thomas Gleixner, Kirill A. Shutemov

"Kirill A. Shutemov" <kirill@shutemov.name> writes:

> On Tue, Nov 07, 2017 at 02:05:42PM +0100, Florian Weimer wrote:
>> On 11/07/2017 12:44 PM, Kirill A. Shutemov wrote:
>> > On Tue, Nov 07, 2017 at 12:26:12PM +0100, Florian Weimer wrote:
>> > > On 11/07/2017 12:15 PM, Kirill A. Shutemov wrote:
>> > > 
>> > > > > First of all, using addr and MAP_FIXED to develop our heuristic can
>> > > > > never really give unchanged ABI. It's an in-band signal. brk() is a
>> > > > > good example that steadily keeps incrementing address, so depending
>> > > > > on malloc usage and address space randomization, you will get a brk()
>> > > > > that ends exactly at 128T, then the next one will be >
>> > > > > DEFAULT_MAP_WINDOW, and it will switch you to 56 bit address space.
>> > > > 
>> > > > No, it won't. You will hit stack first.
>> > > 
>> > > That's not actually true on POWER in some cases.  See the process maps I
>> > > posted here:
>> > > 
>> > >    <https://marc.info/?l=linuxppc-embedded&m=150988538106263&w=2>
>> > 
>> > Hm? I see that in all three cases the [stack] is the last mapping.
>> > Do I miss something?
>> 
>> Hah, I had not noticed.  Occasionally, the order of heap and stack is
>> reversed.  This happens in approximately 15% of the runs.
>
> Heh. I guess ASLR on Power is too fancy :)

Fancy implies we're doing it on purpose :P

> That's strange layout. It doesn't give that much (relatively speaking)
> virtual address space for both stack and heap to grow.

I'm pretty sure it only happens when you're running an ELF interpreter
directly, because of Kees patch which changed the logic to load ELF
interpreters in the mmap region, vs PIE binaries which go to
ELF_ET_DYN_BASE. (eab09532d400 ("binfmt_elf: use ELF_ET_DYN_BASE only
for PIE"))

It only happens with ASLR enabled. Presumably it's because our brk_rnd()
is overly aggressive in this case, it randomises up to 1GB, and the heap
jumps over the stack.

cheers

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: POWER: Unexpected fault when writing to brk-allocated memory
  2017-11-08  6:08                             ` Michael Ellerman
@ 2017-11-08  6:18                               ` Florian Weimer
  -1 siblings, 0 replies; 71+ messages in thread
From: Florian Weimer @ 2017-11-08  6:18 UTC (permalink / raw)
  To: Michael Ellerman, Kirill A. Shutemov, Kees Cook
  Cc: linux-arch, Dave Hansen, Peter Zijlstra, Linus Torvalds,
	Ingo Molnar, Linux Kernel Mailing List, Nicholas Piggin,
	Andy Lutomirski, linux-mm, Aneesh Kumar K.V, Andrew Morton,
	linuxppc-dev, Thomas Gleixner, Kirill A. Shutemov

On 11/08/2017 07:08 AM, Michael Ellerman wrote:
> "Kirill A. Shutemov" <kirill@shutemov.name> writes:
> 
>> On Tue, Nov 07, 2017 at 02:05:42PM +0100, Florian Weimer wrote:
>>> On 11/07/2017 12:44 PM, Kirill A. Shutemov wrote:
>>>> On Tue, Nov 07, 2017 at 12:26:12PM +0100, Florian Weimer wrote:
>>>>> On 11/07/2017 12:15 PM, Kirill A. Shutemov wrote:
>>>>>
>>>>>>> First of all, using addr and MAP_FIXED to develop our heuristic can
>>>>>>> never really give unchanged ABI. It's an in-band signal. brk() is a
>>>>>>> good example that steadily keeps incrementing address, so depending
>>>>>>> on malloc usage and address space randomization, you will get a brk()
>>>>>>> that ends exactly at 128T, then the next one will be >
>>>>>>> DEFAULT_MAP_WINDOW, and it will switch you to 56 bit address space.
>>>>>>
>>>>>> No, it won't. You will hit stack first.
>>>>>
>>>>> That's not actually true on POWER in some cases.  See the process maps I
>>>>> posted here:
>>>>>
>>>>>     <https://marc.info/?l=linuxppc-embedded&m=150988538106263&w=2>
>>>>
>>>> Hm? I see that in all three cases the [stack] is the last mapping.
>>>> Do I miss something?
>>>
>>> Hah, I had not noticed.  Occasionally, the order of heap and stack is
>>> reversed.  This happens in approximately 15% of the runs.
>>
>> Heh. I guess ASLR on Power is too fancy :)
> 
> Fancy implies we're doing it on purpose :P
> 
>> That's strange layout. It doesn't give that much (relatively speaking)
>> virtual address space for both stack and heap to grow.
> 
> I'm pretty sure it only happens when you're running an ELF interpreter
> directly, because of Kees patch which changed the logic to load ELF
> interpreters in the mmap region, vs PIE binaries which go to
> ELF_ET_DYN_BASE. (eab09532d400 ("binfmt_elf: use ELF_ET_DYN_BASE only
> for PIE"))

 From that commit:

+                   * There are effectively two types of ET_DYN
+                   * binaries: programs (i.e. PIE: ET_DYN with INTERP)
+                   * and loaders (ET_DYN without INTERP, since they
+                   * _are_ the ELF interpreter). The loaders must

Note that the comment is a bit misleading: statically linked PIE 
binaries are ET_DYN without INTERP, too.

So any oddity which is observable today with an explicitly ld.so 
invocation only will gain more relevance once we get static PIE support 
in user space because it will then affect regular applications, too. 
(Well, statically linked ones.)  In this sense, process layouts which 
cause premature brk failure or insufficient stack allocations are real bugs.

Thanks,
Florian

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: POWER: Unexpected fault when writing to brk-allocated memory
@ 2017-11-08  6:18                               ` Florian Weimer
  0 siblings, 0 replies; 71+ messages in thread
From: Florian Weimer @ 2017-11-08  6:18 UTC (permalink / raw)
  To: Michael Ellerman, Kirill A. Shutemov, Kees Cook
  Cc: linux-arch, Dave Hansen, Peter Zijlstra, Linus Torvalds,
	Ingo Molnar, Linux Kernel Mailing List, Nicholas Piggin,
	Andy Lutomirski, linux-mm, Aneesh Kumar K.V, Andrew Morton,
	linuxppc-dev, Thomas Gleixner, Kirill A. Shutemov

On 11/08/2017 07:08 AM, Michael Ellerman wrote:
> "Kirill A. Shutemov" <kirill@shutemov.name> writes:
> 
>> On Tue, Nov 07, 2017 at 02:05:42PM +0100, Florian Weimer wrote:
>>> On 11/07/2017 12:44 PM, Kirill A. Shutemov wrote:
>>>> On Tue, Nov 07, 2017 at 12:26:12PM +0100, Florian Weimer wrote:
>>>>> On 11/07/2017 12:15 PM, Kirill A. Shutemov wrote:
>>>>>
>>>>>>> First of all, using addr and MAP_FIXED to develop our heuristic can
>>>>>>> never really give unchanged ABI. It's an in-band signal. brk() is a
>>>>>>> good example that steadily keeps incrementing address, so depending
>>>>>>> on malloc usage and address space randomization, you will get a brk()
>>>>>>> that ends exactly at 128T, then the next one will be >
>>>>>>> DEFAULT_MAP_WINDOW, and it will switch you to 56 bit address space.
>>>>>>
>>>>>> No, it won't. You will hit stack first.
>>>>>
>>>>> That's not actually true on POWER in some cases.  See the process maps I
>>>>> posted here:
>>>>>
>>>>>     <https://marc.info/?l=linuxppc-embedded&m=150988538106263&w=2>
>>>>
>>>> Hm? I see that in all three cases the [stack] is the last mapping.
>>>> Do I miss something?
>>>
>>> Hah, I had not noticed.  Occasionally, the order of heap and stack is
>>> reversed.  This happens in approximately 15% of the runs.
>>
>> Heh. I guess ASLR on Power is too fancy :)
> 
> Fancy implies we're doing it on purpose :P
> 
>> That's strange layout. It doesn't give that much (relatively speaking)
>> virtual address space for both stack and heap to grow.
> 
> I'm pretty sure it only happens when you're running an ELF interpreter
> directly, because of Kees patch which changed the logic to load ELF
> interpreters in the mmap region, vs PIE binaries which go to
> ELF_ET_DYN_BASE. (eab09532d400 ("binfmt_elf: use ELF_ET_DYN_BASE only
> for PIE"))

 From that commit:

+                   * There are effectively two types of ET_DYN
+                   * binaries: programs (i.e. PIE: ET_DYN with INTERP)
+                   * and loaders (ET_DYN without INTERP, since they
+                   * _are_ the ELF interpreter). The loaders must

Note that the comment is a bit misleading: statically linked PIE 
binaries are ET_DYN without INTERP, too.

So any oddity which is observable today with an explicitly ld.so 
invocation only will gain more relevance once we get static PIE support 
in user space because it will then affect regular applications, too. 
(Well, statically linked ones.)  In this sense, process layouts which 
cause premature brk failure or insufficient stack allocations are real bugs.

Thanks,
Florian

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: POWER: Unexpected fault when writing to brk-allocated memory
  2017-11-08  4:56                             ` Michael Ellerman
@ 2017-11-08  8:30                               ` Kirill A. Shutemov
  -1 siblings, 0 replies; 71+ messages in thread
From: Kirill A. Shutemov @ 2017-11-08  8:30 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: Aneesh Kumar K.V, Nicholas Piggin, Florian Weimer, linux-arch,
	Dave Hansen, Peter Zijlstra, Linus Torvalds, Ingo Molnar,
	Linux Kernel Mailing List, Andy Lutomirski, linux-mm,
	Andrew Morton, linuxppc-dev, Thomas Gleixner, Kirill A. Shutemov

On Wed, Nov 08, 2017 at 03:56:06PM +1100, Michael Ellerman wrote:
> "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> writes:
> 
> >> 
> >> If it is decided to keep these kind of heuristics, can we get just a
> >> small but reasonably precise description of each change to the
> >> interface and ways for using the new functionality, such that would be
> >> suitable for the man page? I couldn't fix powerpc because nothing
> >> matches and even Aneesh and you differ on some details (MAP_FIXED
> >> behaviour).
> >
> >
> > I would consider MAP_FIXED as my mistake. We never discussed this 
> > explicitly and I kind of assumed it to behave the same way. ie, we 
> > search in lower address space (128TB) if the hint addr is below 128TB.
> >
> > IIUC we agree on the below.
> >
> > 1) MAP_FIXED allow the addr to be used, even if hint addr is below 128TB 
> > but hint_addr + len is > 128TB.
> 
> So:
>   mmap(0x7ffffffff000, 0x2000, ..., MAP_FIXED ...) = 0x7ffffffff000
> 
> > 2) For everything else we search in < 128TB space if hint addr is below 
> > 128TB
> 
>   mmap((x < 128T), 0x1000, ...) = (y < 128T)
>   ...
>   mmap(0x7ffffffff000, 0x1000, ...) = 0x7ffffffff000
>   mmap(0x800000000000, 0x1000, ...) = 0x800000000000
>   ...
>   mmap((x >= 128T), 0x1000, ...) = (y >= 128T)
> 
> > 3) We don't switch to large address space if hint_addr + len > 128TB. 
> > The decision to switch to large address space is primarily based on hint 
> > addr
> 
> But does the mmap succeed in that case or not?
> 
> ie:  mmap(0x7ffffffff000, 0x2000, ...) = ?

It does, but resulting address doesn't match the hint. It's somewhere
below 47-bit border.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: POWER: Unexpected fault when writing to brk-allocated memory
@ 2017-11-08  8:30                               ` Kirill A. Shutemov
  0 siblings, 0 replies; 71+ messages in thread
From: Kirill A. Shutemov @ 2017-11-08  8:30 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: Aneesh Kumar K.V, Nicholas Piggin, Florian Weimer, linux-arch,
	Dave Hansen, Peter Zijlstra, Linus Torvalds, Ingo Molnar,
	Linux Kernel Mailing List, Andy Lutomirski, linux-mm,
	Andrew Morton, linuxppc-dev, Thomas Gleixner, Kirill A. Shutemov

On Wed, Nov 08, 2017 at 03:56:06PM +1100, Michael Ellerman wrote:
> "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> writes:
> 
> >> 
> >> If it is decided to keep these kind of heuristics, can we get just a
> >> small but reasonably precise description of each change to the
> >> interface and ways for using the new functionality, such that would be
> >> suitable for the man page? I couldn't fix powerpc because nothing
> >> matches and even Aneesh and you differ on some details (MAP_FIXED
> >> behaviour).
> >
> >
> > I would consider MAP_FIXED as my mistake. We never discussed this 
> > explicitly and I kind of assumed it to behave the same way. ie, we 
> > search in lower address space (128TB) if the hint addr is below 128TB.
> >
> > IIUC we agree on the below.
> >
> > 1) MAP_FIXED allow the addr to be used, even if hint addr is below 128TB 
> > but hint_addr + len is > 128TB.
> 
> So:
>   mmap(0x7ffffffff000, 0x2000, ..., MAP_FIXED ...) = 0x7ffffffff000
> 
> > 2) For everything else we search in < 128TB space if hint addr is below 
> > 128TB
> 
>   mmap((x < 128T), 0x1000, ...) = (y < 128T)
>   ...
>   mmap(0x7ffffffff000, 0x1000, ...) = 0x7ffffffff000
>   mmap(0x800000000000, 0x1000, ...) = 0x800000000000
>   ...
>   mmap((x >= 128T), 0x1000, ...) = (y >= 128T)
> 
> > 3) We don't switch to large address space if hint_addr + len > 128TB. 
> > The decision to switch to large address space is primarily based on hint 
> > addr
> 
> But does the mmap succeed in that case or not?
> 
> ie:  mmap(0x7ffffffff000, 0x2000, ...) = ?

It does, but resulting address doesn't match the hint. It's somewhere
below 47-bit border.

-- 
 Kirill A. Shutemov

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: POWER: Unexpected fault when writing to brk-allocated memory
  2017-11-07 14:01                             ` Kirill A. Shutemov
@ 2017-11-09 17:15                               ` Nicholas Piggin
  -1 siblings, 0 replies; 71+ messages in thread
From: Nicholas Piggin @ 2017-11-09 17:15 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Aneesh Kumar K.V, Florian Weimer, Kirill A. Shutemov,
	linuxppc-dev, linux-mm, Andrew Morton, Andy Lutomirski,
	Dave Hansen, Linus Torvalds, Peter Zijlstra, Thomas Gleixner,
	linux-arch, Ingo Molnar, Linux Kernel Mailing List

On Tue, 7 Nov 2017 17:01:58 +0300
"Kirill A. Shutemov" <kirill@shutemov.name> wrote:

> On Tue, Nov 07, 2017 at 07:15:58PM +0530, Aneesh Kumar K.V wrote:
> >   
> > > 
> > > If it is decided to keep these kind of heuristics, can we get just a
> > > small but reasonably precise description of each change to the
> > > interface and ways for using the new functionality, such that would be
> > > suitable for the man page? I couldn't fix powerpc because nothing
> > > matches and even Aneesh and you differ on some details (MAP_FIXED
> > > behaviour).  
> > 
> > 
> > I would consider MAP_FIXED as my mistake. We never discussed this explicitly
> > and I kind of assumed it to behave the same way. ie, we search in lower
> > address space (128TB) if the hint addr is below 128TB.
> > 
> > IIUC we agree on the below.
> > 
> > 1) MAP_FIXED allow the addr to be used, even if hint addr is below 128TB but
> > hint_addr + len is > 128TB.
> > 
> > 2) For everything else we search in < 128TB space if hint addr is below
> > 128TB
> > 
> > 3) We don't switch to large address space if hint_addr + len > 128TB. The
> > decision to switch to large address space is primarily based on hint addr
> > 
> > Is there any other rule we need to outline? Or is any of the above not
> > correct?  
> 
> That's correct.
> 

Thanks guys, I'll send out some powerpc patches to match -- it deviates in
its MAP_FIXED handling (treats it the same as !MAP_FIXED).

So these semantics are what we're going with? Anything that does mmap() is
guaranteed of getting a 47-bit pointer and it can use the top 17 bits for
itself? Is intended to be cross-platform or just x86 and power specific?

Also, this may follow from deduction from 1-3, but for explicit
specification in man page:

4) To get an unspecified allocation with the largest possible address range,
we pass in -1 for mmap hint.

Are we allowing 8 bits bits of unused address in this case, or must the
app not assume anything about number of bits used?

Thanks,
Nick

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: POWER: Unexpected fault when writing to brk-allocated memory
@ 2017-11-09 17:15                               ` Nicholas Piggin
  0 siblings, 0 replies; 71+ messages in thread
From: Nicholas Piggin @ 2017-11-09 17:15 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Aneesh Kumar K.V, Florian Weimer, Kirill A. Shutemov,
	linuxppc-dev, linux-mm, Andrew Morton, Andy Lutomirski,
	Dave Hansen, Linus Torvalds, Peter Zijlstra, Thomas Gleixner,
	linux-arch, Ingo Molnar, Linux Kernel Mailing List

On Tue, 7 Nov 2017 17:01:58 +0300
"Kirill A. Shutemov" <kirill@shutemov.name> wrote:

> On Tue, Nov 07, 2017 at 07:15:58PM +0530, Aneesh Kumar K.V wrote:
> >   
> > > 
> > > If it is decided to keep these kind of heuristics, can we get just a
> > > small but reasonably precise description of each change to the
> > > interface and ways for using the new functionality, such that would be
> > > suitable for the man page? I couldn't fix powerpc because nothing
> > > matches and even Aneesh and you differ on some details (MAP_FIXED
> > > behaviour).  
> > 
> > 
> > I would consider MAP_FIXED as my mistake. We never discussed this explicitly
> > and I kind of assumed it to behave the same way. ie, we search in lower
> > address space (128TB) if the hint addr is below 128TB.
> > 
> > IIUC we agree on the below.
> > 
> > 1) MAP_FIXED allow the addr to be used, even if hint addr is below 128TB but
> > hint_addr + len is > 128TB.
> > 
> > 2) For everything else we search in < 128TB space if hint addr is below
> > 128TB
> > 
> > 3) We don't switch to large address space if hint_addr + len > 128TB. The
> > decision to switch to large address space is primarily based on hint addr
> > 
> > Is there any other rule we need to outline? Or is any of the above not
> > correct?  
> 
> That's correct.
> 

Thanks guys, I'll send out some powerpc patches to match -- it deviates in
its MAP_FIXED handling (treats it the same as !MAP_FIXED).

So these semantics are what we're going with? Anything that does mmap() is
guaranteed of getting a 47-bit pointer and it can use the top 17 bits for
itself? Is intended to be cross-platform or just x86 and power specific?

Also, this may follow from deduction from 1-3, but for explicit
specification in man page:

4) To get an unspecified allocation with the largest possible address range,
we pass in -1 for mmap hint.

Are we allowing 8 bits bits of unused address in this case, or must the
app not assume anything about number of bits used?

Thanks,
Nick

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: POWER: Unexpected fault when writing to brk-allocated memory
  2017-11-09 17:15                               ` Nicholas Piggin
@ 2017-11-09 19:44                                 ` Matthew Wilcox
  -1 siblings, 0 replies; 71+ messages in thread
From: Matthew Wilcox @ 2017-11-09 19:44 UTC (permalink / raw)
  To: Nicholas Piggin
  Cc: Kirill A. Shutemov, Aneesh Kumar K.V, Florian Weimer,
	Kirill A. Shutemov, linuxppc-dev, linux-mm, Andrew Morton,
	Andy Lutomirski, Dave Hansen, Linus Torvalds, Peter Zijlstra,
	Thomas Gleixner, linux-arch, Ingo Molnar,
	Linux Kernel Mailing List

On Fri, Nov 10, 2017 at 04:15:26AM +1100, Nicholas Piggin wrote:
> So these semantics are what we're going with? Anything that does mmap() is
> guaranteed of getting a 47-bit pointer and it can use the top 17 bits for
> itself? Is intended to be cross-platform or just x86 and power specific?

It is x86 and powerpc specific.  The arm64 people have apparently stumbled
across apps that expect to be able to use bit 48 for their own purposes.
And their address space is 48 bit by default.  Oops.

> Also, this may follow from deduction from 1-3, but for explicit
> specification in man page:
> 
> 4) To get an unspecified allocation with the largest possible address range,
> we pass in -1 for mmap hint.
> 
> Are we allowing 8 bits bits of unused address in this case, or must the
> app not assume anything about number of bits used?

Maybe document it as: "If the app wants to use the top N bits of addresses
for its own purposes, pass in (~0UL >> N) as the mmap hint."  ?

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: POWER: Unexpected fault when writing to brk-allocated memory
@ 2017-11-09 19:44                                 ` Matthew Wilcox
  0 siblings, 0 replies; 71+ messages in thread
From: Matthew Wilcox @ 2017-11-09 19:44 UTC (permalink / raw)
  To: Nicholas Piggin
  Cc: Kirill A. Shutemov, Aneesh Kumar K.V, Florian Weimer,
	Kirill A. Shutemov, linuxppc-dev, linux-mm, Andrew Morton,
	Andy Lutomirski, Dave Hansen, Linus Torvalds, Peter Zijlstra,
	Thomas Gleixner, linux-arch, Ingo Molnar,
	Linux Kernel Mailing List

On Fri, Nov 10, 2017 at 04:15:26AM +1100, Nicholas Piggin wrote:
> So these semantics are what we're going with? Anything that does mmap() is
> guaranteed of getting a 47-bit pointer and it can use the top 17 bits for
> itself? Is intended to be cross-platform or just x86 and power specific?

It is x86 and powerpc specific.  The arm64 people have apparently stumbled
across apps that expect to be able to use bit 48 for their own purposes.
And their address space is 48 bit by default.  Oops.

> Also, this may follow from deduction from 1-3, but for explicit
> specification in man page:
> 
> 4) To get an unspecified allocation with the largest possible address range,
> we pass in -1 for mmap hint.
> 
> Are we allowing 8 bits bits of unused address in this case, or must the
> app not assume anything about number of bits used?

Maybe document it as: "If the app wants to use the top N bits of addresses
for its own purposes, pass in (~0UL >> N) as the mmap hint."  ?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: POWER: Unexpected fault when writing to brk-allocated memory
  2017-11-09 19:44                                 ` Matthew Wilcox
@ 2017-11-10  1:26                                   ` Nicholas Piggin
  -1 siblings, 0 replies; 71+ messages in thread
From: Nicholas Piggin @ 2017-11-10  1:26 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Kirill A. Shutemov, Aneesh Kumar K.V, Florian Weimer,
	Kirill A. Shutemov, linuxppc-dev, linux-mm, Andrew Morton,
	Andy Lutomirski, Dave Hansen, Linus Torvalds, Peter Zijlstra,
	Thomas Gleixner, linux-arch, Ingo Molnar,
	Linux Kernel Mailing List

On Thu, 9 Nov 2017 11:44:21 -0800
Matthew Wilcox <willy@infradead.org> wrote:

> On Fri, Nov 10, 2017 at 04:15:26AM +1100, Nicholas Piggin wrote:
> > So these semantics are what we're going with? Anything that does mmap() is
> > guaranteed of getting a 47-bit pointer and it can use the top 17 bits for
> > itself? Is intended to be cross-platform or just x86 and power specific?  
> 
> It is x86 and powerpc specific.  The arm64 people have apparently stumbled
> across apps that expect to be able to use bit 48 for their own purposes.
> And their address space is 48 bit by default.  Oops.

Okay, so it's something we should make into an "official" API?

> 
> > Also, this may follow from deduction from 1-3, but for explicit
> > specification in man page:
> > 
> > 4) To get an unspecified allocation with the largest possible address range,
> > we pass in -1 for mmap hint.
> > 
> > Are we allowing 8 bits bits of unused address in this case, or must the
> > app not assume anything about number of bits used?  
> 
> Maybe document it as: "If the app wants to use the top N bits of addresses
> for its own purposes, pass in (~0UL >> N) as the mmap hint."  ?

Well we don't have code for that yet, but the problem would also be that
it succeeds, and actually it probably goes over the limit. So you would
have to map a dummy page there so subsequent hints to fail and fall back.
Not sure... it would be nice to be able to specify number of bits, but I
think this gets a bit hairy. -1 to use all bits seems a bit easier.

Thanks,
Nick

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: POWER: Unexpected fault when writing to brk-allocated memory
@ 2017-11-10  1:26                                   ` Nicholas Piggin
  0 siblings, 0 replies; 71+ messages in thread
From: Nicholas Piggin @ 2017-11-10  1:26 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Kirill A. Shutemov, Aneesh Kumar K.V, Florian Weimer,
	Kirill A. Shutemov, linuxppc-dev, linux-mm, Andrew Morton,
	Andy Lutomirski, Dave Hansen, Linus Torvalds, Peter Zijlstra,
	Thomas Gleixner, linux-arch, Ingo Molnar,
	Linux Kernel Mailing List

On Thu, 9 Nov 2017 11:44:21 -0800
Matthew Wilcox <willy@infradead.org> wrote:

> On Fri, Nov 10, 2017 at 04:15:26AM +1100, Nicholas Piggin wrote:
> > So these semantics are what we're going with? Anything that does mmap() is
> > guaranteed of getting a 47-bit pointer and it can use the top 17 bits for
> > itself? Is intended to be cross-platform or just x86 and power specific?  
> 
> It is x86 and powerpc specific.  The arm64 people have apparently stumbled
> across apps that expect to be able to use bit 48 for their own purposes.
> And their address space is 48 bit by default.  Oops.

Okay, so it's something we should make into an "official" API?

> 
> > Also, this may follow from deduction from 1-3, but for explicit
> > specification in man page:
> > 
> > 4) To get an unspecified allocation with the largest possible address range,
> > we pass in -1 for mmap hint.
> > 
> > Are we allowing 8 bits bits of unused address in this case, or must the
> > app not assume anything about number of bits used?  
> 
> Maybe document it as: "If the app wants to use the top N bits of addresses
> for its own purposes, pass in (~0UL >> N) as the mmap hint."  ?

Well we don't have code for that yet, but the problem would also be that
it succeeds, and actually it probably goes over the limit. So you would
have to map a dummy page there so subsequent hints to fail and fall back.
Not sure... it would be nice to be able to specify number of bits, but I
think this gets a bit hairy. -1 to use all bits seems a bit easier.

Thanks,
Nick

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 71+ messages in thread

* RE: POWER: Unexpected fault when writing to brk-allocated memory
  2017-11-09 19:44                                 ` Matthew Wilcox
@ 2017-11-10 12:08                                   ` David Laight
  -1 siblings, 0 replies; 71+ messages in thread
From: David Laight @ 2017-11-10 12:08 UTC (permalink / raw)
  To: 'Matthew Wilcox', Nicholas Piggin
  Cc: Florian Weimer, linux-arch, linux-mm, Peter Zijlstra,
	linuxppc-dev, Linux Kernel Mailing List, Andy Lutomirski,
	Dave Hansen, Thomas Gleixner, Aneesh Kumar K.V,
	Kirill A. Shutemov, Andrew Morton, Linus Torvalds, Ingo Molnar,
	Kirill A. Shutemov

From: Matthew Wilcox
> Sent: 09 November 2017 19:44
> 
> On Fri, Nov 10, 2017 at 04:15:26AM +1100, Nicholas Piggin wrote:
> > So these semantics are what we're going with? Anything that does mmap() is
> > guaranteed of getting a 47-bit pointer and it can use the top 17 bits for
> > itself? Is intended to be cross-platform or just x86 and power specific?
> 
> It is x86 and powerpc specific.  The arm64 people have apparently stumbled
> across apps that expect to be able to use bit 48 for their own purposes.
> And their address space is 48 bit by default.  Oops.

(Do you mean 49bit?)

Aren't such apps just doomed to be broken?

ISTR there is something on (IIRC) sparc64 that does a 'match'
on the high address bits to make it much harder to overrun
one area into another.

	David

^ permalink raw reply	[flat|nested] 71+ messages in thread

* RE: POWER: Unexpected fault when writing to brk-allocated memory
@ 2017-11-10 12:08                                   ` David Laight
  0 siblings, 0 replies; 71+ messages in thread
From: David Laight @ 2017-11-10 12:08 UTC (permalink / raw)
  To: 'Matthew Wilcox', Nicholas Piggin
  Cc: Florian Weimer, linux-arch, Dave Hansen, Peter Zijlstra,
	Linus Torvalds, Ingo Molnar, Linux Kernel Mailing List,
	Andy Lutomirski, linux-mm, Aneesh Kumar K.V, Kirill A. Shutemov,
	Andrew Morton, linuxppc-dev, Thomas Gleixner, Kirill A. Shutemov

From: Matthew Wilcox
> Sent: 09 November 2017 19:44
>=20
> On Fri, Nov 10, 2017 at 04:15:26AM +1100, Nicholas Piggin wrote:
> > So these semantics are what we're going with? Anything that does mmap()=
 is
> > guaranteed of getting a 47-bit pointer and it can use the top 17 bits f=
or
> > itself? Is intended to be cross-platform or just x86 and power specific=
?
>=20
> It is x86 and powerpc specific.  The arm64 people have apparently stumble=
d
> across apps that expect to be able to use bit 48 for their own purposes.
> And their address space is 48 bit by default.  Oops.

(Do you mean 49bit?)

Aren't such apps just doomed to be broken?

ISTR there is something on (IIRC) sparc64 that does a 'match'
on the high address bits to make it much harder to overrun
one area into another.

	David

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: POWER: Unexpected fault when writing to brk-allocated memory
  2017-11-10 12:08                                   ` David Laight
@ 2017-11-11 10:30                                     ` Nicholas Piggin
  -1 siblings, 0 replies; 71+ messages in thread
From: Nicholas Piggin @ 2017-11-11 10:30 UTC (permalink / raw)
  To: David Laight
  Cc: 'Matthew Wilcox',
	Florian Weimer, linux-arch, Dave Hansen, Peter Zijlstra,
	Linus Torvalds, Ingo Molnar, Linux Kernel Mailing List,
	Andy Lutomirski, linux-mm, Aneesh Kumar K.V, Kirill A. Shutemov,
	Andrew Morton, linuxppc-dev, Thomas Gleixner, Kirill A. Shutemov

On Fri, 10 Nov 2017 12:08:35 +0000
David Laight <David.Laight@ACULAB.COM> wrote:

> From: Matthew Wilcox
> > Sent: 09 November 2017 19:44
> > 
> > On Fri, Nov 10, 2017 at 04:15:26AM +1100, Nicholas Piggin wrote:  
> > > So these semantics are what we're going with? Anything that does mmap() is
> > > guaranteed of getting a 47-bit pointer and it can use the top 17 bits for
> > > itself? Is intended to be cross-platform or just x86 and power specific?  
> > 
> > It is x86 and powerpc specific.  The arm64 people have apparently stumbled
> > across apps that expect to be able to use bit 48 for their own purposes.
> > And their address space is 48 bit by default.  Oops.  
> 
> (Do you mean 49bit?)

I think he meant bit 47, which makes sense because they were probably
ported from x86-64 with 47 bit address. That seems to be why x86-64
5-level and powerpc decided to limit to a 47 bit address space by
default.

> 
> Aren't such apps just doomed to be broken?

Well they're not portable but they are not broken if virtual address
is limited.

> 
> ISTR there is something on (IIRC) sparc64 that does a 'match'
> on the high address bits to make it much harder to overrun
> one area into another.

I'm not sure about that but I think the problem would be the app
masking out bits from the pointer for its own use before ever
dereferencing it.

Thanks,
Nick

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: POWER: Unexpected fault when writing to brk-allocated memory
@ 2017-11-11 10:30                                     ` Nicholas Piggin
  0 siblings, 0 replies; 71+ messages in thread
From: Nicholas Piggin @ 2017-11-11 10:30 UTC (permalink / raw)
  To: David Laight
  Cc: 'Matthew Wilcox',
	Florian Weimer, linux-arch, Dave Hansen, Peter Zijlstra,
	Linus Torvalds, Ingo Molnar, Linux Kernel Mailing List,
	Andy Lutomirski, linux-mm, Aneesh Kumar K.V, Kirill A. Shutemov,
	Andrew Morton, linuxppc-dev, Thomas Gleixner, Kirill A. Shutemov

On Fri, 10 Nov 2017 12:08:35 +0000
David Laight <David.Laight@ACULAB.COM> wrote:

> From: Matthew Wilcox
> > Sent: 09 November 2017 19:44
> > 
> > On Fri, Nov 10, 2017 at 04:15:26AM +1100, Nicholas Piggin wrote:  
> > > So these semantics are what we're going with? Anything that does mmap() is
> > > guaranteed of getting a 47-bit pointer and it can use the top 17 bits for
> > > itself? Is intended to be cross-platform or just x86 and power specific?  
> > 
> > It is x86 and powerpc specific.  The arm64 people have apparently stumbled
> > across apps that expect to be able to use bit 48 for their own purposes.
> > And their address space is 48 bit by default.  Oops.  
> 
> (Do you mean 49bit?)

I think he meant bit 47, which makes sense because they were probably
ported from x86-64 with 47 bit address. That seems to be why x86-64
5-level and powerpc decided to limit to a 47 bit address space by
default.

> 
> Aren't such apps just doomed to be broken?

Well they're not portable but they are not broken if virtual address
is limited.

> 
> ISTR there is something on (IIRC) sparc64 that does a 'match'
> on the high address bits to make it much harder to overrun
> one area into another.

I'm not sure about that but I think the problem would be the app
masking out bits from the pointer for its own use before ever
dereferencing it.

Thanks,
Nick

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 71+ messages in thread

end of thread, other threads:[~2017-11-11 10:31 UTC | newest]

Thread overview: 71+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-11-03 17:05 POWER: Unexpected fault when writing to brk-allocated memory Florian Weimer
2017-11-03 17:05 ` Florian Weimer
2017-11-05 12:18 ` Nicholas Piggin
2017-11-05 12:18   ` Nicholas Piggin
2017-11-05 12:35   ` Florian Weimer
2017-11-05 12:54     ` Nicholas Piggin
2017-11-05 12:54       ` Nicholas Piggin
2017-11-05 14:50   ` Florian Weimer
2017-11-05 14:50     ` Florian Weimer
2017-11-06  6:18   ` Aneesh Kumar K.V
2017-11-06  6:18     ` Aneesh Kumar K.V
2017-11-06  6:47     ` Nicholas Piggin
2017-11-06  6:47       ` Nicholas Piggin
2017-11-06  8:11       ` Florian Weimer
2017-11-06  8:11         ` Florian Weimer
2017-11-06  8:25         ` Nicholas Piggin
2017-11-06  8:25           ` Nicholas Piggin
2017-11-06  8:30           ` Aneesh Kumar K.V
2017-11-06  8:30             ` Aneesh Kumar K.V
2017-11-06  8:32             ` Florian Weimer
2017-11-06  8:32               ` Florian Weimer
2017-11-06 10:20               ` Nicholas Piggin
2017-11-06 10:20                 ` Nicholas Piggin
2017-11-07  5:07               ` Nicholas Piggin
2017-11-07  5:07                 ` Nicholas Piggin
2017-11-07  8:15                 ` Florian Weimer
2017-11-07  8:15                   ` Florian Weimer
2017-11-07  9:24                   ` Nicholas Piggin
2017-11-07  9:24                     ` Nicholas Piggin
2017-11-07 11:16                   ` Kirill A. Shutemov
2017-11-07 11:16                     ` Kirill A. Shutemov
2017-11-07 11:15                 ` Kirill A. Shutemov
2017-11-07 11:15                   ` Kirill A. Shutemov
2017-11-07 11:26                   ` Florian Weimer
2017-11-07 11:26                     ` Florian Weimer
2017-11-07 11:44                     ` Kirill A. Shutemov
2017-11-07 11:44                       ` Kirill A. Shutemov
2017-11-07 13:05                       ` Florian Weimer
2017-11-07 13:16                         ` Kirill A. Shutemov
2017-11-07 13:16                           ` Kirill A. Shutemov
2017-11-08  6:08                           ` Michael Ellerman
2017-11-08  6:08                             ` Michael Ellerman
2017-11-08  6:08                             ` Michael Ellerman
2017-11-08  6:18                             ` Florian Weimer
2017-11-08  6:18                               ` Florian Weimer
2017-11-07 11:56                   ` Nicholas Piggin
2017-11-07 11:56                     ` Nicholas Piggin
2017-11-07 12:28                     ` Kirill A. Shutemov
2017-11-07 12:28                       ` Kirill A. Shutemov
2017-11-07 13:33                       ` Nicholas Piggin
2017-11-07 13:33                         ` Nicholas Piggin
2017-11-07 13:45                         ` Aneesh Kumar K.V
2017-11-07 13:45                           ` Aneesh Kumar K.V
2017-11-07 14:01                           ` Kirill A. Shutemov
2017-11-07 14:01                             ` Kirill A. Shutemov
2017-11-09 17:15                             ` Nicholas Piggin
2017-11-09 17:15                               ` Nicholas Piggin
2017-11-09 19:44                               ` Matthew Wilcox
2017-11-09 19:44                                 ` Matthew Wilcox
2017-11-10  1:26                                 ` Nicholas Piggin
2017-11-10  1:26                                   ` Nicholas Piggin
2017-11-10 12:08                                 ` David Laight
2017-11-10 12:08                                   ` David Laight
2017-11-11 10:30                                   ` Nicholas Piggin
2017-11-11 10:30                                     ` Nicholas Piggin
2017-11-08  4:56                           ` Michael Ellerman
2017-11-08  4:56                             ` Michael Ellerman
2017-11-08  8:30                             ` Kirill A. Shutemov
2017-11-08  8:30                               ` Kirill A. Shutemov
2017-11-06  8:10     ` Florian Weimer
2017-11-06  8:10       ` Florian Weimer

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.