linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Re: POWER: Unexpected fault when writing to brk-allocated memory
       [not found]             ` <546d4155-5b7c-6dba-b642-29c103e336bc@redhat.com>
@ 2017-11-07  5:07               ` Nicholas Piggin
  2017-11-07  8:15                 ` Florian Weimer
  2017-11-07 11:15                 ` Kirill A. Shutemov
  0 siblings, 2 replies; 22+ messages in thread
From: Nicholas Piggin @ 2017-11-07  5:07 UTC (permalink / raw)
  To: Florian Weimer
  Cc: Aneesh Kumar K.V, Kirill A. Shutemov, linuxppc-dev, linux-mm,
	Andrew Morton, Andy Lutomirski, Dave Hansen, Linus Torvalds,
	Peter Zijlstra, Thomas Gleixner, linux-arch, Ingo Molnar,
	Linux Kernel Mailing List

C'ing everyone who was on the x86 56-bit user virtual address patch.

I think we need more time to discuss this behaviour, in light of the
regression Florian uncovered. I would propose we turn off the 56-bit
user virtual address support for x86 for 4.14, and powerpc would
follow and turn off its 512T support until we can get a better handle
on the problems. (Actually Florian initially hit a couple of bugs in
powerpc implementation, but pulling that string uncovers a whole lot
of difficulties.)

The bi-modal behavior switched based on a combination of mmap address
hint and MAP_FIXED just sucks. It's segregating our VA space with
some non-standard heuristics, and it doesn't seem to work very well.

What are we trying to do? Allow SAP HANA etc use huge address spaces
by coding to these specific mmap heuristics we're going to add,
rather than solving it properly in a way that requires adding a new
syscall or personality or prctl or sysctl. Okay, but the cost is that
despite best efforts, it still changes ABI behaviour for existing
applications and these heuristics will become baked into the ABI that
we will have to support. Not a good tradeoff IMO.

First of all, using addr and MAP_FIXED to develop our heuristic can
never really give unchanged ABI. It's an in-band signal. brk() is a
good example that steadily keeps incrementing address, so depending
on malloc usage and address space randomization, you will get a brk()
that ends exactly at 128T, then the next one will be >
DEFAULT_MAP_WINDOW, and it will switch you to 56 bit address space.

Second, the kernel can never completely solve the problem this way.
How do we know a malloc library will not ask for > 128TB addresses
and pass them to an unknowing application?

And lastly, there are a fair few bugs and places where description
in changelogs and mailing lists does not match code. You don't want
to know the mess in powerpc, but even x86 has two I can see:
MAP_FIXED succeeds even when crossing 128TB addresses (where changelog
indicated it should not), arch_get_unmapped_area_topdown() with an
address hint is checking against TASK_SIZE rather than the limited
128TB address, so it looks like it won't follow the heuristics.

So unless everyone else thinks I'm crazy and disagrees, I'd ask for
a bit more time to make sure we get this interface right. I would
hope for something like prctl PR_SET_MM which can be used to set
our user virtual address bits on a fine grained basis. Maybe a
sysctl, maybe a personality. Something out-of-band. I don't wan to
get too far into that discussion yet. First we need to agree whether
or not the code in the tree today is a problem.

Thanks,
Nick

On Mon, 6 Nov 2017 09:32:25 +0100
Florian Weimer <fweimer@redhat.com> wrote:

> On 11/06/2017 09:30 AM, Aneesh Kumar K.V wrote:
> > On 11/06/2017 01:55 PM, Nicholas Piggin wrote:  
> >> On Mon, 6 Nov 2017 09:11:37 +0100
> >> Florian Weimer <fweimer@redhat.com> wrote:
> >>  
> >>> On 11/06/2017 07:47 AM, Nicholas Piggin wrote:  
> >>>> "You get < 128TB unless explicitly requested."
> >>>>
> >>>> Simple, reasonable, obvious rule. Avoids breaking apps that store
> >>>> some bits in the top of pointers (provided that memory allocator
> >>>> userspace libraries also do the right thing).  
> >>>
> >>> So brk would simplify fail instead of crossing the 128 TiB threshold?  
> >>
> >> Yes, that was the intention and that's what x86 seems to do.
> >>  
> >>>
> >>> glibc malloc should cope with that and switch to malloc, but this code
> >>> path is obviously less well-tested than the regular way.  
> >>
> >> Switch to mmap() I guess you meant?  
> 
> Yes, sorry.
> 
> >> powerpc has a couple of bugs in corner cases, so those should be fixed
> >> according to intended policy for stable kernels I think.
> >>
> >> But I question the policy. Just seems like an ugly and ineffective wart.
> >> Exactly for such cases as this -- behaviour would change from run to run
> >> depending on your address space randomization for example! In case your
> >> brk happens to land nicely on 128TB then the next one would succeed.  
> > 
> > Why ? It should not change between run to run. We limit the free
> > area search range based on hint address. So we should get consistent 
> > results across run. even if we changed the context.addr_limit.  
> 
> The size of the gap to the 128 TiB limit varies between runs because of 
> ASLR.  So some runs would use brk alone, others would use brk + malloc. 
> That's not really desirable IMHO.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: POWER: Unexpected fault when writing to brk-allocated memory
  2017-11-07  5:07               ` POWER: Unexpected fault when writing to brk-allocated memory Nicholas Piggin
@ 2017-11-07  8:15                 ` Florian Weimer
  2017-11-07  9:24                   ` Nicholas Piggin
  2017-11-07 11:16                   ` Kirill A. Shutemov
  2017-11-07 11:15                 ` Kirill A. Shutemov
  1 sibling, 2 replies; 22+ messages in thread
From: Florian Weimer @ 2017-11-07  8:15 UTC (permalink / raw)
  To: Nicholas Piggin
  Cc: Aneesh Kumar K.V, Kirill A. Shutemov, linuxppc-dev, linux-mm,
	Andrew Morton, Andy Lutomirski, Dave Hansen, Linus Torvalds,
	Peter Zijlstra, Thomas Gleixner, linux-arch, Ingo Molnar,
	Linux Kernel Mailing List

On 11/07/2017 06:07 AM, Nicholas Piggin wrote:

> First of all, using addr and MAP_FIXED to develop our heuristic can
> never really give unchanged ABI. It's an in-band signal. brk() is a
> good example that steadily keeps incrementing address, so depending
> on malloc usage and address space randomization, you will get a brk()
> that ends exactly at 128T, then the next one will be >
> DEFAULT_MAP_WINDOW, and it will switch you to 56 bit address space.

Note that this brk phenomenon is only a concern for some currently 
obscure process memory layouts where the heap ends up at the top of the 
address space.  Usually, there is something above it which eliminates 
the possibility that it can cross into the 128 TiB wilderness.  So the 
brk problem only happens on some architectures (e.g., not x86-64), and 
only with strange ways of running programs (explicitly ld.so invocation 
and likely static PIE, too).

> So unless everyone else thinks I'm crazy and disagrees, I'd ask for
> a bit more time to make sure we get this interface right. I would
> hope for something like prctl PR_SET_MM which can be used to set
> our user virtual address bits on a fine grained basis. Maybe a
> sysctl, maybe a personality. Something out-of-band. I don't wan to
> get too far into that discussion yet. First we need to agree whether
> or not the code in the tree today is a problem.

There is certainly more demand for similar functionality, like creating 
mappings below 2 GB/4 GB/32 GB, and probably other bit patterns. 
Hotspot would use this to place the heap with compressed oops, instead 
of manually hunting for a suitable place for the mapping.  (Essentially, 
32-bit pointers on 64-bit architectures for sufficiently small heap 
sizes.)  It would perhaps be possible to use the hints address as a 
source of the bit count, for full flexibility.  And the mapping should 
be placed into the upper half of the selected window if possible.

MAP_FIXED is near-impossible to use correctly.  I hope you don't expect 
applications to do that.  If you want address-based opt in, it should 
work without MAP_FIXED.  Sure, in obscure cases, applications might 
still see out-of-range addresses, but I expected a full opt-out based on 
RLIMIT_AS would be sufficient for them.

Thanks,
Florian

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: POWER: Unexpected fault when writing to brk-allocated memory
  2017-11-07  8:15                 ` Florian Weimer
@ 2017-11-07  9:24                   ` Nicholas Piggin
  2017-11-07 11:16                   ` Kirill A. Shutemov
  1 sibling, 0 replies; 22+ messages in thread
From: Nicholas Piggin @ 2017-11-07  9:24 UTC (permalink / raw)
  To: Florian Weimer
  Cc: Aneesh Kumar K.V, Kirill A. Shutemov, linuxppc-dev, linux-mm,
	Andrew Morton, Andy Lutomirski, Dave Hansen, Linus Torvalds,
	Peter Zijlstra, Thomas Gleixner, linux-arch, Ingo Molnar,
	Linux Kernel Mailing List

On Tue, 7 Nov 2017 09:15:21 +0100
Florian Weimer <fweimer@redhat.com> wrote:

> On 11/07/2017 06:07 AM, Nicholas Piggin wrote:
> 
> > First of all, using addr and MAP_FIXED to develop our heuristic can
> > never really give unchanged ABI. It's an in-band signal. brk() is a
> > good example that steadily keeps incrementing address, so depending
> > on malloc usage and address space randomization, you will get a brk()
> > that ends exactly at 128T, then the next one will be >
> > DEFAULT_MAP_WINDOW, and it will switch you to 56 bit address space.  
> 
> Note that this brk phenomenon is only a concern for some currently 
> obscure process memory layouts where the heap ends up at the top of the 
> address space.  Usually, there is something above it which eliminates 
> the possibility that it can cross into the 128 TiB wilderness.  So the 
> brk problem only happens on some architectures (e.g., not x86-64), and 
> only with strange ways of running programs (explicitly ld.so invocation 
> and likely static PIE, too).

That's true, but there was an ABI change and the result is that it
changed behaviour.

And actually if it were not for a powerpc bug that caused a segfault,
the allocation would have worked, and you probably would never have
filed the bug. However the program would have been given a pointer >
128TB, which is what we were trying to avoid -- if your app also put
metadata in the top of the pointers, it would have broken.

So, are any obscure apps that could break? I don't know. Maybe it's
quite unlikely. Is that enough to go ahead with changing behaviour?

I'm not going to put up a big fight about this if others feel it's
not a problem. I raise it because in debugging this and seeing the
changed behaviour, the differences between implementations (x86,
powerpc hash, and powerpc radix, all have different edge case
behaviour), and that none of them seem to conform exactly to the
heuristics described in the changelogs, and there seems to be no man
page updates that I can see, it raises some red flags.

So I'm calling for one last opportunity to 1) agree on the desired
behaviour, and 2) ensure implementations are conforming.

> > So unless everyone else thinks I'm crazy and disagrees, I'd ask for
> > a bit more time to make sure we get this interface right. I would
> > hope for something like prctl PR_SET_MM which can be used to set
> > our user virtual address bits on a fine grained basis. Maybe a
> > sysctl, maybe a personality. Something out-of-band. I don't wan to
> > get too far into that discussion yet. First we need to agree whether
> > or not the code in the tree today is a problem.  
> 
> There is certainly more demand for similar functionality, like creating 
> mappings below 2 GB/4 GB/32 GB, and probably other bit patterns. 
> Hotspot would use this to place the heap with compressed oops, instead 
> of manually hunting for a suitable place for the mapping.  (Essentially, 
> 32-bit pointers on 64-bit architectures for sufficiently small heap 
> sizes.)  It would perhaps be possible to use the hints address as a 
> source of the bit count, for full flexibility.  And the mapping should 
> be placed into the upper half of the selected window if possible.
> 
> MAP_FIXED is near-impossible to use correctly.  I hope you don't expect 
> applications to do that.  If you want address-based opt in, it should 
> work without MAP_FIXED.  Sure, in obscure cases, applications might 
> still see out-of-range addresses, but I expected a full opt-out based on 
> RLIMIT_AS would be sufficient for them.

All good points, and no I don't think MAP_FIXED is a substitute.

I don't want to get too far ahead of ourselves, but I don't see why
some new interfaces with reasonably flexible and extensible ways to
specify VA behaviour is a bad idea. We already went through similar
with the ADDR_COMPAT_LAYOUT, and some various address space
randomization options, and now this, and they're all different...
As you noted, for many years now with compressed pointers and data
in pointers, address space bits have *mattered* to applications and
we can't change that.

Surely we're going to have to solve it *properly* sooner or later,
why not now? We could do both existing heuristic *and* make nicer
interfaces, but the >128TB support seems like a good place to do it
because very few apps will have to change code, and we can leave
behaviour exactly unchanged for 99.99999% that will never need so
much memory.

Thanks,
Nick

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: POWER: Unexpected fault when writing to brk-allocated memory
  2017-11-07  5:07               ` POWER: Unexpected fault when writing to brk-allocated memory Nicholas Piggin
  2017-11-07  8:15                 ` Florian Weimer
@ 2017-11-07 11:15                 ` Kirill A. Shutemov
  2017-11-07 11:26                   ` Florian Weimer
  2017-11-07 11:56                   ` Nicholas Piggin
  1 sibling, 2 replies; 22+ messages in thread
From: Kirill A. Shutemov @ 2017-11-07 11:15 UTC (permalink / raw)
  To: Nicholas Piggin
  Cc: Florian Weimer, Aneesh Kumar K.V, Kirill A. Shutemov,
	linuxppc-dev, linux-mm, Andrew Morton, Andy Lutomirski,
	Dave Hansen, Linus Torvalds, Peter Zijlstra, Thomas Gleixner,
	linux-arch, Ingo Molnar, Linux Kernel Mailing List

On Tue, Nov 07, 2017 at 04:07:05PM +1100, Nicholas Piggin wrote:
> C'ing everyone who was on the x86 56-bit user virtual address patch.
> 
> I think we need more time to discuss this behaviour, in light of the
> regression Florian uncovered. I would propose we turn off the 56-bit
> user virtual address support for x86 for 4.14, and powerpc would
> follow and turn off its 512T support until we can get a better handle
> on the problems. (Actually Florian initially hit a couple of bugs in
> powerpc implementation, but pulling that string uncovers a whole lot
> of difficulties.)
> 
> The bi-modal behavior switched based on a combination of mmap address
> hint and MAP_FIXED just sucks. It's segregating our VA space with
> some non-standard heuristics, and it doesn't seem to work very well.
> 
> What are we trying to do? Allow SAP HANA etc use huge address spaces
> by coding to these specific mmap heuristics we're going to add,
> rather than solving it properly in a way that requires adding a new
> syscall or personality or prctl or sysctl. Okay, but the cost is that
> despite best efforts, it still changes ABI behaviour for existing
> applications and these heuristics will become baked into the ABI that
> we will have to support. Not a good tradeoff IMO.
> 
> First of all, using addr and MAP_FIXED to develop our heuristic can
> never really give unchanged ABI. It's an in-band signal. brk() is a
> good example that steadily keeps incrementing address, so depending
> on malloc usage and address space randomization, you will get a brk()
> that ends exactly at 128T, then the next one will be >
> DEFAULT_MAP_WINDOW, and it will switch you to 56 bit address space.

No, it won't. You will hit stack first.

> Second, the kernel can never completely solve the problem this way.
> How do we know a malloc library will not ask for > 128TB addresses
> and pass them to an unknowing application?

The idea is that an application can provide hint (mallopt() ?) to malloc
implementation that it's ready to full address space. In this case, malloc
can use mmap((void *) -1,...) for its allocations and get full address
space this way.

> And lastly, there are a fair few bugs and places where description
> in changelogs and mailing lists does not match code. You don't want
> to know the mess in powerpc, but even x86 has two I can see:
> MAP_FIXED succeeds even when crossing 128TB addresses (where changelog
> indicated it should not),

Hm. I don't see where the changelog indicated that MAP_FIXED across 128TB
shouldn't work. My intention was that it should, although I haven't stated
it in the changelog.

The idea was we shouldn't allow to slip above 47-bits by accidentally.

Correctly functioning program would never request addr+len above 47-bit
with MAP_FIXED, unless it's ready to handle such addresses. Otherwise the
request would simply fail on machine that doesn't support large VA.

In contrast, addr+len above 47-bit without MAP_FIXED will not fail on
machine that doesn't support large VA, kernel will find another place
under 47-bit. And I can imagine a reasonable application that does
something like this.

So we cannot rely that application is ready to handle large
addresses if we see addr+len without MAP_FIXED.

> arch_get_unmapped_area_topdown() with an address hint is checking
> against TASK_SIZE rather than the limited 128TB address, so it looks
> like it won't follow the heuristics.

You are right. This is broken. If user would request mapping above vdso,
but below DEFAULT_MAP_WINDOW it will succeed.

I'll send patch to fix this. But it doesn't look as a show-stopper to me.

Re-checking things for this reply I found actual bug, see:

http://lkml.kernel.org/r/20171107103804.47341-1-kirill.shutemov@linux.intel.com

> So unless everyone else thinks I'm crazy and disagrees, I'd ask for
> a bit more time to make sure we get this interface right. I would
> hope for something like prctl PR_SET_MM which can be used to set
> our user virtual address bits on a fine grained basis. Maybe a
> sysctl, maybe a personality. Something out-of-band. I don't wan to
> get too far into that discussion yet. First we need to agree whether
> or not the code in the tree today is a problem.

Well, we've discussed before all options you are proposing.
Linus wanted a minimalistic interface, so we took this path for now.
We can always add more ways to get access to full address space later.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: POWER: Unexpected fault when writing to brk-allocated memory
  2017-11-07  8:15                 ` Florian Weimer
  2017-11-07  9:24                   ` Nicholas Piggin
@ 2017-11-07 11:16                   ` Kirill A. Shutemov
  1 sibling, 0 replies; 22+ messages in thread
From: Kirill A. Shutemov @ 2017-11-07 11:16 UTC (permalink / raw)
  To: Florian Weimer
  Cc: Nicholas Piggin, Aneesh Kumar K.V, Kirill A. Shutemov,
	linuxppc-dev, linux-mm, Andrew Morton, Andy Lutomirski,
	Dave Hansen, Linus Torvalds, Peter Zijlstra, Thomas Gleixner,
	linux-arch, Ingo Molnar, Linux Kernel Mailing List

On Tue, Nov 07, 2017 at 09:15:21AM +0100, Florian Weimer wrote:
> MAP_FIXED is near-impossible to use correctly.  I hope you don't expect
> applications to do that.  If you want address-based opt in, it should work
> without MAP_FIXED.  Sure, in obscure cases, applications might still see
> out-of-range addresses, but I expected a full opt-out based on RLIMIT_AS
> would be sufficient for them.

Just use mmap(-1), without MAP_FIXED to get full address space.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: POWER: Unexpected fault when writing to brk-allocated memory
  2017-11-07 11:15                 ` Kirill A. Shutemov
@ 2017-11-07 11:26                   ` Florian Weimer
  2017-11-07 11:44                     ` Kirill A. Shutemov
  2017-11-07 11:56                   ` Nicholas Piggin
  1 sibling, 1 reply; 22+ messages in thread
From: Florian Weimer @ 2017-11-07 11:26 UTC (permalink / raw)
  To: Kirill A. Shutemov, Nicholas Piggin
  Cc: Aneesh Kumar K.V, Kirill A. Shutemov, linuxppc-dev, linux-mm,
	Andrew Morton, Andy Lutomirski, Dave Hansen, Linus Torvalds,
	Peter Zijlstra, Thomas Gleixner, linux-arch, Ingo Molnar,
	Linux Kernel Mailing List

On 11/07/2017 12:15 PM, Kirill A. Shutemov wrote:

>> First of all, using addr and MAP_FIXED to develop our heuristic can
>> never really give unchanged ABI. It's an in-band signal. brk() is a
>> good example that steadily keeps incrementing address, so depending
>> on malloc usage and address space randomization, you will get a brk()
>> that ends exactly at 128T, then the next one will be >
>> DEFAULT_MAP_WINDOW, and it will switch you to 56 bit address space.
> 
> No, it won't. You will hit stack first.

That's not actually true on POWER in some cases.  See the process maps I 
posted here:

   <https://marc.info/?l=linuxppc-embedded&m=150988538106263&w=2>

Thanks,
Florian

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: POWER: Unexpected fault when writing to brk-allocated memory
  2017-11-07 11:26                   ` Florian Weimer
@ 2017-11-07 11:44                     ` Kirill A. Shutemov
  2017-11-07 13:05                       ` Florian Weimer
  0 siblings, 1 reply; 22+ messages in thread
From: Kirill A. Shutemov @ 2017-11-07 11:44 UTC (permalink / raw)
  To: Florian Weimer
  Cc: Nicholas Piggin, Aneesh Kumar K.V, Kirill A. Shutemov,
	linuxppc-dev, linux-mm, Andrew Morton, Andy Lutomirski,
	Dave Hansen, Linus Torvalds, Peter Zijlstra, Thomas Gleixner,
	linux-arch, Ingo Molnar, Linux Kernel Mailing List

On Tue, Nov 07, 2017 at 12:26:12PM +0100, Florian Weimer wrote:
> On 11/07/2017 12:15 PM, Kirill A. Shutemov wrote:
> 
> > > First of all, using addr and MAP_FIXED to develop our heuristic can
> > > never really give unchanged ABI. It's an in-band signal. brk() is a
> > > good example that steadily keeps incrementing address, so depending
> > > on malloc usage and address space randomization, you will get a brk()
> > > that ends exactly at 128T, then the next one will be >
> > > DEFAULT_MAP_WINDOW, and it will switch you to 56 bit address space.
> > 
> > No, it won't. You will hit stack first.
> 
> That's not actually true on POWER in some cases.  See the process maps I
> posted here:
> 
>   <https://marc.info/?l=linuxppc-embedded&m=150988538106263&w=2>

Hm? I see that in all three cases the [stack] is the last mapping.
Do I miss something?

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: POWER: Unexpected fault when writing to brk-allocated memory
  2017-11-07 11:15                 ` Kirill A. Shutemov
  2017-11-07 11:26                   ` Florian Weimer
@ 2017-11-07 11:56                   ` Nicholas Piggin
  2017-11-07 12:28                     ` Kirill A. Shutemov
  1 sibling, 1 reply; 22+ messages in thread
From: Nicholas Piggin @ 2017-11-07 11:56 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Florian Weimer, Aneesh Kumar K.V, Kirill A. Shutemov,
	linuxppc-dev, linux-mm, Andrew Morton, Andy Lutomirski,
	Dave Hansen, Linus Torvalds, Peter Zijlstra, Thomas Gleixner,
	linux-arch, Ingo Molnar, Linux Kernel Mailing List

On Tue, 7 Nov 2017 14:15:43 +0300
"Kirill A. Shutemov" <kirill@shutemov.name> wrote:

> On Tue, Nov 07, 2017 at 04:07:05PM +1100, Nicholas Piggin wrote:
> > C'ing everyone who was on the x86 56-bit user virtual address patch.
> > 
> > I think we need more time to discuss this behaviour, in light of the
> > regression Florian uncovered. I would propose we turn off the 56-bit
> > user virtual address support for x86 for 4.14, and powerpc would
> > follow and turn off its 512T support until we can get a better handle
> > on the problems. (Actually Florian initially hit a couple of bugs in
> > powerpc implementation, but pulling that string uncovers a whole lot
> > of difficulties.)
> > 
> > The bi-modal behavior switched based on a combination of mmap address
> > hint and MAP_FIXED just sucks. It's segregating our VA space with
> > some non-standard heuristics, and it doesn't seem to work very well.
> > 
> > What are we trying to do? Allow SAP HANA etc use huge address spaces
> > by coding to these specific mmap heuristics we're going to add,
> > rather than solving it properly in a way that requires adding a new
> > syscall or personality or prctl or sysctl. Okay, but the cost is that
> > despite best efforts, it still changes ABI behaviour for existing
> > applications and these heuristics will become baked into the ABI that
> > we will have to support. Not a good tradeoff IMO.
> > 
> > First of all, using addr and MAP_FIXED to develop our heuristic can
> > never really give unchanged ABI. It's an in-band signal. brk() is a
> > good example that steadily keeps incrementing address, so depending
> > on malloc usage and address space randomization, you will get a brk()
> > that ends exactly at 128T, then the next one will be >
> > DEFAULT_MAP_WINDOW, and it will switch you to 56 bit address space.  
> 
> No, it won't. You will hit stack first.

I guess so. Florian's bug didn't crash there for some reason, okay
but I suppose my point about brk is not exactly where the standard
heap is, but the pattern of allocations. An allocator that uses
mmap for managing its address space might do the same thing, e.g.,
incrementally expand existing mmaps as necessary.

> > Second, the kernel can never completely solve the problem this way.
> > How do we know a malloc library will not ask for > 128TB addresses
> > and pass them to an unknowing application?  
> 
> The idea is that an application can provide hint (mallopt() ?) to malloc
> implementation that it's ready to full address space. In this case, malloc
> can use mmap((void *) -1,...) for its allocations and get full address
> space this way.

Point is, there's nothing stopping an allocator library or runtime
from asking for mmap anywhere and returning it to userspace.

Do > 128TB pointers matter so much that we should add this heuristic
to prevent breakage, but little enough that we can accept some rare
cases getting through? Genuine question.

> 
> > And lastly, there are a fair few bugs and places where description
> > in changelogs and mailing lists does not match code. You don't want
> > to know the mess in powerpc, but even x86 has two I can see:
> > MAP_FIXED succeeds even when crossing 128TB addresses (where changelog
> > indicated it should not),  
> 
> Hm. I don't see where the changelog indicated that MAP_FIXED across 128TB
> shouldn't work. My intention was that it should, although I haven't stated
> it in the changelog.

    To mitigate this, we are not going to allocate virtual address space
    above 47-bit by default.
    
    But userspace can ask for allocation from full address space by
    specifying hint address (with or without MAP_FIXED) above 47-bits.

Yet we got 48 bit address with 47 bit address (with MAP_FIXED).

> 
> The idea was we shouldn't allow to slip above 47-bits by accidentally.
> 
> Correctly functioning program would never request addr+len above 47-bit
> with MAP_FIXED, unless it's ready to handle such addresses. Otherwise the
> request would simply fail on machine that doesn't support large VA.
> 
> In contrast, addr+len above 47-bit without MAP_FIXED will not fail on
> machine that doesn't support large VA, kernel will find another place
> under 47-bit. And I can imagine a reasonable application that does
> something like this.
> 
> So we cannot rely that application is ready to handle large
> addresses if we see addr+len without MAP_FIXED.

By the same logic, a request for addr > 128TB without MAP_FIXED will
not fail, therefore we can't rely on that either.

Or an app that links to a library that attempts MAP_FIXED allocation
of addr + len above 128TB might use high bits of pointer returned by
that library because those are never satisfied today and the library
would fall back.

> 
> > arch_get_unmapped_area_topdown() with an address hint is checking
> > against TASK_SIZE rather than the limited 128TB address, so it looks
> > like it won't follow the heuristics.  
> 
> You are right. This is broken. If user would request mapping above vdso,
> but below DEFAULT_MAP_WINDOW it will succeed.
> 
> I'll send patch to fix this. But it doesn't look as a show-stopper to me.
> 
> Re-checking things for this reply I found actual bug, see:
> 
> http://lkml.kernel.org/r/20171107103804.47341-1-kirill.shutemov@linux.intel.com

It's not a show stopper per se, and of course I don't expect code to be
bug free, it's just concerning that we haven't had any kind of regression
testing or observations that have checked for this.

> 
> > So unless everyone else thinks I'm crazy and disagrees, I'd ask for
> > a bit more time to make sure we get this interface right. I would
> > hope for something like prctl PR_SET_MM which can be used to set
> > our user virtual address bits on a fine grained basis. Maybe a
> > sysctl, maybe a personality. Something out-of-band. I don't wan to
> > get too far into that discussion yet. First we need to agree whether
> > or not the code in the tree today is a problem.  
> 
> Well, we've discussed before all options you are proposing.
> Linus wanted a minimalistic interface, so we took this path for now.
> We can always add more ways to get access to full address space later.
> 

Right, I'm just raising it again with additional justifications. I don't
think it's the right way to go or has to be rushed now.

Thanks,
Nick

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: POWER: Unexpected fault when writing to brk-allocated memory
  2017-11-07 11:56                   ` Nicholas Piggin
@ 2017-11-07 12:28                     ` Kirill A. Shutemov
  2017-11-07 13:33                       ` Nicholas Piggin
  0 siblings, 1 reply; 22+ messages in thread
From: Kirill A. Shutemov @ 2017-11-07 12:28 UTC (permalink / raw)
  To: Nicholas Piggin
  Cc: Florian Weimer, Aneesh Kumar K.V, Kirill A. Shutemov,
	linuxppc-dev, linux-mm, Andrew Morton, Andy Lutomirski,
	Dave Hansen, Linus Torvalds, Peter Zijlstra, Thomas Gleixner,
	linux-arch, Ingo Molnar, Linux Kernel Mailing List

On Tue, Nov 07, 2017 at 10:56:36PM +1100, Nicholas Piggin wrote:
> > No, it won't. You will hit stack first.
> 
> I guess so. Florian's bug didn't crash there for some reason, okay
> but I suppose my point about brk is not exactly where the standard
> heap is, but the pattern of allocations. An allocator that uses
> mmap for managing its address space might do the same thing, e.g.,
> incrementally expand existing mmaps as necessary.

With MAP_FIXED? I don't think so.

> > > Second, the kernel can never completely solve the problem this way.
> > > How do we know a malloc library will not ask for > 128TB addresses
> > > and pass them to an unknowing application?  
> > 
> > The idea is that an application can provide hint (mallopt() ?) to malloc
> > implementation that it's ready to full address space. In this case, malloc
> > can use mmap((void *) -1,...) for its allocations and get full address
> > space this way.
> 
> Point is, there's nothing stopping an allocator library or runtime
> from asking for mmap anywhere and returning it to userspace.

Right. Nobody would stop it from doing stupid things. There are many
things that a library may do that application would not be happy about.

> Do > 128TB pointers matter so much that we should add this heuristic
> to prevent breakage, but little enough that we can accept some rare
> cases getting through? Genuine question.

At the end of the day what matters is if heuristic helps prevent breakage
of existing userspace and doesn't stay in the way of legitimate use of
full address space.

So far, it looks okay to me.

> > The idea was we shouldn't allow to slip above 47-bits by accidentally.
> > 
> > Correctly functioning program would never request addr+len above 47-bit
> > with MAP_FIXED, unless it's ready to handle such addresses. Otherwise the
> > request would simply fail on machine that doesn't support large VA.
> > 
> > In contrast, addr+len above 47-bit without MAP_FIXED will not fail on
> > machine that doesn't support large VA, kernel will find another place
> > under 47-bit. And I can imagine a reasonable application that does
> > something like this.
> > 
> > So we cannot rely that application is ready to handle large
> > addresses if we see addr+len without MAP_FIXED.
> 
> By the same logic, a request for addr > 128TB without MAP_FIXED will
> not fail, therefore we can't rely on that either.
> 
> Or an app that links to a library that attempts MAP_FIXED allocation
> of addr + len above 128TB might use high bits of pointer returned by
> that library because those are never satisfied today and the library
> would fall back.

If you want to point that it's ABI break, yes it is.

But we allow ABI break as long as nobody notices. I think it's reasonable
to expect that nobody relies on such corner cases.

If we would find any piece of software affect by the change we would need
to reconsider.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: POWER: Unexpected fault when writing to brk-allocated memory
  2017-11-07 11:44                     ` Kirill A. Shutemov
@ 2017-11-07 13:05                       ` Florian Weimer
  2017-11-07 13:16                         ` Kirill A. Shutemov
  0 siblings, 1 reply; 22+ messages in thread
From: Florian Weimer @ 2017-11-07 13:05 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Nicholas Piggin, Aneesh Kumar K.V, Kirill A. Shutemov,
	linuxppc-dev, linux-mm, Andrew Morton, Andy Lutomirski,
	Dave Hansen, Linus Torvalds, Peter Zijlstra, Thomas Gleixner,
	linux-arch, Ingo Molnar, Linux Kernel Mailing List

[-- Attachment #1: Type: text/plain, Size: 1083 bytes --]

On 11/07/2017 12:44 PM, Kirill A. Shutemov wrote:
> On Tue, Nov 07, 2017 at 12:26:12PM +0100, Florian Weimer wrote:
>> On 11/07/2017 12:15 PM, Kirill A. Shutemov wrote:
>>
>>>> First of all, using addr and MAP_FIXED to develop our heuristic can
>>>> never really give unchanged ABI. It's an in-band signal. brk() is a
>>>> good example that steadily keeps incrementing address, so depending
>>>> on malloc usage and address space randomization, you will get a brk()
>>>> that ends exactly at 128T, then the next one will be >
>>>> DEFAULT_MAP_WINDOW, and it will switch you to 56 bit address space.
>>>
>>> No, it won't. You will hit stack first.
>>
>> That's not actually true on POWER in some cases.  See the process maps I
>> posted here:
>>
>>    <https://marc.info/?l=linuxppc-embedded&m=150988538106263&w=2>
> 
> Hm? I see that in all three cases the [stack] is the last mapping.
> Do I miss something?

Hah, I had not noticed.  Occasionally, the order of heap and stack is 
reversed.  This happens in approximately 15% of the runs.

See the attached example.

Thanks,
Florian

[-- Attachment #2: maps.txt --]
[-- Type: text/plain, Size: 2684 bytes --]

7fffacc50000-7fffacc90000 rw-p 00000000 00:00 0 
7fffacc90000-7fffaccf0000 r--p 00000000 fd:00 25167925                   /usr/lib/locale/en_US.utf8/LC_CTYPE
7fffaccf0000-7fffacd00000 r--p 00000000 fd:00 25167928                   /usr/lib/locale/en_US.utf8/LC_NUMERIC
7fffacd00000-7fffacd10000 r--p 00000000 fd:00 16798929                   /usr/lib/locale/en_US.utf8/LC_TIME
7fffacd10000-7ffface40000 r--p 00000000 fd:00 25167924                   /usr/lib/locale/en_US.utf8/LC_COLLATE
7ffface40000-7ffface50000 r--p 00000000 fd:00 16798927                   /usr/lib/locale/en_US.utf8/LC_MONETARY
7ffface50000-7ffface60000 r--p 00000000 fd:00 2511                       /usr/lib/locale/en_US.utf8/LC_MESSAGES/SYS_LC_MESSAGES
7ffface60000-7ffface70000 r--p 00000000 fd:00 16798942                   /usr/lib/locale/en_US.utf8/LC_PAPER
7ffface70000-7ffface80000 r--p 00000000 fd:00 25167927                   /usr/lib/locale/en_US.utf8/LC_NAME
7ffface80000-7ffface90000 r--p 00000000 fd:00 16798924                   /usr/lib/locale/en_US.utf8/LC_ADDRESS
7ffface90000-7fffacea0000 r--p 00000000 fd:00 16798928                   /usr/lib/locale/en_US.utf8/LC_TELEPHONE
7fffacea0000-7fffaceb0000 r--p 00000000 fd:00 16798926                   /usr/lib/locale/en_US.utf8/LC_MEASUREMENT
7fffaceb0000-7fffacec0000 r--s 00000000 fd:00 8390657                    /usr/lib64/gconv/gconv-modules.cache
7fffacec0000-7fffad0d0000 r-xp 00000000 fd:00 8390335                    /usr/lib64/libc-2.25.so
7fffad0d0000-7fffad0e0000 ---p 00210000 fd:00 8390335                    /usr/lib64/libc-2.25.so
7fffad0e0000-7fffad0f0000 r--p 00210000 fd:00 8390335                    /usr/lib64/libc-2.25.so
7fffad0f0000-7fffad100000 rw-p 00220000 fd:00 8390335                    /usr/lib64/libc-2.25.so
7fffad100000-7fffad110000 r--p 00000000 fd:00 16798925                   /usr/lib/locale/en_US.utf8/LC_IDENTIFICATION
7fffad110000-7fffad120000 r-xp 00000000 fd:00 63543                      /usr/bin/cat
7fffad120000-7fffad130000 r--p 00000000 fd:00 63543                      /usr/bin/cat
7fffad130000-7fffad140000 rw-p 00010000 fd:00 63543                      /usr/bin/cat
7fffad140000-7fffad160000 r-xp 00000000 00:00 0                          [vdso]
7fffad160000-7fffad1a0000 r-xp 00000000 fd:00 8390328                    /usr/lib64/ld-2.25.so
7fffad1a0000-7fffad1b0000 r--p 00030000 fd:00 8390328                    /usr/lib64/ld-2.25.so
7fffad1b0000-7fffad1c0000 rw-p 00040000 fd:00 8390328                    /usr/lib64/ld-2.25.so
7fffc2cf0000-7fffc2d20000 rw-p 00000000 00:00 0                          [stack]
7fffc8c10000-7fffc8c40000 rw-p 00000000 00:00 0                          [heap]

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: POWER: Unexpected fault when writing to brk-allocated memory
  2017-11-07 13:05                       ` Florian Weimer
@ 2017-11-07 13:16                         ` Kirill A. Shutemov
  2017-11-08  6:08                           ` Michael Ellerman
  0 siblings, 1 reply; 22+ messages in thread
From: Kirill A. Shutemov @ 2017-11-07 13:16 UTC (permalink / raw)
  To: Florian Weimer
  Cc: Nicholas Piggin, Aneesh Kumar K.V, Kirill A. Shutemov,
	linuxppc-dev, linux-mm, Andrew Morton, Andy Lutomirski,
	Dave Hansen, Linus Torvalds, Peter Zijlstra, Thomas Gleixner,
	linux-arch, Ingo Molnar, Linux Kernel Mailing List

On Tue, Nov 07, 2017 at 02:05:42PM +0100, Florian Weimer wrote:
> On 11/07/2017 12:44 PM, Kirill A. Shutemov wrote:
> > On Tue, Nov 07, 2017 at 12:26:12PM +0100, Florian Weimer wrote:
> > > On 11/07/2017 12:15 PM, Kirill A. Shutemov wrote:
> > > 
> > > > > First of all, using addr and MAP_FIXED to develop our heuristic can
> > > > > never really give unchanged ABI. It's an in-band signal. brk() is a
> > > > > good example that steadily keeps incrementing address, so depending
> > > > > on malloc usage and address space randomization, you will get a brk()
> > > > > that ends exactly at 128T, then the next one will be >
> > > > > DEFAULT_MAP_WINDOW, and it will switch you to 56 bit address space.
> > > > 
> > > > No, it won't. You will hit stack first.
> > > 
> > > That's not actually true on POWER in some cases.  See the process maps I
> > > posted here:
> > > 
> > >    <https://marc.info/?l=linuxppc-embedded&m=150988538106263&w=2>
> > 
> > Hm? I see that in all three cases the [stack] is the last mapping.
> > Do I miss something?
> 
> Hah, I had not noticed.  Occasionally, the order of heap and stack is
> reversed.  This happens in approximately 15% of the runs.

Heh. I guess ASLR on Power is too fancy :)

That's strange layout. It doesn't give that much (relatively speaking)
virtual address space for both stack and heap to grow.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: POWER: Unexpected fault when writing to brk-allocated memory
  2017-11-07 12:28                     ` Kirill A. Shutemov
@ 2017-11-07 13:33                       ` Nicholas Piggin
  2017-11-07 13:45                         ` Aneesh Kumar K.V
  0 siblings, 1 reply; 22+ messages in thread
From: Nicholas Piggin @ 2017-11-07 13:33 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Florian Weimer, Aneesh Kumar K.V, Kirill A. Shutemov,
	linuxppc-dev, linux-mm, Andrew Morton, Andy Lutomirski,
	Dave Hansen, Linus Torvalds, Peter Zijlstra, Thomas Gleixner,
	linux-arch, Ingo Molnar, Linux Kernel Mailing List

On Tue, 7 Nov 2017 15:28:25 +0300
"Kirill A. Shutemov" <kirill@shutemov.name> wrote:

> On Tue, Nov 07, 2017 at 10:56:36PM +1100, Nicholas Piggin wrote:
> > > No, it won't. You will hit stack first.  
> > 
> > I guess so. Florian's bug didn't crash there for some reason, okay
> > but I suppose my point about brk is not exactly where the standard
> > heap is, but the pattern of allocations. An allocator that uses
> > mmap for managing its address space might do the same thing, e.g.,
> > incrementally expand existing mmaps as necessary.  
> 
> With MAP_FIXED? I don't think so.

brk() based allocator effectively usees MAP_FIXED. If you know where
your addresses are, using MAP_FIXED can be used.

But okay let's ignore MAP_FIXED as a corner case. Then what happens
with !MAP_FIXED when an allocation ends exactly at 128TB and then next
one begins at 128TB? Won't that expand the address space? Should we
ignore that corner case too?

> 
> > > > Second, the kernel can never completely solve the problem this way.
> > > > How do we know a malloc library will not ask for > 128TB addresses
> > > > and pass them to an unknowing application?    
> > > 
> > > The idea is that an application can provide hint (mallopt() ?) to malloc
> > > implementation that it's ready to full address space. In this case, malloc
> > > can use mmap((void *) -1,...) for its allocations and get full address
> > > space this way.  
> > 
> > Point is, there's nothing stopping an allocator library or runtime
> > from asking for mmap anywhere and returning it to userspace.  
> 
> Right. Nobody would stop it from doing stupid things. There are many
> things that a library may do that application would not be happy about.

Indeed.

> 
> > Do > 128TB pointers matter so much that we should add this heuristic
> > to prevent breakage, but little enough that we can accept some rare
> > cases getting through? Genuine question.  
> 
> At the end of the day what matters is if heuristic helps prevent breakage
> of existing userspace and doesn't stay in the way of legitimate use of
> full address space.
> 
> So far, it looks okay to me.

Well that wasn't really the point of my question. Yes of course that
is important. But the question is how are these heuristics chosen and
evaluated? Why is this a good change to make?

We've decided there is some benefit from preventing 128TB pointers, but
also not enough that we have to completely prevent them accidentally
being returned. Yet it's important enough to make mmap behaviour diverge
in about 5 ways around 128TB depending on what combination of address
and length and MAP_FIXED you specify, and introducing this new way to
use the interface to get an expanded address space?

And we're doing this because we don't want to add a prctl or personality
or whatever for SAP HANA, because it was decided that would be too complex?

> 
> > > The idea was we shouldn't allow to slip above 47-bits by accidentally.
> > > 
> > > Correctly functioning program would never request addr+len above 47-bit
> > > with MAP_FIXED, unless it's ready to handle such addresses. Otherwise the
> > > request would simply fail on machine that doesn't support large VA.
> > > 
> > > In contrast, addr+len above 47-bit without MAP_FIXED will not fail on
> > > machine that doesn't support large VA, kernel will find another place
> > > under 47-bit. And I can imagine a reasonable application that does
> > > something like this.
> > > 
> > > So we cannot rely that application is ready to handle large
> > > addresses if we see addr+len without MAP_FIXED.  
> > 
> > By the same logic, a request for addr > 128TB without MAP_FIXED will
> > not fail, therefore we can't rely on that either.
> > 
> > Or an app that links to a library that attempts MAP_FIXED allocation
> > of addr + len above 128TB might use high bits of pointer returned by
> > that library because those are never satisfied today and the library
> > would fall back.  
> 
> If you want to point that it's ABI break, yes it is.
> 
> But we allow ABI break as long as nobody notices. I think it's reasonable
> to expect that nobody relies on such corner cases.

I accept the point, but I do worry mmap syscall is very fundamental,
and that it is used in a lot of ways that are difficult to foresee,
so there's a lot of room for unintended consequences. Plus the state
of the code (that's not to pick on x86, powerpc is worse) is worrying.

> 
> If we would find any piece of software affect by the change we would need
> to reconsider.
> 

Problem is that if there was an issue caused by this, it's unlikely
to be found until a long time later. By that time, hopefully there
aren't too many other apps that are now rely on the behaviour if it
has to be changed.

I just don't see why that's a risk worth taking at all. I can't see how
the cost benefit is there. I have to confess I could have been more
involved in discussion, but is any other interface honestly too complex
to add?

If it is decided to keep these kind of heuristics, can we get just a
small but reasonably precise description of each change to the
interface and ways for using the new functionality, such that would be
suitable for the man page? I couldn't fix powerpc because nothing
matches and even Aneesh and you differ on some details (MAP_FIXED
behaviour).

Thanks,
Nick

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: POWER: Unexpected fault when writing to brk-allocated memory
  2017-11-07 13:33                       ` Nicholas Piggin
@ 2017-11-07 13:45                         ` Aneesh Kumar K.V
  2017-11-07 14:01                           ` Kirill A. Shutemov
  2017-11-08  4:56                           ` Michael Ellerman
  0 siblings, 2 replies; 22+ messages in thread
From: Aneesh Kumar K.V @ 2017-11-07 13:45 UTC (permalink / raw)
  To: Nicholas Piggin, Kirill A. Shutemov
  Cc: Florian Weimer, Kirill A. Shutemov, linuxppc-dev, linux-mm,
	Andrew Morton, Andy Lutomirski, Dave Hansen, Linus Torvalds,
	Peter Zijlstra, Thomas Gleixner, linux-arch, Ingo Molnar,
	Linux Kernel Mailing List


> 
> If it is decided to keep these kind of heuristics, can we get just a
> small but reasonably precise description of each change to the
> interface and ways for using the new functionality, such that would be
> suitable for the man page? I couldn't fix powerpc because nothing
> matches and even Aneesh and you differ on some details (MAP_FIXED
> behaviour).


I would consider MAP_FIXED as my mistake. We never discussed this 
explicitly and I kind of assumed it to behave the same way. ie, we 
search in lower address space (128TB) if the hint addr is below 128TB.

IIUC we agree on the below.

1) MAP_FIXED allow the addr to be used, even if hint addr is below 128TB 
but hint_addr + len is > 128TB.

2) For everything else we search in < 128TB space if hint addr is below 
128TB

3) We don't switch to large address space if hint_addr + len > 128TB. 
The decision to switch to large address space is primarily based on hint 
addr

Is there any other rule we need to outline? Or is any of the above not 
correct?

-aneesh

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: POWER: Unexpected fault when writing to brk-allocated memory
  2017-11-07 13:45                         ` Aneesh Kumar K.V
@ 2017-11-07 14:01                           ` Kirill A. Shutemov
  2017-11-09 17:15                             ` Nicholas Piggin
  2017-11-08  4:56                           ` Michael Ellerman
  1 sibling, 1 reply; 22+ messages in thread
From: Kirill A. Shutemov @ 2017-11-07 14:01 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: Nicholas Piggin, Florian Weimer, Kirill A. Shutemov,
	linuxppc-dev, linux-mm, Andrew Morton, Andy Lutomirski,
	Dave Hansen, Linus Torvalds, Peter Zijlstra, Thomas Gleixner,
	linux-arch, Ingo Molnar, Linux Kernel Mailing List

On Tue, Nov 07, 2017 at 07:15:58PM +0530, Aneesh Kumar K.V wrote:
> 
> > 
> > If it is decided to keep these kind of heuristics, can we get just a
> > small but reasonably precise description of each change to the
> > interface and ways for using the new functionality, such that would be
> > suitable for the man page? I couldn't fix powerpc because nothing
> > matches and even Aneesh and you differ on some details (MAP_FIXED
> > behaviour).
> 
> 
> I would consider MAP_FIXED as my mistake. We never discussed this explicitly
> and I kind of assumed it to behave the same way. ie, we search in lower
> address space (128TB) if the hint addr is below 128TB.
> 
> IIUC we agree on the below.
> 
> 1) MAP_FIXED allow the addr to be used, even if hint addr is below 128TB but
> hint_addr + len is > 128TB.
> 
> 2) For everything else we search in < 128TB space if hint addr is below
> 128TB
> 
> 3) We don't switch to large address space if hint_addr + len > 128TB. The
> decision to switch to large address space is primarily based on hint addr
> 
> Is there any other rule we need to outline? Or is any of the above not
> correct?

That's correct.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: POWER: Unexpected fault when writing to brk-allocated memory
  2017-11-07 13:45                         ` Aneesh Kumar K.V
  2017-11-07 14:01                           ` Kirill A. Shutemov
@ 2017-11-08  4:56                           ` Michael Ellerman
  2017-11-08  8:30                             ` Kirill A. Shutemov
  1 sibling, 1 reply; 22+ messages in thread
From: Michael Ellerman @ 2017-11-08  4:56 UTC (permalink / raw)
  To: Aneesh Kumar K.V, Nicholas Piggin, Kirill A. Shutemov
  Cc: Florian Weimer, linux-arch, Dave Hansen, Peter Zijlstra,
	Linus Torvalds, Ingo Molnar, Linux Kernel Mailing List,
	Andy Lutomirski, linux-mm, Andrew Morton, linuxppc-dev,
	Thomas Gleixner, Kirill A. Shutemov

"Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> writes:

>> 
>> If it is decided to keep these kind of heuristics, can we get just a
>> small but reasonably precise description of each change to the
>> interface and ways for using the new functionality, such that would be
>> suitable for the man page? I couldn't fix powerpc because nothing
>> matches and even Aneesh and you differ on some details (MAP_FIXED
>> behaviour).
>
>
> I would consider MAP_FIXED as my mistake. We never discussed this 
> explicitly and I kind of assumed it to behave the same way. ie, we 
> search in lower address space (128TB) if the hint addr is below 128TB.
>
> IIUC we agree on the below.
>
> 1) MAP_FIXED allow the addr to be used, even if hint addr is below 128TB 
> but hint_addr + len is > 128TB.

So:
  mmap(0x7ffffffff000, 0x2000, ..., MAP_FIXED ...) = 0x7ffffffff000

> 2) For everything else we search in < 128TB space if hint addr is below 
> 128TB

  mmap((x < 128T), 0x1000, ...) = (y < 128T)
  ...
  mmap(0x7ffffffff000, 0x1000, ...) = 0x7ffffffff000
  mmap(0x800000000000, 0x1000, ...) = 0x800000000000
  ...
  mmap((x >= 128T), 0x1000, ...) = (y >= 128T)

> 3) We don't switch to large address space if hint_addr + len > 128TB. 
> The decision to switch to large address space is primarily based on hint 
> addr

But does the mmap succeed in that case or not?

ie:  mmap(0x7ffffffff000, 0x2000, ...) = ?

cheers

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: POWER: Unexpected fault when writing to brk-allocated memory
  2017-11-07 13:16                         ` Kirill A. Shutemov
@ 2017-11-08  6:08                           ` Michael Ellerman
  2017-11-08  6:18                             ` Florian Weimer
  0 siblings, 1 reply; 22+ messages in thread
From: Michael Ellerman @ 2017-11-08  6:08 UTC (permalink / raw)
  To: Kirill A. Shutemov, Florian Weimer, Kees Cook
  Cc: linux-arch, Dave Hansen, Peter Zijlstra, Linus Torvalds,
	Ingo Molnar, Linux Kernel Mailing List, Nicholas Piggin,
	Andy Lutomirski, linux-mm, Aneesh Kumar K.V, Andrew Morton,
	linuxppc-dev, Thomas Gleixner, Kirill A. Shutemov

"Kirill A. Shutemov" <kirill@shutemov.name> writes:

> On Tue, Nov 07, 2017 at 02:05:42PM +0100, Florian Weimer wrote:
>> On 11/07/2017 12:44 PM, Kirill A. Shutemov wrote:
>> > On Tue, Nov 07, 2017 at 12:26:12PM +0100, Florian Weimer wrote:
>> > > On 11/07/2017 12:15 PM, Kirill A. Shutemov wrote:
>> > > 
>> > > > > First of all, using addr and MAP_FIXED to develop our heuristic can
>> > > > > never really give unchanged ABI. It's an in-band signal. brk() is a
>> > > > > good example that steadily keeps incrementing address, so depending
>> > > > > on malloc usage and address space randomization, you will get a brk()
>> > > > > that ends exactly at 128T, then the next one will be >
>> > > > > DEFAULT_MAP_WINDOW, and it will switch you to 56 bit address space.
>> > > > 
>> > > > No, it won't. You will hit stack first.
>> > > 
>> > > That's not actually true on POWER in some cases.  See the process maps I
>> > > posted here:
>> > > 
>> > >    <https://marc.info/?l=linuxppc-embedded&m=150988538106263&w=2>
>> > 
>> > Hm? I see that in all three cases the [stack] is the last mapping.
>> > Do I miss something?
>> 
>> Hah, I had not noticed.  Occasionally, the order of heap and stack is
>> reversed.  This happens in approximately 15% of the runs.
>
> Heh. I guess ASLR on Power is too fancy :)

Fancy implies we're doing it on purpose :P

> That's strange layout. It doesn't give that much (relatively speaking)
> virtual address space for both stack and heap to grow.

I'm pretty sure it only happens when you're running an ELF interpreter
directly, because of Kees patch which changed the logic to load ELF
interpreters in the mmap region, vs PIE binaries which go to
ELF_ET_DYN_BASE. (eab09532d400 ("binfmt_elf: use ELF_ET_DYN_BASE only
for PIE"))

It only happens with ASLR enabled. Presumably it's because our brk_rnd()
is overly aggressive in this case, it randomises up to 1GB, and the heap
jumps over the stack.

cheers

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: POWER: Unexpected fault when writing to brk-allocated memory
  2017-11-08  6:08                           ` Michael Ellerman
@ 2017-11-08  6:18                             ` Florian Weimer
  0 siblings, 0 replies; 22+ messages in thread
From: Florian Weimer @ 2017-11-08  6:18 UTC (permalink / raw)
  To: Michael Ellerman, Kirill A. Shutemov, Kees Cook
  Cc: linux-arch, Dave Hansen, Peter Zijlstra, Linus Torvalds,
	Ingo Molnar, Linux Kernel Mailing List, Nicholas Piggin,
	Andy Lutomirski, linux-mm, Aneesh Kumar K.V, Andrew Morton,
	linuxppc-dev, Thomas Gleixner, Kirill A. Shutemov

On 11/08/2017 07:08 AM, Michael Ellerman wrote:
> "Kirill A. Shutemov" <kirill@shutemov.name> writes:
> 
>> On Tue, Nov 07, 2017 at 02:05:42PM +0100, Florian Weimer wrote:
>>> On 11/07/2017 12:44 PM, Kirill A. Shutemov wrote:
>>>> On Tue, Nov 07, 2017 at 12:26:12PM +0100, Florian Weimer wrote:
>>>>> On 11/07/2017 12:15 PM, Kirill A. Shutemov wrote:
>>>>>
>>>>>>> First of all, using addr and MAP_FIXED to develop our heuristic can
>>>>>>> never really give unchanged ABI. It's an in-band signal. brk() is a
>>>>>>> good example that steadily keeps incrementing address, so depending
>>>>>>> on malloc usage and address space randomization, you will get a brk()
>>>>>>> that ends exactly at 128T, then the next one will be >
>>>>>>> DEFAULT_MAP_WINDOW, and it will switch you to 56 bit address space.
>>>>>>
>>>>>> No, it won't. You will hit stack first.
>>>>>
>>>>> That's not actually true on POWER in some cases.  See the process maps I
>>>>> posted here:
>>>>>
>>>>>     <https://marc.info/?l=linuxppc-embedded&m=150988538106263&w=2>
>>>>
>>>> Hm? I see that in all three cases the [stack] is the last mapping.
>>>> Do I miss something?
>>>
>>> Hah, I had not noticed.  Occasionally, the order of heap and stack is
>>> reversed.  This happens in approximately 15% of the runs.
>>
>> Heh. I guess ASLR on Power is too fancy :)
> 
> Fancy implies we're doing it on purpose :P
> 
>> That's strange layout. It doesn't give that much (relatively speaking)
>> virtual address space for both stack and heap to grow.
> 
> I'm pretty sure it only happens when you're running an ELF interpreter
> directly, because of Kees patch which changed the logic to load ELF
> interpreters in the mmap region, vs PIE binaries which go to
> ELF_ET_DYN_BASE. (eab09532d400 ("binfmt_elf: use ELF_ET_DYN_BASE only
> for PIE"))

 From that commit:

+                   * There are effectively two types of ET_DYN
+                   * binaries: programs (i.e. PIE: ET_DYN with INTERP)
+                   * and loaders (ET_DYN without INTERP, since they
+                   * _are_ the ELF interpreter). The loaders must

Note that the comment is a bit misleading: statically linked PIE 
binaries are ET_DYN without INTERP, too.

So any oddity which is observable today with an explicitly ld.so 
invocation only will gain more relevance once we get static PIE support 
in user space because it will then affect regular applications, too. 
(Well, statically linked ones.)  In this sense, process layouts which 
cause premature brk failure or insufficient stack allocations are real bugs.

Thanks,
Florian

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: POWER: Unexpected fault when writing to brk-allocated memory
  2017-11-08  4:56                           ` Michael Ellerman
@ 2017-11-08  8:30                             ` Kirill A. Shutemov
  0 siblings, 0 replies; 22+ messages in thread
From: Kirill A. Shutemov @ 2017-11-08  8:30 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: Aneesh Kumar K.V, Nicholas Piggin, Florian Weimer, linux-arch,
	Dave Hansen, Peter Zijlstra, Linus Torvalds, Ingo Molnar,
	Linux Kernel Mailing List, Andy Lutomirski, linux-mm,
	Andrew Morton, linuxppc-dev, Thomas Gleixner, Kirill A. Shutemov

On Wed, Nov 08, 2017 at 03:56:06PM +1100, Michael Ellerman wrote:
> "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> writes:
> 
> >> 
> >> If it is decided to keep these kind of heuristics, can we get just a
> >> small but reasonably precise description of each change to the
> >> interface and ways for using the new functionality, such that would be
> >> suitable for the man page? I couldn't fix powerpc because nothing
> >> matches and even Aneesh and you differ on some details (MAP_FIXED
> >> behaviour).
> >
> >
> > I would consider MAP_FIXED as my mistake. We never discussed this 
> > explicitly and I kind of assumed it to behave the same way. ie, we 
> > search in lower address space (128TB) if the hint addr is below 128TB.
> >
> > IIUC we agree on the below.
> >
> > 1) MAP_FIXED allow the addr to be used, even if hint addr is below 128TB 
> > but hint_addr + len is > 128TB.
> 
> So:
>   mmap(0x7ffffffff000, 0x2000, ..., MAP_FIXED ...) = 0x7ffffffff000
> 
> > 2) For everything else we search in < 128TB space if hint addr is below 
> > 128TB
> 
>   mmap((x < 128T), 0x1000, ...) = (y < 128T)
>   ...
>   mmap(0x7ffffffff000, 0x1000, ...) = 0x7ffffffff000
>   mmap(0x800000000000, 0x1000, ...) = 0x800000000000
>   ...
>   mmap((x >= 128T), 0x1000, ...) = (y >= 128T)
> 
> > 3) We don't switch to large address space if hint_addr + len > 128TB. 
> > The decision to switch to large address space is primarily based on hint 
> > addr
> 
> But does the mmap succeed in that case or not?
> 
> ie:  mmap(0x7ffffffff000, 0x2000, ...) = ?

It does, but resulting address doesn't match the hint. It's somewhere
below 47-bit border.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: POWER: Unexpected fault when writing to brk-allocated memory
  2017-11-07 14:01                           ` Kirill A. Shutemov
@ 2017-11-09 17:15                             ` Nicholas Piggin
  2017-11-09 19:44                               ` Matthew Wilcox
  0 siblings, 1 reply; 22+ messages in thread
From: Nicholas Piggin @ 2017-11-09 17:15 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Aneesh Kumar K.V, Florian Weimer, Kirill A. Shutemov,
	linuxppc-dev, linux-mm, Andrew Morton, Andy Lutomirski,
	Dave Hansen, Linus Torvalds, Peter Zijlstra, Thomas Gleixner,
	linux-arch, Ingo Molnar, Linux Kernel Mailing List

On Tue, 7 Nov 2017 17:01:58 +0300
"Kirill A. Shutemov" <kirill@shutemov.name> wrote:

> On Tue, Nov 07, 2017 at 07:15:58PM +0530, Aneesh Kumar K.V wrote:
> >   
> > > 
> > > If it is decided to keep these kind of heuristics, can we get just a
> > > small but reasonably precise description of each change to the
> > > interface and ways for using the new functionality, such that would be
> > > suitable for the man page? I couldn't fix powerpc because nothing
> > > matches and even Aneesh and you differ on some details (MAP_FIXED
> > > behaviour).  
> > 
> > 
> > I would consider MAP_FIXED as my mistake. We never discussed this explicitly
> > and I kind of assumed it to behave the same way. ie, we search in lower
> > address space (128TB) if the hint addr is below 128TB.
> > 
> > IIUC we agree on the below.
> > 
> > 1) MAP_FIXED allow the addr to be used, even if hint addr is below 128TB but
> > hint_addr + len is > 128TB.
> > 
> > 2) For everything else we search in < 128TB space if hint addr is below
> > 128TB
> > 
> > 3) We don't switch to large address space if hint_addr + len > 128TB. The
> > decision to switch to large address space is primarily based on hint addr
> > 
> > Is there any other rule we need to outline? Or is any of the above not
> > correct?  
> 
> That's correct.
> 

Thanks guys, I'll send out some powerpc patches to match -- it deviates in
its MAP_FIXED handling (treats it the same as !MAP_FIXED).

So these semantics are what we're going with? Anything that does mmap() is
guaranteed of getting a 47-bit pointer and it can use the top 17 bits for
itself? Is intended to be cross-platform or just x86 and power specific?

Also, this may follow from deduction from 1-3, but for explicit
specification in man page:

4) To get an unspecified allocation with the largest possible address range,
we pass in -1 for mmap hint.

Are we allowing 8 bits bits of unused address in this case, or must the
app not assume anything about number of bits used?

Thanks,
Nick

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: POWER: Unexpected fault when writing to brk-allocated memory
  2017-11-09 17:15                             ` Nicholas Piggin
@ 2017-11-09 19:44                               ` Matthew Wilcox
  2017-11-10  1:26                                 ` Nicholas Piggin
       [not found]                                 ` <063D6719AE5E284EB5DD2968C1650D6DD00B84EF@AcuExch.aculab.com>
  0 siblings, 2 replies; 22+ messages in thread
From: Matthew Wilcox @ 2017-11-09 19:44 UTC (permalink / raw)
  To: Nicholas Piggin
  Cc: Kirill A. Shutemov, Aneesh Kumar K.V, Florian Weimer,
	Kirill A. Shutemov, linuxppc-dev, linux-mm, Andrew Morton,
	Andy Lutomirski, Dave Hansen, Linus Torvalds, Peter Zijlstra,
	Thomas Gleixner, linux-arch, Ingo Molnar,
	Linux Kernel Mailing List

On Fri, Nov 10, 2017 at 04:15:26AM +1100, Nicholas Piggin wrote:
> So these semantics are what we're going with? Anything that does mmap() is
> guaranteed of getting a 47-bit pointer and it can use the top 17 bits for
> itself? Is intended to be cross-platform or just x86 and power specific?

It is x86 and powerpc specific.  The arm64 people have apparently stumbled
across apps that expect to be able to use bit 48 for their own purposes.
And their address space is 48 bit by default.  Oops.

> Also, this may follow from deduction from 1-3, but for explicit
> specification in man page:
> 
> 4) To get an unspecified allocation with the largest possible address range,
> we pass in -1 for mmap hint.
> 
> Are we allowing 8 bits bits of unused address in this case, or must the
> app not assume anything about number of bits used?

Maybe document it as: "If the app wants to use the top N bits of addresses
for its own purposes, pass in (~0UL >> N) as the mmap hint."  ?

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: POWER: Unexpected fault when writing to brk-allocated memory
  2017-11-09 19:44                               ` Matthew Wilcox
@ 2017-11-10  1:26                                 ` Nicholas Piggin
       [not found]                                 ` <063D6719AE5E284EB5DD2968C1650D6DD00B84EF@AcuExch.aculab.com>
  1 sibling, 0 replies; 22+ messages in thread
From: Nicholas Piggin @ 2017-11-10  1:26 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Kirill A. Shutemov, Aneesh Kumar K.V, Florian Weimer,
	Kirill A. Shutemov, linuxppc-dev, linux-mm, Andrew Morton,
	Andy Lutomirski, Dave Hansen, Linus Torvalds, Peter Zijlstra,
	Thomas Gleixner, linux-arch, Ingo Molnar,
	Linux Kernel Mailing List

On Thu, 9 Nov 2017 11:44:21 -0800
Matthew Wilcox <willy@infradead.org> wrote:

> On Fri, Nov 10, 2017 at 04:15:26AM +1100, Nicholas Piggin wrote:
> > So these semantics are what we're going with? Anything that does mmap() is
> > guaranteed of getting a 47-bit pointer and it can use the top 17 bits for
> > itself? Is intended to be cross-platform or just x86 and power specific?  
> 
> It is x86 and powerpc specific.  The arm64 people have apparently stumbled
> across apps that expect to be able to use bit 48 for their own purposes.
> And their address space is 48 bit by default.  Oops.

Okay, so it's something we should make into an "official" API?

> 
> > Also, this may follow from deduction from 1-3, but for explicit
> > specification in man page:
> > 
> > 4) To get an unspecified allocation with the largest possible address range,
> > we pass in -1 for mmap hint.
> > 
> > Are we allowing 8 bits bits of unused address in this case, or must the
> > app not assume anything about number of bits used?  
> 
> Maybe document it as: "If the app wants to use the top N bits of addresses
> for its own purposes, pass in (~0UL >> N) as the mmap hint."  ?

Well we don't have code for that yet, but the problem would also be that
it succeeds, and actually it probably goes over the limit. So you would
have to map a dummy page there so subsequent hints to fail and fall back.
Not sure... it would be nice to be able to specify number of bits, but I
think this gets a bit hairy. -1 to use all bits seems a bit easier.

Thanks,
Nick

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: POWER: Unexpected fault when writing to brk-allocated memory
       [not found]                                 ` <063D6719AE5E284EB5DD2968C1650D6DD00B84EF@AcuExch.aculab.com>
@ 2017-11-11 10:30                                   ` Nicholas Piggin
  0 siblings, 0 replies; 22+ messages in thread
From: Nicholas Piggin @ 2017-11-11 10:30 UTC (permalink / raw)
  To: David Laight
  Cc: 'Matthew Wilcox',
	Florian Weimer, linux-arch, Dave Hansen, Peter Zijlstra,
	Linus Torvalds, Ingo Molnar, Linux Kernel Mailing List,
	Andy Lutomirski, linux-mm, Aneesh Kumar K.V, Kirill A. Shutemov,
	Andrew Morton, linuxppc-dev, Thomas Gleixner, Kirill A. Shutemov

On Fri, 10 Nov 2017 12:08:35 +0000
David Laight <David.Laight@ACULAB.COM> wrote:

> From: Matthew Wilcox
> > Sent: 09 November 2017 19:44
> > 
> > On Fri, Nov 10, 2017 at 04:15:26AM +1100, Nicholas Piggin wrote:  
> > > So these semantics are what we're going with? Anything that does mmap() is
> > > guaranteed of getting a 47-bit pointer and it can use the top 17 bits for
> > > itself? Is intended to be cross-platform or just x86 and power specific?  
> > 
> > It is x86 and powerpc specific.  The arm64 people have apparently stumbled
> > across apps that expect to be able to use bit 48 for their own purposes.
> > And their address space is 48 bit by default.  Oops.  
> 
> (Do you mean 49bit?)

I think he meant bit 47, which makes sense because they were probably
ported from x86-64 with 47 bit address. That seems to be why x86-64
5-level and powerpc decided to limit to a 47 bit address space by
default.

> 
> Aren't such apps just doomed to be broken?

Well they're not portable but they are not broken if virtual address
is limited.

> 
> ISTR there is something on (IIRC) sparc64 that does a 'match'
> on the high address bits to make it much harder to overrun
> one area into another.

I'm not sure about that but I think the problem would be the app
masking out bits from the pointer for its own use before ever
dereferencing it.

Thanks,
Nick

^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2017-11-11 10:31 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <f251fc3e-c657-ebe8-acc8-f55ab4caa667@redhat.com>
     [not found] ` <20171105231850.5e313e46@roar.ozlabs.ibm.com>
     [not found]   ` <871slcszfl.fsf@linux.vnet.ibm.com>
     [not found]     ` <20171106174707.19f6c495@roar.ozlabs.ibm.com>
     [not found]       ` <24b93038-76f7-33df-d02e-facb0ce61cd2@redhat.com>
     [not found]         ` <20171106192524.12ea3187@roar.ozlabs.ibm.com>
     [not found]           ` <d52581f4-8ca4-5421-0862-3098031e29a8@linux.vnet.ibm.com>
     [not found]             ` <546d4155-5b7c-6dba-b642-29c103e336bc@redhat.com>
2017-11-07  5:07               ` POWER: Unexpected fault when writing to brk-allocated memory Nicholas Piggin
2017-11-07  8:15                 ` Florian Weimer
2017-11-07  9:24                   ` Nicholas Piggin
2017-11-07 11:16                   ` Kirill A. Shutemov
2017-11-07 11:15                 ` Kirill A. Shutemov
2017-11-07 11:26                   ` Florian Weimer
2017-11-07 11:44                     ` Kirill A. Shutemov
2017-11-07 13:05                       ` Florian Weimer
2017-11-07 13:16                         ` Kirill A. Shutemov
2017-11-08  6:08                           ` Michael Ellerman
2017-11-08  6:18                             ` Florian Weimer
2017-11-07 11:56                   ` Nicholas Piggin
2017-11-07 12:28                     ` Kirill A. Shutemov
2017-11-07 13:33                       ` Nicholas Piggin
2017-11-07 13:45                         ` Aneesh Kumar K.V
2017-11-07 14:01                           ` Kirill A. Shutemov
2017-11-09 17:15                             ` Nicholas Piggin
2017-11-09 19:44                               ` Matthew Wilcox
2017-11-10  1:26                                 ` Nicholas Piggin
     [not found]                                 ` <063D6719AE5E284EB5DD2968C1650D6DD00B84EF@AcuExch.aculab.com>
2017-11-11 10:30                                   ` Nicholas Piggin
2017-11-08  4:56                           ` Michael Ellerman
2017-11-08  8:30                             ` Kirill A. Shutemov

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).