* Re: JITs and 52-bit VA
       [not found] <4A8E6E6D-6CF7-4964-A62E-467AE287D415@linaro.org>
@ 2016-06-22 14:53 ` Christopher Covington
  2016-06-22 15:13   ` Andy Lutomirski
  2016-06-22 15:40   ` Kirill A. Shutemov
  0 siblings, 2 replies; 15+ messages in thread
From: Christopher Covington @ 2016-06-22 14:53 UTC (permalink / raw)
  To: Maxim Kuvyrkov, Linaro Dev Mailman List
  Cc: Arnd Bergmann, Mark Brown, linux-mm, Dmitry Safonov,
	Andy Lutomirski, Cyrill Gorcunov

+Andy, Cyrill, Dmitry who have been discussing variable TASK_SIZE on x86
on linux-mm

http://marc.info/?l=linux-mm&m=146290118818484&w=2

>>> On 04/28/2016 09:00 AM, Maxim Kuvyrkov wrote:
>>>> This is a summary of discussions we had on IRC between kernel and
>>>> toolchain engineers regarding support for JITs and 52-bit virtual
>>>> address space (mostly in the context of LuaJIT, but this concerns other
>>>> JITs too).
>>>> 
>>>> The summary is that we need to consider ways of reducing the size of
>>>> VA for a given process or container on a Linux system.
>>>> 
>>>> The high-level problem is that JITs tend to use upper bits of
>>>> addresses to encode various pieces of data, and that the number of
>>>> available bits is shrinking due to VA size increasing. With the usual
>>>> 42-bit VA (which is what most JITs assume) they have 22 bits to encode
>>>> various performance-critical data. With 48-bit VA (e.g., ThunderX world)
>>>> things start to get complicated, and JITs need to be non-trivially
>>>> patched at the source level to continue working with fewer bits available
>>>> for their performance-critical storage. With upcoming 52-bit VA things
>>>> might get dire enough for some JITs to declare such configurations
>>>> unsupported.
>>>> 
>>>> On the other hand, most JITs are not expected to require terabytes
>>>> of RAM and huge VA for their applications. Most JIT applications will
>>>> happily live in 42-bit world with mere 4 terabytes of RAM that it
>>>> provides. Therefore, what JITs need in the modern world is a way to make
>>>> mmap() return addresses below a certain threshold, and error out with
>>>> ENOMEM when "lower" memory is exhausted. This is very similar to
>>>> ADDR_LIMIT_32BIT personality, but extended to common VA sizes on 64-bit
>>>> systems: 39-bit, 42-bit, 48-bit, 52-bit, etc.
>>>> 
>>>> Since we do not want to penalize the whole system (using an
>>>> artificially low-size VA), it would be best to have a way to enable VA
>>>> limit on per-process basis (similar to ADDR_LIMIT_32BIT personality). If
>>>> that's not possible -- then on per-container / cgroup basis. If that's
>>>> not possible -- then on system level (similar to vm.mmap_min_addr, but
>>>> from the other end).
>>>> 
>>>> Dear kernel people, what can be done to address the JITs' need to
>>>> reduce effective VA size?
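
As a rough illustration of the bit budget described above: a 64-bit word
holding a 42-bit address has 22 bits left over for tags, a 48-bit address
leaves 16, and a 52-bit address leaves only 12. The sketch below is not
taken from LuaJIT or any other JIT; the helper names (tag_bits_available,
box_ptr, unbox_ptr, unbox_tag) are made up for illustration, and it assumes
the tag is stored directly above the address bits.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: bits left for tags above a va_bits-wide address. */
static inline unsigned tag_bits_available(unsigned va_bits)
{
    return 64u - va_bits;
}

/*
 * Pack a tag into the bits above va_bits. Assumes addr fits in va_bits
 * and tag fits in the remaining bits; both assumptions are asserted.
 */
static inline uint64_t box_ptr(uint64_t addr, uint64_t tag, unsigned va_bits)
{
    assert(va_bits > 0 && va_bits < 64);
    assert((addr >> va_bits) == 0);
    assert((tag >> tag_bits_available(va_bits)) == 0);
    return (tag << va_bits) | addr;
}

/* Recover the address part by masking off everything above va_bits. */
static inline uint64_t unbox_ptr(uint64_t boxed, unsigned va_bits)
{
    return boxed & ((UINT64_C(1) << va_bits) - 1);
}

/* Recover the tag part by shifting the address bits away. */
static inline uint64_t unbox_tag(uint64_t boxed, unsigned va_bits)
{
    return boxed >> va_bits;
}
```

With va_bits fixed at 42 in such a scheme, any address at or above 1 << 42
trips the first range check in box_ptr(), which is the failure mode a VA
limit is meant to rule out.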

>> On 04/28/2016 09:17 AM, Arnd Bergmann wrote:
>>> Thanks for the summary, now it all makes much more sense.
>>> 
>>> One simple (from the kernel's perspective, not from the JIT) approach
>>> might be to always use MAP_FIXED whenever an allocation is made for
>>> memory that needs these special pointers, and then manage the available
>>> address space explicitly. Would that work, or do you require everything
>>> including the binary itself to be below the address?
>>> 
>>> Regarding which memory sizes are needed, my impression from your
>>> explanation is that a single personality flag (e.g. ADDR_LIMIT_42BIT)
>>> would be sufficient for the usecase, and you don't actually need to
>>> tie this to the architecture-provided virtual addressing limits
>>> at all. If it's only one such flag, we can probably find a way to fit
>>> it into the personality flags, though ironically we are actually
>>> running out of bits in there as well.

> On 04/28/2016 09:24 AM, Peter Maydell wrote:
>> The trouble IME with this idea is that in practice you're
>> linking with glibc, which means glibc is managing (and using)
>> the address space, not the JIT. So MAP_FIXED is pretty awkward
>> to use.

On 04/28/2016 03:27 PM, Steve Capper wrote:
> One can find holes in the VA space by examining /proc/self/maps, thus
> selection of pointers for MAP_FIXED can be deduced.
>
> The other problem is, as Arnd alluded to, if a JIT'ed object needs to
> then refer to something allocated outside of the JIT. This could be
> remedied by another level of indirection/trampoline.
>
> Taking two steps back though, I would view VA space squeezing as a
> stop-gap before removing tags from the upper bits of a pointer
> altogether (tagging the bottom bits, by controlling alignment is
> perfectly safe). The larger the VA space, the more scope mechanisms
> such as Address Space Layout Randomisation have to improve security.
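
For reference, the bottom-bit tagging mentioned above could look roughly
like the sketch below. It assumes 16-byte-aligned allocations, so the low
4 bits of every pointer are zero and can carry a tag. The names (TAG_ALIGN,
alloc_taggable, tag_low, untag_ptr, untag_tag) are hypothetical and not
taken from any existing JIT.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define TAG_ALIGN 16u                          /* assumed allocation alignment */
#define TAG_MASK  ((uintptr_t)TAG_ALIGN - 1)   /* low 4 bits are free for tags */

/* Allocate memory whose address has the low tag bits clear. */
static inline void *alloc_taggable(size_t size)
{
    /* C11 aligned_alloc() wants size to be a multiple of the alignment. */
    size_t rounded = (size + TAG_ALIGN - 1) & ~(size_t)(TAG_ALIGN - 1);
    return aligned_alloc(TAG_ALIGN, rounded);
}

/* Store a small tag in the low bits of an aligned pointer. */
static inline uintptr_t tag_low(void *p, unsigned tag)
{
    assert(((uintptr_t)p & TAG_MASK) == 0);
    assert(tag <= TAG_MASK);
    return (uintptr_t)p | tag;
}

/* Strip the tag to get back a dereferenceable pointer. */
static inline void *untag_ptr(uintptr_t v)
{
    return (void *)(v & ~TAG_MASK);
}

/* Extract the tag stored in the low bits. */
static inline unsigned untag_tag(uintptr_t v)
{
    return (unsigned)(v & TAG_MASK);
}
```

The tag width here depends only on alignment, not on the VA size, at the
cost of far fewer tag bits than the upper-bit schemes have today.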

I was working on an (AArch64-specific) auxiliary vector entry to export
TASK_SIZE to userspace at exec time. The goal was to allow for more
elegant, robust, and efficient replacements for the following changes:

https://hg.mozilla.org/integration/mozilla-inbound/rev/dfaafbaaa291

https://github.com/xemul/criu/commit/c0c0546c31e6df4932669f4740197bb830a24c8d

However based on the above discussion, it appears that some sort of
prctl(PR_GET_TASK_SIZE, ...) and prctl(PR_SET_TASK_SIZE, ...) may be
preferable for AArch64. (And perhaps other justifications for the new
calls influence the x86 decisions.) What do folks think?

Thanks,
Cov

-- 
Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>


* Re: JITs and 52-bit VA
  2016-06-22 14:53 ` JITs and 52-bit VA Christopher Covington
@ 2016-06-22 15:13   ` Andy Lutomirski
  2016-06-22 19:18     ` Cyrill Gorcunov
  2016-06-23  8:20     ` Dmitry Safonov
  2016-06-22 15:40   ` Kirill A. Shutemov
  1 sibling, 2 replies; 15+ messages in thread
From: Andy Lutomirski @ 2016-06-22 15:13 UTC (permalink / raw)
  To: Christopher Covington
  Cc: Maxim Kuvyrkov, Linaro Dev Mailman List, Arnd Bergmann,
	Mark Brown, linux-mm, Dmitry Safonov, Cyrill Gorcunov

On Wed, Jun 22, 2016 at 7:53 AM, Christopher Covington
<cov@codeaurora.org> wrote:
> +Andy, Cyrill, Dmitry who have been discussing variable TASK_SIZE on x86
> on linux-mm
>
> http://marc.info/?l=linux-mm&m=146290118818484&w=2
>
>>>> On 04/28/2016 09:00 AM, Maxim Kuvyrkov wrote:
>>>>> This is a summary of discussions we had on IRC between kernel and
>>>>> toolchain engineers regarding support for JITs and 52-bit virtual
>>>>> address space (mostly in the context of LuaJIT, but this concerns other
>>>>> JITs too).
>>>>>
>>>>> The summary is that we need to consider ways of reducing the size of
>>>>> VA for a given process or container on a Linux system.
>>>>>
>>>>> The high-level problem is that JITs tend to use upper bits of
>>>>> addresses to encode various pieces of data, and that the number of
>>>>> available bits is shrinking due to VA size increasing. With the usual
>>>>> 42-bit VA (which is what most JITs assume) they have 22 bits to encode
>>>>> various performance-critical data. With 48-bit VA (e.g., ThunderX world)
>>>>> things start to get complicated, and JITs need to be non-trivially
>>>>> patched at the source level to continue working with fewer bits available
>>>>> for their performance-critical storage. With upcoming 52-bit VA things
>>>>> might get dire enough for some JITs to declare such configurations
>>>>> unsupported.
>>>>>
>>>>> On the other hand, most JITs are not expected to require terabytes
>>>>> of RAM and huge VA for their applications. Most JIT applications will
>>>>> happily live in 42-bit world with mere 4 terabytes of RAM that it
>>>>> provides. Therefore, what JITs need in the modern world is a way to make
>>>>> mmap() return addresses below a certain threshold, and error out with
>>>>> ENOMEM when "lower" memory is exhausted. This is very similar to
>>>>> ADDR_LIMIT_32BIT personality, but extended to common VA sizes on 64-bit
>>>>> systems: 39-bit, 42-bit, 48-bit, 52-bit, etc.
>>>>>
>>>>> Since we do not want to penalize the whole system (using an
>>>>> artificially low-size VA), it would be best to have a way to enable VA
>>>>> limit on per-process basis (similar to ADDR_LIMIT_32BIT personality). If
>>>>> that's not possible -- then on per-container / cgroup basis. If that's
>>>>> not possible -- then on system level (similar to vm.mmap_min_addr, but
>>>>> from the other end).
>>>>>
>>>>> Dear kernel people, what can be done to address the JITs' need to
>>>>> reduce effective VA size?
>
>>> On 04/28/2016 09:17 AM, Arnd Bergmann wrote:
>>>> Thanks for the summary, now it all makes much more sense.
>>>>
>>>> One simple (from the kernel's perspective, not from the JIT) approach
>>>> might be to always use MAP_FIXED whenever an allocation is made for
>>>> memory that needs these special pointers, and then manage the available
>>>> address space explicitly. Would that work, or do you require everything
>>>> including the binary itself to be below the address?
>>>>
>>>> Regarding which memory sizes are needed, my impression from your
>>>> explanation is that a single personality flag (e.g. ADDR_LIMIT_42BIT)
>>>> would be sufficient for the usecase, and you don't actually need to
>>>> tie this to the architecture-provided virtual addressing limits
>>>> at all. If it's only one such flag, we can probably find a way to fit
>>>> it into the personality flags, though ironically we are actually
>>>> running out of bits in there as well.
>
>> On 04/28/2016 09:24 AM, Peter Maydell wrote:
>>> The trouble IME with this idea is that in practice you're
>>> linking with glibc, which means glibc is managing (and using)
>>> the address space, not the JIT. So MAP_FIXED is pretty awkward
>>> to use.
>
> On 04/28/2016 03:27 PM, Steve Capper wrote:
>> One can find holes in the VA space by examining /proc/self/maps, thus
>> selection of pointers for MAP_FIXED can be deduced.
>>
>> The other problem is, as Arnd alluded to, if a JIT'ed object needs to
>> then refer to something allocated outside of the JIT. This could be
>> remedied by another level of indirection/trampoline.
>>
>> Taking two steps back though, I would view VA space squeezing as a
>> stop-gap before removing tags from the upper bits of a pointer
>> altogether (tagging the bottom bits, by controlling alignment is
>> perfectly safe). The larger the VA space, the more scope mechanisms
>> such as Address Space Layout Randomisation have to improve security.
>
> I was working on an (AArch64-specific) auxiliary vector entry to export
> TASK_SIZE to userspace at exec time. The goal was to allow for more
> elegant, robust, and efficient replacements for the following changes:
>
> https://hg.mozilla.org/integration/mozilla-inbound/rev/dfaafbaaa291
>
> https://github.com/xemul/criu/commit/c0c0546c31e6df4932669f4740197bb830a24c8d
>
> However based on the above discussion, it appears that some sort of
> prctl(PR_GET_TASK_SIZE, ...) and prctl(PR_SET_TASK_SIZE, ...) may be
> preferable for AArch64. (And perhaps other justifications for the new
> calls influence the x86 decisions.) What do folks think?

I would advocate a slightly different approach:

 - Keep TASK_SIZE either unconditionally matching the hardware or keep
TASK_SIZE as the actual logical split between user and kernel
addresses.  Don't let it change at runtime under any circumstances.
The reason is that there have been plenty of bugs and
overcomplications that result from letting it vary.  For example, if
(addr < TASK_SIZE) really ought to be the correct check (assuming
USER_DS, anyway) for whether dereferencing addr will access user
memory, at least on architectures with a global address space (which
is most of them, I think).

 - If needed, introduce a clean concept of the maximum address that
mmap will return, but don't call it TASK_SIZE.  So, if a user program
wants to limit itself to less than the full hardware VA space (or less
than 63 bits, for that matter), it can.

As an example, a 32-bit x86 program really could have something mapped
above the 32-bit boundary.  It just wouldn't be useful, but the kernel
should still understand that it's *user* memory.

So you'd have PR_SET_MMAP_LIMIT and PR_GET_MMAP_LIMIT or similar instead.
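
Note that PR_SET_MMAP_LIMIT and PR_GET_MMAP_LIMIT are only names proposed
here, not existing prctl() options. Until something like that exists, the
closest user-space approximation is to pass mmap() a low hint and reject
any result above the limit. The sketch below assumes Linux and a 64-bit
uintptr_t; mmap_below is a hypothetical helper name. It only covers
mappings the JIT makes itself, not those made by glibc or other libraries.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

/*
 * Hypothetical helper, not a kernel interface: map len bytes of anonymous
 * memory and accept the result only if it lies entirely below limit.
 * hint is passed to mmap() as a suggestion; the kernel may ignore it.
 * Returns NULL (errno set to ENOMEM when the mapping landed too high).
 */
static inline void *mmap_below(uintptr_t limit, uintptr_t hint, size_t len)
{
    assert(len > 0);

    void *p = mmap((void *)hint, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return NULL;                 /* errno was set by mmap() */

    /* Reject the mapping unless [p, p + len) ends at or below limit. */
    if ((uintptr_t)p >= limit || len > limit - (uintptr_t)p) {
        munmap(p, len);
        errno = ENOMEM;
        return NULL;
    }
    return p;
}
```

A caller wanting 42-bit addresses might use
mmap_below((uintptr_t)1 << 42, (uintptr_t)1 << 32, size), treating NULL
the same way as exhausted "lower" memory.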

Also, before getting *too* excited about this kind of VA limit, keep
in mind that SPARC has invented this thing called "Application Data
Integrity".  It reuses some of the high address bits in hardware for
other purposes.  I wouldn't be totally shocked if other architectures
followed suit. (Although no one should copy SPARC's tagging scheme,
please: it's awful.  These things should be controlled at the MMU
level, not the cache tag level.  Otherwise aliased mappings get very
confused.)

--Andy


* Re: JITs and 52-bit VA
  2016-06-22 14:53 ` JITs and 52-bit VA Christopher Covington
  2016-06-22 15:13   ` Andy Lutomirski
@ 2016-06-22 15:40   ` Kirill A. Shutemov
  1 sibling, 0 replies; 15+ messages in thread
From: Kirill A. Shutemov @ 2016-06-22 15:40 UTC (permalink / raw)
  To: Christopher Covington
  Cc: Maxim Kuvyrkov, Linaro Dev Mailman List, Arnd Bergmann,
	Mark Brown, linux-mm, Dmitry Safonov, Andy Lutomirski,
	Cyrill Gorcunov

On Wed, Jun 22, 2016 at 10:53:50AM -0400, Christopher Covington wrote:
> +Andy, Cyrill, Dmitry who have been discussing variable TASK_SIZE on x86
> on linux-mm
> 
> http://marc.info/?l=linux-mm&m=146290118818484&w=2
> 
> >>> On 04/28/2016 09:00 AM, Maxim Kuvyrkov wrote:
> >>>> This is a summary of discussions we had on IRC between kernel and
> >>>> toolchain engineers regarding support for JITs and 52-bit virtual
> >>>> address space (mostly in the context of LuaJIT, but this concerns other
> >>>> JITs too).
> >>>> 
> >>>> The summary is that we need to consider ways of reducing the size of
> >>>> VA for a given process or container on a Linux system.
> >>>> 
> >>>> The high-level problem is that JITs tend to use upper bits of
> >>>> addresses to encode various pieces of data, and that the number of
> >>>> available bits is shrinking due to VA size increasing. With the usual
> >>>> 42-bit VA (which is what most JITs assume) they have 22 bits to encode
> >>>> various performance-critical data. With 48-bit VA (e.g., ThunderX world)
> >>>> things start to get complicated, and JITs need to be non-trivially
> >>>> patched at the source level to continue working with fewer bits available
> >>>> for their performance-critical storage. With upcoming 52-bit VA things
> >>>> might get dire enough for some JITs to declare such configurations
> >>>> unsupported.
> >>>> 
> >>>> On the other hand, most JITs are not expected to require terabytes
> >>>> of RAM and huge VA for their applications. Most JIT applications will
> >>>> happily live in 42-bit world with mere 4 terabytes of RAM that it
> >>>> provides. Therefore, what JITs need in the modern world is a way to make
> >>>> mmap() return addresses below a certain threshold, and error out with
> >>>> ENOMEM when "lower" memory is exhausted. This is very similar to
> >>>> ADDR_LIMIT_32BIT personality, but extended to common VA sizes on 64-bit
> >>>> systems: 39-bit, 42-bit, 48-bit, 52-bit, etc.
> >>>> 
> >>>> Since we do not want to penalize the whole system (using an
> >>>> artificially low-size VA), it would be best to have a way to enable VA
> >>>> limit on per-process basis (similar to ADDR_LIMIT_32BIT personality). If
> >>>> that's not possible -- then on per-container / cgroup basis. If that's
> >>>> not possible -- then on system level (similar to vm.mmap_min_addr, but
> >>>> from the other end).
> >>>> 
> >>>> Dear kernel people, what can be done to address the JITs' need to
> >>>> reduce effective VA size?

What about keeping applications within a known-to-be-safe VA size by
default, and requiring explicit opt-in for a larger one?

The opt-in can be provided in a few forms: personality()/prctl() or an ELF flag.

I think it's reasonable to set the large-VA ELF flag for newly compiled
binaries (unless specified otherwise). So they can benefit from the larger VA
size, but existing binaries wouldn't break.
I believe we had something similar for the non-executable stack transition.
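
For comparison, this is roughly how a process can check today whether the
existing ADDR_LIMIT_32BIT personality flag is set; a large-VA opt-in via
personality() would presumably be queried the same way, though no such
flag exists and none is assumed here. The helper name
has_addr_limit_32bit is made up for this sketch.

```c
#include <assert.h>
#include <sys/personality.h>

/* The persona value is handled as an int holding 32-bit flag words. */
static_assert(sizeof(int) * 8 >= 32, "personality flags need 32 bits");

/*
 * Returns 1 if the calling process has ADDR_LIMIT_32BIT set,
 * 0 if it does not, and -1 if the personality could not be read.
 */
static inline int has_addr_limit_32bit(void)
{
    /* personality(0xffffffff) reads the persona without changing it. */
    int persona = personality(0xffffffff);
    if (persona == -1)
        return -1;
    return (persona & ADDR_LIMIT_32BIT) ? 1 : 0;
}
```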

-- 
 Kirill A. Shutemov


* Re: JITs and 52-bit VA
  2016-06-22 15:13   ` Andy Lutomirski
@ 2016-06-22 19:18     ` Cyrill Gorcunov
  2016-06-22 19:20       ` Andy Lutomirski
  2016-06-23  8:20     ` Dmitry Safonov
  1 sibling, 1 reply; 15+ messages in thread
From: Cyrill Gorcunov @ 2016-06-22 19:18 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Christopher Covington, Maxim Kuvyrkov, Linaro Dev Mailman List,
	Arnd Bergmann, Mark Brown, linux-mm, Dmitry Safonov

On Wed, Jun 22, 2016 at 08:13:29AM -0700, Andy Lutomirski wrote:
...
> >
> > However based on the above discussion, it appears that some sort of
> > prctl(PR_GET_TASK_SIZE, ...) and prctl(PR_SET_TASK_SIZE, ...) may be
> > preferable for AArch64. (And perhaps other justifications for the new
> > calls influence the x86 decisions.) What do folks think?
> 
> I would advocate a slightly different approach:
> 
>  - Keep TASK_SIZE either unconditionally matching the hardware or keep
> TASK_SIZE as the actual logical split between user and kernel
> addresses.  Don't let it change at runtime under any circumstances.
> The reason is that there have been plenty of bugs and
> overcomplications that result from letting it vary.  For example, if
> (addr < TASK_SIZE) really ought to be the correct check (assuming
> USER_DS, anyway) for whether dereferencing addr will access user
> memory, at least on architectures with a global address space (which
> is most of them, I think).
> 
>  - If needed, introduce a clean concept of the maximum address that
> mmap will return, but don't call it TASK_SIZE.  So, if a user program
> wants to limit itself to less than the full hardware VA space (or less
> than 63 bits, for that matter), it can.
> 
> As an example, a 32-bit x86 program really could have something mapped
> above the 32-bit boundary.  It just wouldn't be useful, but the kernel
> should still understand that it's *user* memory.
> 
> So you'd have PR_SET_MMAP_LIMIT and PR_GET_MMAP_LIMIT or similar instead.

+1. Also it might be (not sure though, just guessing) suitable to do such
a thing via the memory cgroup controller, instead of carrying this limit
per process (or task structure/vma or mm).

> Also, before getting *too* excited about this kind of VA limit, keep
> in mind that SPARC has invented this thing called "Application Data
> Integrity".  It reuses some of the high address bits in hardware for
> other purposes.  I wouldn't be totally shocked if other architectures
> followed suit. (Although no one should copy SPARC's tagging scheme,
> please: it's awful.  These things should be controlled at the MMU
> level, not the cache tag level.  Otherwise aliased mappings get very
> confused.)

	Cyrill


* Re: JITs and 52-bit VA
  2016-06-22 19:18     ` Cyrill Gorcunov
@ 2016-06-22 19:20       ` Andy Lutomirski
  2016-06-22 19:44         ` Cyrill Gorcunov
  2016-06-22 19:56         ` Dave Hansen
  0 siblings, 2 replies; 15+ messages in thread
From: Andy Lutomirski @ 2016-06-22 19:20 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: Christopher Covington, Maxim Kuvyrkov, Linaro Dev Mailman List,
	Arnd Bergmann, Mark Brown, linux-mm, Dmitry Safonov

On Wed, Jun 22, 2016 at 12:18 PM, Cyrill Gorcunov <gorcunov@gmail.com> wrote:
> On Wed, Jun 22, 2016 at 08:13:29AM -0700, Andy Lutomirski wrote:
> ...
>> >
>> > However based on the above discussion, it appears that some sort of
>> > prctl(PR_GET_TASK_SIZE, ...) and prctl(PR_SET_TASK_SIZE, ...) may be
>> > preferable for AArch64. (And perhaps other justifications for the new
>> > calls influence the x86 decisions.) What do folks think?
>>
>> I would advocate a slightly different approach:
>>
>>  - Keep TASK_SIZE either unconditionally matching the hardware or keep
>> TASK_SIZE as the actual logical split between user and kernel
>> addresses.  Don't let it change at runtime under any circumstances.
>> The reason is that there have been plenty of bugs and
>> overcomplications that result from letting it vary.  For example, if
>> (addr < TASK_SIZE) really ought to be the correct check (assuming
>> USER_DS, anyway) for whether dereferencing addr will access user
>> memory, at least on architectures with a global address space (which
>> is most of them, I think).
>>
>>  - If needed, introduce a clean concept of the maximum address that
>> mmap will return, but don't call it TASK_SIZE.  So, if a user program
>> wants to limit itself to less than the full hardware VA space (or less
>> than 63 bits, for that matter), it can.
>>
>> As an example, a 32-bit x86 program really could have something mapped
>> above the 32-bit boundary.  It just wouldn't be useful, but the kernel
>> should still understand that it's *user* memory.
>>
>> So you'd have PR_SET_MMAP_LIMIT and PR_GET_MMAP_LIMIT or similar instead.
>
> +1. Also it might be (not sure though, just guessing) suitable to do such
> thing via memory cgroup controller, instead of carrying this limit per
> each process (or task structure/vma or mm).

I think we'll want this per mm.  After all, a high-VA-limit-aware bash
should be able to run high-VA-unaware programs without fiddling with
cgroups.


* Re: JITs and 52-bit VA
  2016-06-22 19:20       ` Andy Lutomirski
@ 2016-06-22 19:44         ` Cyrill Gorcunov
  2016-06-22 20:46           ` Andy Lutomirski
  2016-06-22 19:56         ` Dave Hansen
  1 sibling, 1 reply; 15+ messages in thread
From: Cyrill Gorcunov @ 2016-06-22 19:44 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Christopher Covington, Maxim Kuvyrkov, Linaro Dev Mailman List,
	Arnd Bergmann, Mark Brown, linux-mm, Dmitry Safonov

On Wed, Jun 22, 2016 at 12:20:13PM -0700, Andy Lutomirski wrote:
> >>
> >> As an example, a 32-bit x86 program really could have something mapped
> >> above the 32-bit boundary.  It just wouldn't be useful, but the kernel
> >> should still understand that it's *user* memory.
> >>
> >> So you'd have PR_SET_MMAP_LIMIT and PR_GET_MMAP_LIMIT or similar instead.
> >
> > +1. Also it might be (not sure though, just guessing) suitable to do such
> > thing via memory cgroup controller, instead of carrying this limit per
> > each process (or task structure/vma or mm).
> 
> I think we'll want this per mm.  After all, a high-VA-limit-aware bash
> should be able to run high-VA-unaware programs without fiddling with
> cgroups.

Wait. You mean having some flag in the mm struct and checking
its value on each mmap call?

	Cyrill


* Re: JITs and 52-bit VA
  2016-06-22 19:20       ` Andy Lutomirski
  2016-06-22 19:44         ` Cyrill Gorcunov
@ 2016-06-22 19:56         ` Dave Hansen
  2016-06-22 20:10           ` Cyrill Gorcunov
  2016-06-22 20:17           ` Cyrill Gorcunov
  1 sibling, 2 replies; 15+ messages in thread
From: Dave Hansen @ 2016-06-22 19:56 UTC (permalink / raw)
  To: Andy Lutomirski, Cyrill Gorcunov
  Cc: Christopher Covington, Maxim Kuvyrkov, Linaro Dev Mailman List,
	Arnd Bergmann, Mark Brown, linux-mm, Dmitry Safonov

On 06/22/2016 12:20 PM, Andy Lutomirski wrote:
>>> >> As an example, a 32-bit x86 program really could have something mapped
>>> >> above the 32-bit boundary.  It just wouldn't be useful, but the kernel
>>> >> should still understand that it's *user* memory.
>>> >>
>>> >> So you'd have PR_SET_MMAP_LIMIT and PR_GET_MMAP_LIMIT or similar instead.
>> >
>> > +1. Also it might be (not sure though, just guessing) suitable to do such
>> > thing via memory cgroup controller, instead of carrying this limit per
>> > each process (or task structure/vma or mm).
> I think we'll want this per mm.  After all, a high-VA-limit-aware bash
> should be able to run high-VA-unaware programs without fiddling with
> cgroups.

Yeah, cgroups don't make a lot of sense.

On x86, the 48-bit virtual address is even hard-coded in the ABI[1].  So
we can't change *any* program's layout without either breaking the ABI
or having it opt in.

But, we're also lucky to only have one VA layout since day one.

1. www.x86-64.org/documentation/abi.pdf - "... Therefore, conforming
processes may only use addresses from 0x00000000 00000000 to 0x00007fff
ffffffff."



* Re: JITs and 52-bit VA
  2016-06-22 19:56         ` Dave Hansen
@ 2016-06-22 20:10           ` Cyrill Gorcunov
  2016-06-22 20:17           ` Cyrill Gorcunov
  1 sibling, 0 replies; 15+ messages in thread
From: Cyrill Gorcunov @ 2016-06-22 20:10 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Andy Lutomirski, Christopher Covington, Maxim Kuvyrkov,
	Linaro Dev Mailman List, Arnd Bergmann, Mark Brown, linux-mm,
	Dmitry Safonov

On Wed, Jun 22, 2016 at 12:56:56PM -0700, Dave Hansen wrote:
> >> > +1. Also it might be (not sure though, just guessing) suitable to do such
> >> > thing via memory cgroup controller, instead of carrying this limit per
> >> > each process (or task structure/vma or mm).
> > I think we'll want this per mm.  After all, a high-VA-limit-aware bash
> > should be able to run high-VA-unaware programs without fiddling with
> > cgroups.
> 
> Yeah, cgroups don't make a lot of sense.

cgroups make sense in terms of shrinking data: we only need to
set up the limit once and every process living in the cgroup
gets the limit, with no need to carry it in every mm. So I guessed
it might be useful.

> On x86, the 48-bit virtual address is even hard-coded in the ABI[1].  So
> we can't change *any* program's layout without either breaking the ABI
> or having it opt in.
> 
> But, we're also lucky to only have one VA layout since day one.
> 
> 1. www.x86-64.org/documentation/abi.pdf - "... Therefore, conforming
> processes may only use addresses from 0x00000000 00000000 to 0x00007fff
> ffffffff."

	Cyrill


* Re: JITs and 52-bit VA
  2016-06-22 19:56         ` Dave Hansen
  2016-06-22 20:10           ` Cyrill Gorcunov
@ 2016-06-22 20:17           ` Cyrill Gorcunov
  2016-06-22 20:24             ` Kirill A. Shutemov
  2016-06-22 20:41             ` Dave Hansen
  1 sibling, 2 replies; 15+ messages in thread
From: Cyrill Gorcunov @ 2016-06-22 20:17 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Andy Lutomirski, Christopher Covington, Maxim Kuvyrkov,
	Linaro Dev Mailman List, Arnd Bergmann, Mark Brown, linux-mm,
	Dmitry Safonov

On Wed, Jun 22, 2016 at 12:56:56PM -0700, Dave Hansen wrote:
> 
> Yeah, cgroups don't make a lot of sense.
> 
> On x86, the 48-bit virtual address is even hard-coded in the ABI[1].  So
> we can't change *any* program's layout without either breaking the ABI
> or having it opt in.
> 
> But, we're also lucky to only have one VA layout since day one.
> 
> 1. www.x86-64.org/documentation/abi.pdf - "... Therefore, conforming
> processes may only use addresses from 0x00000000 00000000 to 0x00007fff
> ffffffff."

Yes, but no one forces you to write conforming programs ;)
After all, as long as the hw lets you use VAs wider than 48 bits
it's fine; any side effects of breaking the ABI are up to the
program author. (IIRC x86 allows up to 52 bits at the hw level,
but I don't have the specs at hand.)


* Re: JITs and 52-bit VA
  2016-06-22 20:17           ` Cyrill Gorcunov
@ 2016-06-22 20:24             ` Kirill A. Shutemov
  2016-06-22 20:41             ` Dave Hansen
  1 sibling, 0 replies; 15+ messages in thread
From: Kirill A. Shutemov @ 2016-06-22 20:24 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: Dave Hansen, Andy Lutomirski, Christopher Covington,
	Maxim Kuvyrkov, Linaro Dev Mailman List, Arnd Bergmann,
	Mark Brown, linux-mm, Dmitry Safonov

On Wed, Jun 22, 2016 at 11:17:54PM +0300, Cyrill Gorcunov wrote:
> On Wed, Jun 22, 2016 at 12:56:56PM -0700, Dave Hansen wrote:
> > 
> > Yeah, cgroups don't make a lot of sense.
> > 
> > On x86, the 48-bit virtual address is even hard-coded in the ABI[1].  So
> > we can't change *any* program's layout without either breaking the ABI
> > or having it opt in.
> > 
> > But, we're also lucky to only have one VA layout since day one.
> > 
> > 1. www.x86-64.org/documentation/abi.pdf - "... Therefore, conforming
> > processes may only use addresses from 0x00000000 00000000 to 0x00007fff
> > ffffffff."
> 
> Yes, but noone forces you to write conforming programs ;)
> After all while hw allows you to run VA with bits > than
> 48 it's fine, all side effects of breaking abi is up to
> program author (iirc on x86 there is up to 52 bits on
> hw level allowed, don't have specs under my hands?)

Nope. 48-bit VA (47-bit to userspace) and 46-bit PA.

-- 
 Kirill A. Shutemov


* Re: JITs and 52-bit VA
  2016-06-22 20:17           ` Cyrill Gorcunov
  2016-06-22 20:24             ` Kirill A. Shutemov
@ 2016-06-22 20:41             ` Dave Hansen
  2016-06-22 21:06               ` Cyrill Gorcunov
  1 sibling, 1 reply; 15+ messages in thread
From: Dave Hansen @ 2016-06-22 20:41 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: Andy Lutomirski, Christopher Covington, Maxim Kuvyrkov,
	Linaro Dev Mailman List, Arnd Bergmann, Mark Brown, linux-mm,
	Dmitry Safonov

On 06/22/2016 01:17 PM, Cyrill Gorcunov wrote:
> On Wed, Jun 22, 2016 at 12:56:56PM -0700, Dave Hansen wrote:
>>
>> Yeah, cgroups don't make a lot of sense.
>>
>> On x86, the 48-bit virtual address is even hard-coded in the ABI[1].  So
>> we can't change *any* program's layout without either breaking the ABI
>> or having it opt in.
>>
>> But, we're also lucky to only have one VA layout since day one.
>>
>> 1. www.x86-64.org/documentation/abi.pdf - "... Therefore, conforming
>> processes may only use addresses from 0x00000000 00000000 to 0x00007fff
>> ffffffff."
> 
> Yes, but no one forces you to write conforming programs ;)
> After all, as long as the hw lets you run with VAs wider than
> 48 bits it's fine; all side effects of breaking the abi are up to
> the program author (iirc on x86 up to 52 bits are allowed at the
> hw level, but I don't have the specs at hand).

My point was that you can't restrict the vaddr space without breaking
the ABI because apps expect to be able to use 0x00007fffffffffff.  You
also can't extend the vaddr space because apps can *also* expect that
there are no valid vaddrs past 0x00007fffffffffff.

So, whatever happens here, at least on x86, we can't do anything to the
vaddr space without it being an opt-in for *each* *app*.
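
Since any change has to be opted into per application, an application can already police its own mappings without kernel help. The following is a rough sketch under stated assumptions: range_fits_va_bits and map_within_bits are hypothetical names, and the approach (map, then reject anything outside the caller's bit budget) is an illustration, not something proposed in this thread.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

/* True if [addr, addr + len) lies entirely below 2^bits. */
static bool range_fits_va_bits(uint64_t addr, uint64_t len, unsigned bits)
{
    assert(bits >= 1 && bits <= 63);
    uint64_t limit = UINT64_C(1) << bits;
    /* Written this way so that addr + len cannot overflow. */
    return len <= limit && addr <= limit - len;
}

/*
 * Map anonymous memory and reject the result if it does not fit the
 * caller's address-bit budget. Returns NULL on failure.
 */
static void *map_within_bits(size_t len, unsigned bits)
{
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return NULL;
    if (!range_fits_va_bits((uint64_t)(uintptr_t)p, len, bits)) {
        munmap(p, len);
        return NULL;
    }
    return p;
}
```

A JIT that packs tags above bit 47 could call map_within_bits(len, 47) and treat NULL as "unsupported configuration" instead of silently corrupting pointers.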


* Re: JITs and 52-bit VA
  2016-06-22 19:44         ` Cyrill Gorcunov
@ 2016-06-22 20:46           ` Andy Lutomirski
  2016-06-22 21:38             ` Cyrill Gorcunov
  0 siblings, 1 reply; 15+ messages in thread
From: Andy Lutomirski @ 2016-06-22 20:46 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: Christopher Covington, Maxim Kuvyrkov, Linaro Dev Mailman List,
	Arnd Bergmann, Mark Brown, linux-mm, Dmitry Safonov

On Wed, Jun 22, 2016 at 12:44 PM, Cyrill Gorcunov <gorcunov@gmail.com> wrote:
> On Wed, Jun 22, 2016 at 12:20:13PM -0700, Andy Lutomirski wrote:
>> >>
>> >> As an example, a 32-bit x86 program really could have something mapped
>> >> above the 32-bit boundary.  It just wouldn't be useful, but the kernel
>> >> should still understand that it's *user* memory.
>> >>
>> >> So you'd have PR_SET_MMAP_LIMIT and PR_GET_MMAP_LIMIT or similar instead.
>> >
>> > +1. Also it might be (not sure though, just guessing) suitable to do such
>> > thing via memory cgroup controller, instead of carrying this limit per
>> > each process (or task structure/vma or mm).
>>
>> I think we'll want this per mm.  After all, a high-VA-limit-aware bash
>> should be able to run high-VA-unaware programs without fiddling with
>> cgroups.
>
> Wait. You mean to have some flag in mm struct and consider
> its value on mmap call?

Exactly.

--Andy
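
The idea discussed here, a per-mm limit consulted when mmap places a mapping while TASK_SIZE stays fixed, can be modelled in user space. The following is a toy sketch only: struct toy_mm, TOY_TASK_SIZE and the toy_* functions are hypothetical, the 47-bit split is an assumption taken from the x86 numbers in this thread, and no PR_SET_MMAP_LIMIT prctl is claimed to exist.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fixed user/kernel split; in this model it never changes at runtime. */
#define TOY_TASK_SIZE (UINT64_C(1) << 47)

/* Stand-in for the per-mm state; not the kernel's struct mm_struct. */
struct toy_mm {
    uint64_t mmap_limit;   /* one past the highest address mmap may return */
};

/* Models a hypothetical PR_SET_MMAP_LIMIT: 0 on success, -1 if invalid. */
static int toy_set_mmap_limit(struct toy_mm *mm, uint64_t limit)
{
    assert(mm != NULL);
    if (limit == 0 || limit > TOY_TASK_SIZE)
        return -1;
    mm->mmap_limit = limit;
    return 0;
}

/* Would mmap be allowed to place [addr, addr + len) for this mm? */
static int toy_mmap_allowed(const struct toy_mm *mm, uint64_t addr, uint64_t len)
{
    return len <= mm->mmap_limit && addr <= mm->mmap_limit - len;
}

/* The user-memory test stays tied to TASK_SIZE, not to the per-mm limit. */
static int toy_is_user_addr(uint64_t addr)
{
    return addr < TOY_TASK_SIZE;
}
```

Keeping the two checks separate reflects the design point made in the thread: lowering the mmap limit for one mm does not change which addresses count as user memory.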


* Re: JITs and 52-bit VA
  2016-06-22 20:41             ` Dave Hansen
@ 2016-06-22 21:06               ` Cyrill Gorcunov
  0 siblings, 0 replies; 15+ messages in thread
From: Cyrill Gorcunov @ 2016-06-22 21:06 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Andy Lutomirski, Christopher Covington, Maxim Kuvyrkov,
	Linaro Dev Mailman List, Arnd Bergmann, Mark Brown, linux-mm,
	Dmitry Safonov

On Wed, Jun 22, 2016 at 01:41:40PM -0700, Dave Hansen wrote:
> >>
> >> Yeah, cgroups don't make a lot of sense.
> >>
> >> On x86, the 48-bit virtual address is even hard-coded in the ABI[1].  So
> >> we can't change *any* program's layout without either breaking the ABI
> >> or having it opt in.
> >>
> >> But, we're also lucky to only have one VA layout since day one.
> >>
> >> 1. www.x86-64.org/documentation/abi.pdf - "... Therefore, conforming
> >> processes may only use addresses from 0x00000000 00000000 to 0x00007fff
> >> ffffffff."
> > 
> > Yes, but no one forces you to write conforming programs ;)
> > After all, as long as the hw lets you run with VAs wider than
> > 48 bits it's fine; all side effects of breaking the abi are up to
> > the program author (iirc on x86 up to 52 bits are allowed at the
> > hw level, but I don't have the specs at hand).
> 
> My point was that you can't restrict the vaddr space without breaking
> the ABI because apps expect to be able to use 0x00007fffffffffff.  You
> also can't extend the vaddr space because apps can *also* expect that
> there are no valid vaddrs past 0x00007fffffffffff.
> 
> So, whatever happens here, at least on x86, we can't do anything to the
> vaddr space without it being an opt-in for *each* *app*.

The main problem is not the abi on its own, because the abi covers only
"conforming" programs. If this feature is controlled by some flag in
the mm struct (or cgroup, or whatever), it won't affect regular
programs which follow the abi. The problem is rather that if we allow
these wide addresses right now, might we not clash with future abi
extensions?


* Re: JITs and 52-bit VA
  2016-06-22 20:46           ` Andy Lutomirski
@ 2016-06-22 21:38             ` Cyrill Gorcunov
  0 siblings, 0 replies; 15+ messages in thread
From: Cyrill Gorcunov @ 2016-06-22 21:38 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Christopher Covington, Maxim Kuvyrkov, Linaro Dev Mailman List,
	Arnd Bergmann, Mark Brown, linux-mm, Dmitry Safonov

On Wed, Jun 22, 2016 at 01:46:18PM -0700, Andy Lutomirski wrote:
> >>
> >> I think we'll want this per mm.  After all, a high-VA-limit-aware bash
> >> should be able to run high-VA-unaware programs without fiddling with
> >> cgroups.
> >
> > Wait. You mean to have some flag in mm struct and consider
> > its value on mmap call?
> 
> Exactly.

I see. Thanks for info!


* Re: JITs and 52-bit VA
  2016-06-22 15:13   ` Andy Lutomirski
  2016-06-22 19:18     ` Cyrill Gorcunov
@ 2016-06-23  8:20     ` Dmitry Safonov
  1 sibling, 0 replies; 15+ messages in thread
From: Dmitry Safonov @ 2016-06-23  8:20 UTC (permalink / raw)
  To: Andy Lutomirski, Christopher Covington
  Cc: Maxim Kuvyrkov, Linaro Dev Mailman List, Arnd Bergmann,
	Mark Brown, linux-mm, Cyrill Gorcunov

On 06/22/2016 06:13 PM, Andy Lutomirski wrote:
> On Wed, Jun 22, 2016 at 7:53 AM, Christopher Covington
> <cov@codeaurora.org> wrote:
>> +Andy, Cyrill, Dmitry who have been discussing variable TASK_SIZE on x86
>> on linux-mm
>>
>> http://marc.info/?l=linux-mm&m=146290118818484&w=2
>>
>>
>> I was working on an (AArch64-specific) auxiliary vector entry to export
>> TASK_SIZE to userspace at exec time. The goal was to allow for more
>> elegant, robust, and efficient replacements for the following changes:
>>
>> https://hg.mozilla.org/integration/mozilla-inbound/rev/dfaafbaaa291
>>
>> https://github.com/xemul/criu/commit/c0c0546c31e6df4932669f4740197bb830a24c8d
>>
>> However, based on the above discussion, it appears that some sort of
>> prctl(PR_GET_TASK_SIZE, ...) and prctl(PR_SET_TASK_SIZE, ...) may be
>> preferable for AArch64. (And perhaps other justifications for the new
>> calls influence the x86 decisions.) What do folks think?
>
> I would advocate a slightly different approach:
>
>  - Keep TASK_SIZE either unconditionally matching the hardware or keep
> TASK_SIZE as the actual logical split between user and kernel
> addresses.  Don't let it change at runtime under any circumstances.
> The reason is that there have been plenty of bugs and
> overcomplications that result from letting it vary.  For example, if
> (addr < TASK_SIZE) really ought to be the correct check (assuming
> USER_DS, anyway) for whether dereferencing addr will access user
> memory, at least on architectures with a global address space (which
> is most of them, I think).
>
>  - If needed, introduce a clean concept of the maximum address that
> mmap will return, but don't call it TASK_SIZE.  So, if a user program
> wants to limit itself to less than the full hardware VA space (or less
> than 63 bits, for that matter), it can.
>
> As an example, a 32-bit x86 program really could have something mapped
> above the 32-bit boundary.  It just wouldn't be useful, but the kernel
> should still understand that it's *user* memory.
>
> So you'd have PR_SET_MMAP_LIMIT and PR_GET_MMAP_LIMIT or similar instead.

I tend to agree -- this approach seems clear.
It also complements your idea of unifying TASK_SIZE for x86 and leaving
only the ADDR_LIMIT_32BIT setting via personality().


> Also, before getting *too* excited about this kind of VA limit, keep
> in mind that SPARC has invented this thingy called "Application Data
> Integrity".

Thanks for the link -- what a good thing. I dream it could work not on
per-page basis, heh.


Thread overview: 15+ messages
     [not found] <4A8E6E6D-6CF7-4964-A62E-467AE287D415@linaro.org>
2016-06-22 14:53 ` JITs and 52-bit VA Christopher Covington
2016-06-22 15:13   ` Andy Lutomirski
2016-06-22 19:18     ` Cyrill Gorcunov
2016-06-22 19:20       ` Andy Lutomirski
2016-06-22 19:44         ` Cyrill Gorcunov
2016-06-22 20:46           ` Andy Lutomirski
2016-06-22 21:38             ` Cyrill Gorcunov
2016-06-22 19:56         ` Dave Hansen
2016-06-22 20:10           ` Cyrill Gorcunov
2016-06-22 20:17           ` Cyrill Gorcunov
2016-06-22 20:24             ` Kirill A. Shutemov
2016-06-22 20:41             ` Dave Hansen
2016-06-22 21:06               ` Cyrill Gorcunov
2016-06-23  8:20     ` Dmitry Safonov
2016-06-22 15:40   ` Kirill A. Shutemov
