linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [patch 2/3] MAP_NOZERO - implement sys_brk2()
@ 2007-06-27  2:44 Davide Libenzi
  2007-06-27  3:07 ` Rik van Riel
  2007-06-27  3:48 ` Ulrich Drepper
  0 siblings, 2 replies; 25+ messages in thread
From: Davide Libenzi @ 2007-06-27  2:44 UTC (permalink / raw)
  To: Linux Kernel Mailing List

The following patch implements the sys_brk2() syscall, that nothing is
other than a sys_brk() with an extra "flags" parameter. This can be used
to pass the new MAP_NOZERO bit, to ask the kernel to hand over non-zero
pages if possible.



Signed-off-by: Davide Libenzi <davidel@xmailserver.org>


- Davide


---
 include/linux/mm.h       |    3 ++-
 include/linux/syscalls.h |    1 +
 mm/mmap.c                |   22 ++++++++++++++++++----
 3 files changed, 21 insertions(+), 5 deletions(-)

Index: linux-2.6.mod/include/linux/mm.h
===================================================================
--- linux-2.6.mod.orig/include/linux/mm.h	2007-06-25 19:27:42.000000000 -0700
+++ linux-2.6.mod/include/linux/mm.h	2007-06-26 18:08:28.000000000 -0700
@@ -1099,7 +1099,8 @@
 }
 
 extern int do_munmap(struct mm_struct *, unsigned long, size_t);
-
+extern unsigned long do_brk_flags(unsigned long addr, unsigned long len,
+				  unsigned long vmflags);
 extern unsigned long do_brk(unsigned long, unsigned long);
 
 /* filemap.c */
Index: linux-2.6.mod/include/linux/syscalls.h
===================================================================
--- linux-2.6.mod.orig/include/linux/syscalls.h	2007-06-25 19:14:49.000000000 -0700
+++ linux-2.6.mod/include/linux/syscalls.h	2007-06-26 18:08:28.000000000 -0700
@@ -263,6 +263,7 @@
 asmlinkage long sys_fremovexattr(int fd, char __user *name);
 
 asmlinkage unsigned long sys_brk(unsigned long brk);
+asmlinkage unsigned long sys_brk2(unsigned long brk, unsigned long flags);
 asmlinkage long sys_mprotect(unsigned long start, size_t len,
 				unsigned long prot);
 asmlinkage unsigned long sys_mremap(unsigned long addr,
Index: linux-2.6.mod/mm/mmap.c
===================================================================
--- linux-2.6.mod.orig/mm/mmap.c	2007-06-25 19:14:49.000000000 -0700
+++ linux-2.6.mod/mm/mmap.c	2007-06-26 18:08:28.000000000 -0700
@@ -35,6 +35,8 @@
 #define arch_mmap_check(addr, len, flags)	(0)
 #endif
 
+#define BRK_ALLOWED_FLAGS	(VM_NOZERO)
+
 static void unmap_region(struct mm_struct *mm,
 		struct vm_area_struct *vma, struct vm_area_struct *prev,
 		unsigned long start, unsigned long end);
@@ -234,7 +236,7 @@
 	return next;
 }
 
-asmlinkage unsigned long sys_brk(unsigned long brk)
+asmlinkage unsigned long sys_brk2(unsigned long brk, unsigned long flags)
 {
 	unsigned long rlim, retval;
 	unsigned long newbrk, oldbrk;
@@ -271,8 +273,10 @@
 	if (find_vma_intersection(mm, oldbrk, newbrk+PAGE_SIZE))
 		goto out;
 
+	flags = BRK_ALLOWED_FLAGS & calc_vm_flag_bits(flags);
+
 	/* Ok, looks good - let it rip. */
-	if (do_brk(oldbrk, newbrk-oldbrk) != oldbrk)
+	if (do_brk_flags(oldbrk, newbrk-oldbrk, flags) != oldbrk)
 		goto out;
 set_brk:
 	mm->brk = brk;
@@ -282,6 +286,11 @@
 	return retval;
 }
 
+asmlinkage unsigned long sys_brk(unsigned long brk)
+{
+	return sys_brk2(brk, 0);
+}
+
 #ifdef DEBUG_MM_RB
 static int browse_rb(struct rb_root *root)
 {
@@ -1863,7 +1872,8 @@
  *  anonymous maps.  eventually we may be able to do some
  *  brk-specific accounting here.
  */
-unsigned long do_brk(unsigned long addr, unsigned long len)
+unsigned long do_brk_flags(unsigned long addr, unsigned long len,
+			   unsigned long vmflags)
 {
 	struct mm_struct * mm = current->mm;
 	struct vm_area_struct * vma, * prev;
@@ -1882,7 +1892,7 @@
 	if (is_hugepage_only_range(mm, addr, len))
 		return -EINVAL;
 
-	flags = VM_DATA_DEFAULT_FLAGS | VM_ACCOUNT | mm->def_flags;
+	flags = VM_DATA_DEFAULT_FLAGS | VM_ACCOUNT | mm->def_flags | vmflags;
 
 	error = arch_mmap_check(addr, len, flags);
 	if (error)
@@ -1959,6 +1969,10 @@
 	return addr;
 }
 
+unsigned long do_brk(unsigned long addr, unsigned long len)
+{
+	return do_brk_flags(addr, len, 0);
+}
 EXPORT_SYMBOL(do_brk);
 
 /* Release all mmaps. */


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [patch 2/3] MAP_NOZERO - implement sys_brk2()
  2007-06-27  2:44 [patch 2/3] MAP_NOZERO - implement sys_brk2() Davide Libenzi
@ 2007-06-27  3:07 ` Rik van Riel
  2007-06-27  3:33   ` Davide Libenzi
  2007-06-27  3:45   ` Ulrich Drepper
  2007-06-27  3:48 ` Ulrich Drepper
  1 sibling, 2 replies; 25+ messages in thread
From: Rik van Riel @ 2007-06-27  3:07 UTC (permalink / raw)
  To: Davide Libenzi; +Cc: Linux Kernel Mailing List

Davide Libenzi wrote:
> The following patch implements the sys_brk2() syscall, that nothing is
> other than a sys_brk() with an extra "flags" parameter. This can be used
> to pass the new MAP_NOZERO bit, to ask the kernel to hand over non-zero
> pages if possible.

Since programs can get back free()d memory after a malloc(),
with the old contents of the memory intact, surely your
MAP_NONZERO behavior could be the default for programs that
can get away with it?

Maybe we could use some magic ELF header, similar to the
way non-executable stack is handled?

-- 
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [patch 2/3] MAP_NOZERO - implement sys_brk2()
  2007-06-27  3:07 ` Rik van Riel
@ 2007-06-27  3:33   ` Davide Libenzi
  2007-06-27  3:45   ` Ulrich Drepper
  1 sibling, 0 replies; 25+ messages in thread
From: Davide Libenzi @ 2007-06-27  3:33 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Linux Kernel Mailing List

On Tue, 26 Jun 2007, Rik van Riel wrote:

> Davide Libenzi wrote:
> > The following patch implements the sys_brk2() syscall, that nothing is
> > other than a sys_brk() with an extra "flags" parameter. This can be used
> > to pass the new MAP_NOZERO bit, to ask the kernel to hand over non-zero
> > pages if possible.
> 
> Since programs can get back free()d memory after a malloc(),
> with the old contents of the memory intact, surely your
> MAP_NONZERO behavior could be the default for programs that
> can get away with it?
> 
> Maybe we could use some magic ELF header, similar to the
> way non-executable stack is handled?

Well, the quick&ugly glibc patch simply uses an environment variable, just 
because I wanted to bench the kernel build with using the same glibc+gcc.
Yes, it can be the default behaviour for the allocator. The patch handles 
calloc() correctly, by forcibly zeroing memory in such calls.
But other software must be taught too, to use MAP_NOZERO when they do not 
need zeroed memory. I did that for the gcc garbage collector.



- Davide



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [patch 2/3] MAP_NOZERO - implement sys_brk2()
  2007-06-27  3:07 ` Rik van Riel
  2007-06-27  3:33   ` Davide Libenzi
@ 2007-06-27  3:45   ` Ulrich Drepper
  2007-06-27  4:11     ` Rik van Riel
  1 sibling, 1 reply; 25+ messages in thread
From: Ulrich Drepper @ 2007-06-27  3:45 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Davide Libenzi, Linux Kernel Mailing List

On 6/26/07, Rik van Riel <riel@redhat.com> wrote:
> Since programs can get back free()d memory after a malloc(),
> with the old contents of the memory intact, surely your
> MAP_NONZERO behavior could be the default for programs that
> can get away with it?
>
> Maybe we could use some magic ELF header, similar to the
> way non-executable stack is handled?

No.  This is an implementation detail of the libc version.  The malloc
as compiled today is expecting brk-ed memory to be zeroed.  This
default can of course be changed (it's a simple define) but you cannot
make this the default behavior for brk.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [patch 2/3] MAP_NOZERO - implement sys_brk2()
  2007-06-27  2:44 [patch 2/3] MAP_NOZERO - implement sys_brk2() Davide Libenzi
  2007-06-27  3:07 ` Rik van Riel
@ 2007-06-27  3:48 ` Ulrich Drepper
  2007-06-27  3:55   ` Davide Libenzi
  1 sibling, 1 reply; 25+ messages in thread
From: Ulrich Drepper @ 2007-06-27  3:48 UTC (permalink / raw)
  To: Davide Libenzi; +Cc: Linux Kernel Mailing List

On 6/26/07, Davide Libenzi <davidel@xmailserver.org> wrote:
> The following patch implements the sys_brk2() syscall, that nothing is
> other than a sys_brk() with an extra "flags" parameter.

Shouldn't we wait for Linus' sys_indirect to arrive and make this
another syscall which takes advantage of it?

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [patch 2/3] MAP_NOZERO - implement sys_brk2()
  2007-06-27  3:48 ` Ulrich Drepper
@ 2007-06-27  3:55   ` Davide Libenzi
  2007-06-27  5:02     ` Ulrich Drepper
  0 siblings, 1 reply; 25+ messages in thread
From: Davide Libenzi @ 2007-06-27  3:55 UTC (permalink / raw)
  To: Ulrich Drepper; +Cc: Linux Kernel Mailing List

On Tue, 26 Jun 2007, Ulrich Drepper wrote:

> On 6/26/07, Davide Libenzi <davidel@xmailserver.org> wrote:
> > The following patch implements the sys_brk2() syscall, that nothing is
> > other than a sys_brk() with an extra "flags" parameter.
> 
> Shouldn't we wait for Linus' sys_indirect to arrive and make this
> another syscall which takes advantage of it?

I acutally have the code for it, but I never posted it since it did not 
receive a too warm review (and the only user was the fdmap thingy).
OTOH glibc could implement __morecore using mmap(MAP_NOZERO), and hence 
brk2() would not be needed, no?



- Davide



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [patch 2/3] MAP_NOZERO - implement sys_brk2()
  2007-06-27  3:45   ` Ulrich Drepper
@ 2007-06-27  4:11     ` Rik van Riel
  2007-06-27  5:04       ` Ulrich Drepper
  0 siblings, 1 reply; 25+ messages in thread
From: Rik van Riel @ 2007-06-27  4:11 UTC (permalink / raw)
  To: Ulrich Drepper; +Cc: Davide Libenzi, Linux Kernel Mailing List

Ulrich Drepper wrote:
> On 6/26/07, Rik van Riel <riel@redhat.com> wrote:
>> Since programs can get back free()d memory after a malloc(),
>> with the old contents of the memory intact, surely your
>> MAP_NONZERO behavior could be the default for programs that
>> can get away with it?
>>
>> Maybe we could use some magic ELF header, similar to the
>> way non-executable stack is handled?
> 
> No.  This is an implementation detail of the libc version.  The malloc
> as compiled today is expecting brk-ed memory to be zeroed.  This
> default can of course be changed (it's a simple define) but you cannot
> make this the default behavior for brk.

After going through the first malloc()/free() cycle, surely
the memory will no longer be zeroed on the second malloc() ?

What makes the first brk malloc so special?

-- 
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [patch 2/3] MAP_NOZERO - implement sys_brk2()
  2007-06-27  3:55   ` Davide Libenzi
@ 2007-06-27  5:02     ` Ulrich Drepper
  2007-06-27 12:32       ` Hugh Dickins
  2007-06-27 16:05       ` Davide Libenzi
  0 siblings, 2 replies; 25+ messages in thread
From: Ulrich Drepper @ 2007-06-27  5:02 UTC (permalink / raw)
  To: Davide Libenzi; +Cc: Linux Kernel Mailing List

On 6/26/07, Davide Libenzi <davidel@xmailserver.org> wrote:
> I acutally have the code for it, but I never posted it since it did not
> receive a too warm review (and the only user was the fdmap thingy).

Only user of sys_indirect?  There will be quite a few right away.
Every syscall that returns a file descriptor needs O_CLOEXEC support
(socket, pipe, epoll_create, ...)


> OTOH glibc could implement __morecore using mmap(MAP_NOZERO), and hence
> brk2() would not be needed, no?

No.  mmap calls create individual VMAs which gets expensive.  There
are also some hardware drivers which get more expensive the more VMAs
there are.  I want to go away as much as possible from mmap for
malloc.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [patch 2/3] MAP_NOZERO - implement sys_brk2()
  2007-06-27  4:11     ` Rik van Riel
@ 2007-06-27  5:04       ` Ulrich Drepper
  0 siblings, 0 replies; 25+ messages in thread
From: Ulrich Drepper @ 2007-06-27  5:04 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Davide Libenzi, Linux Kernel Mailing List

On 6/26/07, Rik van Riel <riel@redhat.com> wrote:
> After going through the first malloc()/free() cycle, surely
> the memory will no longer be zeroed on the second malloc() ?

If returned to the system, sure.


> What makes the first brk malloc so special?

If the memory is zeroed it needs not be initialized by malloc.  No
calloc zeroing, no pointer clearing.

Anyway, it's irrelevant what the benefits are, the fact is current
code depends on brk to zero the memory and you'd break the ABI if
you'd change it.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [patch 2/3] MAP_NOZERO - implement sys_brk2()
  2007-06-27  5:02     ` Ulrich Drepper
@ 2007-06-27 12:32       ` Hugh Dickins
  2007-06-27 15:59         ` Davide Libenzi
  2007-06-27 17:01         ` Ulrich Drepper
  2007-06-27 16:05       ` Davide Libenzi
  1 sibling, 2 replies; 25+ messages in thread
From: Hugh Dickins @ 2007-06-27 12:32 UTC (permalink / raw)
  To: Ulrich Drepper; +Cc: Davide Libenzi, Linux Kernel Mailing List

On Tue, 26 Jun 2007, Ulrich Drepper wrote:
> On 6/26/07, Davide Libenzi <davidel@xmailserver.org> wrote:
> 
> > OTOH glibc could implement __morecore using mmap(MAP_NOZERO), and hence
> > brk2() would not be needed, no?
> 
> No.  mmap calls create individual VMAs which gets expensive.  There
> are also some hardware drivers which get more expensive the more VMAs
> there are.  I want to go away as much as possible from mmap for
> malloc.

Not so: if an mmap can be done by extending either adjacent vma (prot
and flags and file and offset all match up), that's what's done and no
separate vma is created.  (And adjacent vmas get merged when mprotect
removes the difference in protection.)

I don't think there's any such reason to prefer brk to mmap.  But you
may have encountered something which we in the kernel are thinking of
as an insignificant corner case, which is actually breaking things up
badly in practice (I recall the kernel's internal VM_ACCOUNT flag,
relating to non-overcommit accounting, which might get turned on at
any time, sometimes preventing a vma merge you'd otherwise expect).
Please let me know if you've a test case which shows more vmas than
expected.

Hugh

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [patch 2/3] MAP_NOZERO - implement sys_brk2()
  2007-06-27 12:32       ` Hugh Dickins
@ 2007-06-27 15:59         ` Davide Libenzi
  2007-06-27 17:01         ` Ulrich Drepper
  1 sibling, 0 replies; 25+ messages in thread
From: Davide Libenzi @ 2007-06-27 15:59 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Ulrich Drepper, Linux Kernel Mailing List

On Wed, 27 Jun 2007, Hugh Dickins wrote:

> On Tue, 26 Jun 2007, Ulrich Drepper wrote:
> > On 6/26/07, Davide Libenzi <davidel@xmailserver.org> wrote:
> > 
> > > OTOH glibc could implement __morecore using mmap(MAP_NOZERO), and hence
> > > brk2() would not be needed, no?
> > 
> > No.  mmap calls create individual VMAs which gets expensive.  There
> > are also some hardware drivers which get more expensive the more VMAs
> > there are.  I want to go away as much as possible from mmap for
> > malloc.
> 
> Not so: if an mmap can be done by extending either adjacent vma (prot
> and flags and file and offset all match up), that's what's done and no
> separate vma is created.  (And adjacent vmas get merged when mprotect
> removes the difference in protection.)
> 
> I don't think there's any such reason to prefer brk to mmap.  But you
> may have encountered something which we in the kernel are thinking of
> as an insignificant corner case, which is actually breaking things up
> badly in practice (I recall the kernel's internal VM_ACCOUNT flag,
> relating to non-overcommit accounting, which might get turned on at
> any time, sometimes preventing a vma merge you'd otherwise expect).
> Please let me know if you've a test case which shows more vmas than
> expected.

The only way I can see vma fragmentation happen in that case, is if 
userspace uses a mixture of mmaps and mallocs, and flags+prots of the two 
does not match. The glibc allocator seems to support it just fine. There's 
a macro where you specify if the heaps are contiguous or not.


- Davide



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [patch 2/3] MAP_NOZERO - implement sys_brk2()
  2007-06-27  5:02     ` Ulrich Drepper
  2007-06-27 12:32       ` Hugh Dickins
@ 2007-06-27 16:05       ` Davide Libenzi
  1 sibling, 0 replies; 25+ messages in thread
From: Davide Libenzi @ 2007-06-27 16:05 UTC (permalink / raw)
  To: Ulrich Drepper; +Cc: Linux Kernel Mailing List

On Tue, 26 Jun 2007, Ulrich Drepper wrote:

> On 6/26/07, Davide Libenzi <davidel@xmailserver.org> wrote:
> > I acutally have the code for it, but I never posted it since it did not
> > receive a too warm review (and the only user was the fdmap thingy).
> 
> Only user of sys_indirect?  There will be quite a few right away.
> Every syscall that returns a file descriptor needs O_CLOEXEC support
> (socket, pipe, epoll_create, ...)

Ok, I'll try to add flags support to those "fd generators" over the code I 
have. That is only i386 nd x86-64 ATM (sys_indirect requires a thin asm 
wrapper - gcc asm will not do it for me).



- Davide



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [patch 2/3] MAP_NOZERO - implement sys_brk2()
  2007-06-27 12:32       ` Hugh Dickins
  2007-06-27 15:59         ` Davide Libenzi
@ 2007-06-27 17:01         ` Ulrich Drepper
  2007-06-27 17:43           ` Hugh Dickins
  2007-06-27 19:00           ` Rik van Riel
  1 sibling, 2 replies; 25+ messages in thread
From: Ulrich Drepper @ 2007-06-27 17:01 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Davide Libenzi, Linux Kernel Mailing List

On 6/27/07, Hugh Dickins <hugh@veritas.com> wrote:
> Not so: if an mmap can be done by extending either adjacent vma (prot
> and flags and file and offset all match up), that's what's done and no
> separate vma is created.  (And adjacent vmas get merged when mprotect
> removes the difference in protection.)

mmap return values are randomized.  If they would be mergable
something would be wrong.


> I don't think there's any such reason to prefer brk to mmap.

Talk to the Quadrics people.  Some of their interconnect adapters
incur significant costs with separate VMAs.


> Please let me know if you've a test case which shows more vmas than
> expected.

Not related to this, but we already have "too many" VMAs for some
definition of "too many".  Since we split VMA when the protection
changes each thread stack consists at least of two VMA (three for
ia64).  There was an approach at some point to push the access flags
down and allow VMAs to stretch further.  Why was this deemed
unsuitable?  It could have quite a bit with situations with many
threads (Java).  As I've told back when this came up, we had one
customer where the VMA lookup actually showed up in profiles.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [patch 2/3] MAP_NOZERO - implement sys_brk2()
  2007-06-27 17:01         ` Ulrich Drepper
@ 2007-06-27 17:43           ` Hugh Dickins
  2007-06-27 18:13             ` Davide Libenzi
  2007-06-27 18:52             ` Ulrich Drepper
  2007-06-27 19:00           ` Rik van Riel
  1 sibling, 2 replies; 25+ messages in thread
From: Hugh Dickins @ 2007-06-27 17:43 UTC (permalink / raw)
  To: Ulrich Drepper; +Cc: Davide Libenzi, blaisorblade, Linux Kernel Mailing List

On Wed, 27 Jun 2007, Ulrich Drepper wrote:
> On 6/27/07, Hugh Dickins <hugh@veritas.com> wrote:
> > Not so: if an mmap can be done by extending either adjacent vma (prot
> > and flags and file and offset all match up), that's what's done and no
> > separate vma is created.  (And adjacent vmas get merged when mprotect
> > removes the difference in protection.)
> 
> mmap return values are randomized.  If they would be mergable
> something would be wrong.

The absolute virtual addresses are randomized, yes; but do a sequence
of mmaps and they come back adjacent to each other, and so mergable.
Or does your distro include a kernel patch to randomize them relative
to each other?

> > I don't think there's any such reason to prefer brk to mmap.
> 
> Talk to the Quadrics people.  Some of their interconnect adapters
> incur significant costs with separate VMAs.
> 
> > Please let me know if you've a test case which shows more vmas than
> > expected.
> 
> Not related to this, but we already have "too many" VMAs for some
> definition of "too many".  Since we split VMA when the protection
> changes each thread stack consists at least of two VMA (three for
> ia64).  There was an approach at some point to push the access flags
> down and allow VMAs to stretch further.  Why was this deemed
> unsuitable?

Blaisorblade's patch, yes.  I disliked it at the time I reviewed
it a couple of years back: it made a system call many of us regret
even nastier, and relied on tiresome per-arch pagetable encodings.
It seemed likely to cause us trouble down the line - though
undoubtedly it provides real benefits too.

I never found time to review it when it came around again recently,
thought Blaisorblade might prefer me to turn a blind eye to it ;)
and let others opine instead.  But I think noone else found time to
review it either.  I expect it's rather nicer now, but doubt it's
actually attractive.

It _might_ turn out to be more attractive, not to rely on that
peculiar sys_remap_file_pages, but make all our vmas independent
of protections, and hang differently protected ranges off them.
Maybe.

> It could have quite a bit with situations with many
> threads (Java).  As I've told back when this came up, we had one
> customer where the VMA lookup actually showed up in profiles.

In honesty, I should add that I dislike and distrust Davide's
MAP_NOZERO very much indeed!  Would much rather leave my cpus
spending a little time in clear_page().  A uid in struct page
(though I'm sure we could find somewhere to tuck it away) -
the horror, the horror!  But I've so far failed to find a killer
argument against it, and am hoping for someone else to do so.

Curmudgeonly,
Hugh

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [patch 2/3] MAP_NOZERO - implement sys_brk2()
  2007-06-27 17:43           ` Hugh Dickins
@ 2007-06-27 18:13             ` Davide Libenzi
  2007-06-27 18:32               ` Hugh Dickins
  2007-06-27 18:52             ` Ulrich Drepper
  1 sibling, 1 reply; 25+ messages in thread
From: Davide Libenzi @ 2007-06-27 18:13 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Ulrich Drepper, blaisorblade, Linux Kernel Mailing List

On Wed, 27 Jun 2007, Hugh Dickins wrote:

> In honesty, I should add that I dislike and distrust Davide's
> MAP_NOZERO very much indeed!  Would much rather leave my cpus
> spending a little time in clear_page().  A uid in struct page
> (though I'm sure we could find somewhere to tuck it away) -
> the horror, the horror!  But I've so far failed to find a killer
> argument against it, and am hoping for someone else to do so.

Little time? Please, do not trust me. Start oprofile and run a kernel 
build. Look, I'm not even talking about som micro benchmark explicitly 
built to exploit the thing. A kernel build.
You will find clear_page to be the *1st* kernel entry after cc1 and as.
That is bad for two reasons. The time it spends in there, and the cache it 
blows.



- Davide



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [patch 2/3] MAP_NOZERO - implement sys_brk2()
  2007-06-27 18:13             ` Davide Libenzi
@ 2007-06-27 18:32               ` Hugh Dickins
  2007-06-27 18:45                 ` Davide Libenzi
  0 siblings, 1 reply; 25+ messages in thread
From: Hugh Dickins @ 2007-06-27 18:32 UTC (permalink / raw)
  To: Davide Libenzi; +Cc: Ulrich Drepper, blaisorblade, Linux Kernel Mailing List

On Wed, 27 Jun 2007, Davide Libenzi wrote:
> On Wed, 27 Jun 2007, Hugh Dickins wrote:
> 
> > In honesty, I should add that I dislike and distrust Davide's
> > MAP_NOZERO very much indeed!  Would much rather leave my cpus
> > spending a little time in clear_page().  A uid in struct page
> > (though I'm sure we could find somewhere to tuck it away) -
> > the horror, the horror!  But I've so far failed to find a killer
> > argument against it, and am hoping for someone else to do so.
> 
> Little time? Please, do not trust me. Start oprofile and run a kernel 
> build. Look, I'm not even talking about som micro benchmark explicitly 
> built to exploit the thing. A kernel build.
> You will find clear_page to be the *1st* kernel entry after cc1 and as.
> That is bad for two reasons. The time it spends in there, and the cache it 
> blows.

I don't doubt that it shows real benefits; but dangerously cutting
corners usually shows benefits too.  Relying on a uid at this level
feels very wrong to me - but as I said, I've not found a killer
argument against it.

And we both know that clear_page features so high in part because
it's bringing cachelines in from the cold, which are about to be
accessed again by userspace; so it's often not so bad as it appears.

Though I probably wouldn't be citing that argument if we were
talking about offloading clear_page to another engine.

Hugh

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [patch 2/3] MAP_NOZERO - implement sys_brk2()
  2007-06-27 18:32               ` Hugh Dickins
@ 2007-06-27 18:45                 ` Davide Libenzi
  2007-06-27 22:11                   ` Nicholas Miell
  0 siblings, 1 reply; 25+ messages in thread
From: Davide Libenzi @ 2007-06-27 18:45 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Ulrich Drepper, blaisorblade, Linux Kernel Mailing List

On Wed, 27 Jun 2007, Hugh Dickins wrote:

> On Wed, 27 Jun 2007, Davide Libenzi wrote:
> > On Wed, 27 Jun 2007, Hugh Dickins wrote:
> > 
> > > In honesty, I should add that I dislike and distrust Davide's
> > > MAP_NOZERO very much indeed!  Would much rather leave my cpus
> > > spending a little time in clear_page().  A uid in struct page
> > > (though I'm sure we could find somewhere to tuck it away) -
> > > the horror, the horror!  But I've so far failed to find a killer
> > > argument against it, and am hoping for someone else to do so.
> > 
> > Little time? Please, do not trust me. Start oprofile and run a kernel 
> > build. Look, I'm not even talking about som micro benchmark explicitly 
> > built to exploit the thing. A kernel build.
> > You will find clear_page to be the *1st* kernel entry after cc1 and as.
> > That is bad for two reasons. The time it spends in there, and the cache it 
> > blows.
> 
> I don't doubt that it shows real benefits; but dangerously cutting
> corners usually shows benefits too.  Relying on a uid at this level
> feels very wrong to me - but as I said, I've not found a killer
> argument against it.

The reason why I posted is exactly so other ppl can look at it and find 
possible flaws in the way pages and retired. If an effective UID was able 
to see (or it generated) the data on that page, it should be able to get 
that page back uncleared (when VM_NOZERO is set).
>From a performance POV, a 2-3% boost on a non-micro-bench test like a 
kernel build is not exaclty peanuts. And for more heavy malloc/anon-mmap 
appliations, the boost goes up to 10-15%. That is not exactly what I 
call "little time" ;)



- Davide



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [patch 2/3] MAP_NOZERO - implement sys_brk2()
  2007-06-27 17:43           ` Hugh Dickins
  2007-06-27 18:13             ` Davide Libenzi
@ 2007-06-27 18:52             ` Ulrich Drepper
  2007-06-27 19:32               ` Hugh Dickins
  1 sibling, 1 reply; 25+ messages in thread
From: Ulrich Drepper @ 2007-06-27 18:52 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Davide Libenzi, blaisorblade, Linux Kernel Mailing List

On 6/27/07, Hugh Dickins <hugh@veritas.com> wrote:
> The absolute virtual addresses are randomized, yes; but do a sequence
> of mmaps and they come back adjacent to each other, and so mergable.
> Or does your distro include a kernel patch to randomize them relative
> to each other?

Each individual mmap is supposed to be randomized, yes.  If this
doesn't happen in one of our kernels right now something broken.  You
wouldn't have effective ASLR if all relative address remain the same.


> It _might_ turn out to be more attractive, not to rely on that
> peculiar sys_remap_file_pages, but make all our vmas independent
> of protections, and hang differently protected ranges off them.
> Maybe.

That's what I think is done or at least should be done.

If you want to be shocked, look at some really big Java apps.
Hundreds or thousands of threads, lots of mmap allocation.  We might
have 10,000 VMAs.  Searching becomes a problem and if the protection
information be stored somewhere else and you safe the VMA data
structures there is even some memory saving possible.

I definitely thing that this is an area which warrants looking at.  We
haven't yet seen the worst of VMA usage.  The move to 64-bit is only
just beginning and wait what people think they can do with 48+ bits of
address space.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [patch 2/3] MAP_NOZERO - implement sys_brk2()
  2007-06-27 17:01         ` Ulrich Drepper
  2007-06-27 17:43           ` Hugh Dickins
@ 2007-06-27 19:00           ` Rik van Riel
  2007-06-27 19:22             ` Davide Libenzi
  1 sibling, 1 reply; 25+ messages in thread
From: Rik van Riel @ 2007-06-27 19:00 UTC (permalink / raw)
  To: Ulrich Drepper; +Cc: Hugh Dickins, Davide Libenzi, Linux Kernel Mailing List

Ulrich Drepper wrote:
> On 6/27/07, Hugh Dickins <hugh@veritas.com> wrote:
>> Not so: if an mmap can be done by extending either adjacent vma (prot
>> and flags and file and offset all match up), that's what's done and no
>> separate vma is created.  (And adjacent vmas get merged when mprotect
>> removes the difference in protection.)
> 
> mmap return values are randomized.  If they would be mergable
> something would be wrong.

You can easily pass a hint address to mmap(), causing it to
extend a malloc arena instead of creating a new VMA.

-- 
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [patch 2/3] MAP_NOZERO - implement sys_brk2()
  2007-06-27 19:00           ` Rik van Riel
@ 2007-06-27 19:22             ` Davide Libenzi
  0 siblings, 0 replies; 25+ messages in thread
From: Davide Libenzi @ 2007-06-27 19:22 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Ulrich Drepper, Hugh Dickins, Linux Kernel Mailing List

On Wed, 27 Jun 2007, Rik van Riel wrote:

> Ulrich Drepper wrote:
> > On 6/27/07, Hugh Dickins <hugh@veritas.com> wrote:
> > > Not so: if an mmap can be done by extending either adjacent vma (prot
> > > and flags and file and offset all match up), that's what's done and no
> > > separate vma is created.  (And adjacent vmas get merged when mprotect
> > > removes the difference in protection.)
> > 
> > mmap return values are randomized.  If they would be mergable
> > something would be wrong.
> 
> You can easily pass a hint address to mmap(), causing it to
> extend a malloc arena instead of creating a new VMA.

You mean something like the F_DUPFD, where you pass an fd and that serves 
as from-there-up hint?



- Davide



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [patch 2/3] MAP_NOZERO - implement sys_brk2()
  2007-06-27 18:52             ` Ulrich Drepper
@ 2007-06-27 19:32               ` Hugh Dickins
  0 siblings, 0 replies; 25+ messages in thread
From: Hugh Dickins @ 2007-06-27 19:32 UTC (permalink / raw)
  To: Ulrich Drepper; +Cc: Davide Libenzi, blaisorblade, Linux Kernel Mailing List

On Wed, 27 Jun 2007, Ulrich Drepper wrote:
> On 6/27/07, Hugh Dickins <hugh@veritas.com> wrote:
> > The absolute virtual addresses are randomized, yes; but do a sequence
> > of mmaps and they come back adjacent to each other, and so mergable.
> > Or does your distro include a kernel patch to randomize them relative
> > to each other?
> 
> Each individual mmap is supposed to be randomized, yes.  If this
> doesn't happen in one of our kernels right now something broken.  You
> wouldn't have effective ASLR if all relative address remain the same.

Ah, I guess it's an exec_shield thing: vanilla kernel gives us
adjacent mmaps which are mergable.  As Rik suggest, to make your
mmaps adjacent, use the mmap(addr,...) hint.  (But please don't
then complain that they're not relatively randomized ;)

> > It _might_ turn out to be more attractive, not to rely on that
> > peculiar sys_remap_file_pages, but make all our vmas independent
> > of protections, and hang differently protected ranges off them.
> > Maybe.
> 
> That's what I think is done or at least should be done.

Noted.

Hugh

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [patch 2/3] MAP_NOZERO - implement sys_brk2()
  2007-06-27 18:45                 ` Davide Libenzi
@ 2007-06-27 22:11                   ` Nicholas Miell
  2007-06-28  0:17                     ` Davide Libenzi
  2007-06-30  7:52                     ` Andrew Morton
  0 siblings, 2 replies; 25+ messages in thread
From: Nicholas Miell @ 2007-06-27 22:11 UTC (permalink / raw)
  To: Davide Libenzi
  Cc: Hugh Dickins, Ulrich Drepper, blaisorblade, Linux Kernel Mailing List

On Wed, 2007-06-27 at 11:45 -0700, Davide Libenzi wrote:
> On Wed, 27 Jun 2007, Hugh Dickins wrote:
> 
> > On Wed, 27 Jun 2007, Davide Libenzi wrote:
> > > On Wed, 27 Jun 2007, Hugh Dickins wrote:
> > > 
> > > > In honesty, I should add that I dislike and distrust Davide's
> > > > MAP_NOZERO very much indeed!  Would much rather leave my cpus
> > > > spending a little time in clear_page().  A uid in struct page
> > > > (though I'm sure we could find somewhere to tuck it away) -
> > > > the horror, the horror!  But I've so far failed to find a killer
> > > > argument against it, and am hoping for someone else to do so.
> > > 
> > > Little time? Please, do not trust me. Start oprofile and run a kernel 
> > > build. Look, I'm not even talking about som micro benchmark explicitly 
> > > built to exploit the thing. A kernel build.
> > > You will find clear_page to be the *1st* kernel entry after cc1 and as.
> > > That is bad for two reasons. The time it spends in there, and the cache it 
> > > blows.
> > 
> > I don't doubt that it shows real benefits; but dangerously cutting
> > corners usually shows benefits too.  Relying on a uid at this level
> > feels very wrong to me - but as I said, I've not found a killer
> > argument against it.
> 
> The reason why I posted is exactly so other ppl can look at it and find 
> possible flaws in the way pages and retired. If an effective UID was able 
> to see (or it generated) the data on that page, it should be able to get 
> that page back uncleared (when VM_NOZERO is set).
> >From a performance POV, a 2-3% boost on a non-micro-bench test like a 
> kernel build is not exaclty peanuts. And for more heavy malloc/anon-mmap 
> appliations, the boost goes up to 10-15%. That is not exactly what I 
> call "little time" ;)
>
> - Davide
> 

I don't think the security issues with this will ever make it
worthwhile.

Consider:

1) euid is not sufficient, you need to store away arbitrary LSM
information and call LSM hooks to decide security equivalence. The same
applies to VServer or whatever other container system you use.

2) Two processes, A and B, are in separate VFS namespaces but have
equivalent security identity according to LSM. Process A reads data from
file F which is not visible in process's B's namespace. You have to
prevent process B from ever getting a page that once contained data from
file F.

3) mlock() is often used by programs like GPG to prevent decrypted
secret keys from ever getting swapped out. You need to zero all
once-mlocked pages before they get reused to prevent that page from
getting swapped to disk or application bugs from leaking the key.

-- 
Nicholas Miell <nmiell@comcast.net>


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [patch 2/3] MAP_NOZERO - implement sys_brk2()
  2007-06-27 22:11                   ` Nicholas Miell
@ 2007-06-28  0:17                     ` Davide Libenzi
  2007-06-28  2:58                       ` Davide Libenzi
  2007-06-30  7:52                     ` Andrew Morton
  1 sibling, 1 reply; 25+ messages in thread
From: Davide Libenzi @ 2007-06-28  0:17 UTC (permalink / raw)
  To: Nicholas Miell
  Cc: Hugh Dickins, Ulrich Drepper, blaisorblade, Linux Kernel Mailing List

On Wed, 27 Jun 2007, Nicholas Miell wrote:

> 1) euid is not sufficient, you need to store away arbitrary LSM
> information and call LSM hooks to decide security equivalence. The same
> applies to VServer or whatever other container system you use.

The EUID that is used now, can easily be any cookie. It can be an LSM 
cookie (if LSM is active in the system). We don't do complex checks, like 
group permission & Co.  We assume that if a UID-cookie had such data 
available (or it generated it), it can have it back uncleared.



> 2) Two processes, A and B, are in separate VFS namespaces but have
> equivalent security identity according to LSM. Process A reads data from
> file F which is not visible in process's B's namespace. You have to
> prevent process B from ever getting a page that once contained data from
> file F.

They have the *same* security identity. It means that at any time such 
security identity can access resources on both VFS (if it is allowed to 
access such resources - according to security rules in place, LSM or not).
Data is either generated by the security identity, or it is faulted in 
(and it means that the security identity had the GO from the security 
provisioning to access such resource).



> 3) mlock() is often used by programs like GPG to prevent decrypted
> secret keys from ever getting swapped out. You need to zero all
> once-mlocked pages before they get reused to prevent that page from
> getting swapped to disk or application bugs from leaking the key.

GPG and other security software do also memclear on top of mlock, to 
prevent such memory staying alive at all. Just for example, you don't want 
in any case that after an munlock you app core and data goes down.



- Davide



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [patch 2/3] MAP_NOZERO - implement sys_brk2()
  2007-06-28  0:17                     ` Davide Libenzi
@ 2007-06-28  2:58                       ` Davide Libenzi
  0 siblings, 0 replies; 25+ messages in thread
From: Davide Libenzi @ 2007-06-28  2:58 UTC (permalink / raw)
  To: Nicholas Miell
  Cc: Hugh Dickins, Ulrich Drepper, blaisorblade, Linux Kernel Mailing List

On Wed, 27 Jun 2007, Davide Libenzi wrote:

> On Wed, 27 Jun 2007, Nicholas Miell wrote:
> 
> > 1) euid is not sufficient, you need to store away arbitrary LSM
> > information and call LSM hooks to decide security equivalence. The same
> > applies to VServer or whatever other container system you use.
> 
> The EUID that is used now, can easily be any cookie. It can be an LSM 
> cookie (if LSM is active in the system). We don't do complex checks, like 
> group permission & Co.  We assume that if a UID-cookie had such data 
> available (or it generated it), it can have it back uncleared.

(looking through the LSM/SeLinux jungle)
Also, LSM/SeLinux could disable completely the feature, at request. Just 
assign a known-to-be-invalid UID to mm->owner_uid (passign through 
an(other) hook), and pages will never be recycled.



- Davide



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [patch 2/3] MAP_NOZERO - implement sys_brk2()
  2007-06-27 22:11                   ` Nicholas Miell
  2007-06-28  0:17                     ` Davide Libenzi
@ 2007-06-30  7:52                     ` Andrew Morton
  1 sibling, 0 replies; 25+ messages in thread
From: Andrew Morton @ 2007-06-30  7:52 UTC (permalink / raw)
  To: Nicholas Miell
  Cc: Davide Libenzi, Hugh Dickins, Ulrich Drepper, blaisorblade,
	Linux Kernel Mailing List

On Wed, 27 Jun 2007 15:11:49 -0700 Nicholas Miell <nmiell@comcast.net> wrote:

> I don't think the security issues with this will ever make it
> worthwhile.

eh, security issues are a corner case.

The vast majority of Linux machines are used by a single user who has admin
access anyway.  This includes all embedded, all consumer and most laptop
and desktop.

So a reasonable way of getting the benefit of this change into most
people's hands is to forget about the uid/euid issues altogether and just
have a big fat knob which enables this feature, system-wide.  (Radical,
huh.   But then, I liked single user linux.)

A significant problem I see with any such approach is that it yet again
weakens the overall testing and QA effort: libc and the kernel now need to
be tested with and without this feature, and it's yet another question to
be asked of the bug reporters.

(But please take none of this as endorsement.  For some reason the whole
thing gives me the creepies).


^ permalink raw reply	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2007-06-30  7:52 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2007-06-27  2:44 [patch 2/3] MAP_NOZERO - implement sys_brk2() Davide Libenzi
2007-06-27  3:07 ` Rik van Riel
2007-06-27  3:33   ` Davide Libenzi
2007-06-27  3:45   ` Ulrich Drepper
2007-06-27  4:11     ` Rik van Riel
2007-06-27  5:04       ` Ulrich Drepper
2007-06-27  3:48 ` Ulrich Drepper
2007-06-27  3:55   ` Davide Libenzi
2007-06-27  5:02     ` Ulrich Drepper
2007-06-27 12:32       ` Hugh Dickins
2007-06-27 15:59         ` Davide Libenzi
2007-06-27 17:01         ` Ulrich Drepper
2007-06-27 17:43           ` Hugh Dickins
2007-06-27 18:13             ` Davide Libenzi
2007-06-27 18:32               ` Hugh Dickins
2007-06-27 18:45                 ` Davide Libenzi
2007-06-27 22:11                   ` Nicholas Miell
2007-06-28  0:17                     ` Davide Libenzi
2007-06-28  2:58                       ` Davide Libenzi
2007-06-30  7:52                     ` Andrew Morton
2007-06-27 18:52             ` Ulrich Drepper
2007-06-27 19:32               ` Hugh Dickins
2007-06-27 19:00           ` Rik van Riel
2007-06-27 19:22             ` Davide Libenzi
2007-06-27 16:05       ` Davide Libenzi

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).