* [PATCH 0/3] execve argument-copying fixes
@ 2010-09-08 2:34 ` Roland McGrath
0 siblings, 0 replies; 81+ messages in thread
From: Roland McGrath @ 2010-09-08 2:34 UTC (permalink / raw)
To: Linus Torvalds, Andrew Morton
Cc: linux-kernel, oss-security, Solar Designer, Kees Cook, Al Viro,
Andrew Morton, Oleg Nesterov, KOSAKI Motohiro, Neil Horman,
linux-fsdevel, pageexec,
Brad Spengler <spender@grsecurity.net> Eugene Teo
This is my take on parts of the execve large arguments copying issues
that Kees posted about, and Brad and others have been discussing.
I've only looked at the narrow area of the argument copying code
itself. I think these are good and necessary fixes. But I'm not
addressing the whole OOM killer/mm accounting issue, which also needs
to be fixed (and I have the impression others are already looking into that).
The following changes since commit d56557af19867edb8c0e96f8e26399698a08857f:
Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jbarnes/pci-2.6 (2010-09-07 16:00:17 -0700)
are available in the git repository at:
git://git.kernel.org/pub/scm/linux/kernel/git/frob/linux-2.6-roland.git topic/exec-fixes
Roland McGrath (3):
setup_arg_pages: diagnose excessive argument size
execve: improve interactivity with large arguments
execve: make responsive to SIGKILL with large arguments
fs/exec.c | 14 ++++++++++++++
1 files changed, 14 insertions(+), 0 deletions(-)
Thanks,
Roland
^ permalink raw reply [flat|nested] 81+ messages in thread
* [PATCH 1/3] setup_arg_pages: diagnose excessive argument size
2010-09-08 2:34 ` Roland McGrath
@ 2010-09-08 2:35 ` Roland McGrath
-1 siblings, 0 replies; 81+ messages in thread
From: Roland McGrath @ 2010-09-08 2:35 UTC (permalink / raw)
To: Linus Torvalds, Andrew Morton
Cc: linux-kernel, oss-security, Solar Designer, Kees Cook, Al Viro,
Andrew Morton, Oleg Nesterov, KOSAKI Motohiro, Neil Horman,
linux-fsdevel, pageexec,
Brad Spengler <spender@grsecurity.net> Eugene Teo
The CONFIG_STACK_GROWSDOWN variant of setup_arg_pages() does not
check the size of the argument/environment area on the stack.
When it is unworkably large, shift_arg_pages() hits its BUG_ON.
This is exploitable with a very large RLIMIT_STACK limit, to
create a crash pretty easily.
Check that the initial stack is not too large to make it possible
to map in any executable. We're not checking that the actual
executable (or intepreter, for binfmt_elf) will fit. So those
mappings might clobber part of the initial stack mapping. But
that is just userland lossage that userland made happen, not a
kernel problem.
Signed-off-by: Roland McGrath <roland@redhat.com>
---
fs/exec.c | 5 +++++
1 files changed, 5 insertions(+), 0 deletions(-)
diff --git a/fs/exec.c b/fs/exec.c
index 2d94552..1b63237 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -594,6 +594,11 @@ int setup_arg_pages(struct linux_binprm *bprm,
#else
stack_top = arch_align_stack(stack_top);
stack_top = PAGE_ALIGN(stack_top);
+
+ if (unlikely(stack_top < mmap_min_addr) ||
+ unlikely(vma->vm_end - vma->vm_start >= stack_top - mmap_min_addr))
+ return -ENOMEM;
+
stack_shift = vma->vm_end - stack_top;
bprm->p -= stack_shift;
^ permalink raw reply related [flat|nested] 81+ messages in thread
* [PATCH 1/3] setup_arg_pages: diagnose excessive argument size
@ 2010-09-08 2:35 ` Roland McGrath
0 siblings, 0 replies; 81+ messages in thread
From: Roland McGrath @ 2010-09-08 2:35 UTC (permalink / raw)
To: Linus Torvalds, Andrew Morton
Cc: linux-kernel, oss-security, Solar Designer, Kees Cook, Al Viro,
Andrew Morton, Oleg Nesterov, KOSAKI Motohiro, Neil Horman,
linux-fsdevel, pageexec,
Brad Spengler <spender@grsecurity.net> Eugene Teo
The CONFIG_STACK_GROWSDOWN variant of setup_arg_pages() does not
check the size of the argument/environment area on the stack.
When it is unworkably large, shift_arg_pages() hits its BUG_ON.
This is exploitable with a very large RLIMIT_STACK limit, to
create a crash pretty easily.
Check that the initial stack is not too large to make it possible
to map in any executable. We're not checking that the actual
executable (or intepreter, for binfmt_elf) will fit. So those
mappings might clobber part of the initial stack mapping. But
that is just userland lossage that userland made happen, not a
kernel problem.
Signed-off-by: Roland McGrath <roland@redhat.com>
---
fs/exec.c | 5 +++++
1 files changed, 5 insertions(+), 0 deletions(-)
diff --git a/fs/exec.c b/fs/exec.c
index 2d94552..1b63237 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -594,6 +594,11 @@ int setup_arg_pages(struct linux_binprm *bprm,
#else
stack_top = arch_align_stack(stack_top);
stack_top = PAGE_ALIGN(stack_top);
+
+ if (unlikely(stack_top < mmap_min_addr) ||
+ unlikely(vma->vm_end - vma->vm_start >= stack_top - mmap_min_addr))
+ return -ENOMEM;
+
stack_shift = vma->vm_end - stack_top;
bprm->p -= stack_shift;
^ permalink raw reply related [flat|nested] 81+ messages in thread
* Re: [PATCH 1/3] setup_arg_pages: diagnose excessive argument size
2010-09-08 2:35 ` Roland McGrath
@ 2010-09-08 8:29 ` pageexec
-1 siblings, 0 replies; 81+ messages in thread
From: pageexec @ 2010-09-08 8:29 UTC (permalink / raw)
To: Linus Torvalds, Andrew Morton, Roland McGrath
Cc: linux-kernel, oss-security, Solar Designer, Kees Cook, Al Viro,
Andrew Morton, Oleg Nesterov, KOSAKI Motohiro, Neil Horman,
linux-fsdevel,
Brad Spengler <spender@grsecurity.net> Eugene Teo
On 7 Sep 2010 at 19:35, Roland McGrath wrote:
> Check that the initial stack is not too large to make it possible
> to map in any executable. We're not checking that the actual
> executable (or intepreter, for binfmt_elf) will fit. So those
> mappings might clobber part of the initial stack mapping. But
> that is just userland lossage that userland made happen, not a
> kernel problem.
>
> Signed-off-by: Roland McGrath <roland@redhat.com>
> ---
> fs/exec.c | 5 +++++
> 1 files changed, 5 insertions(+), 0 deletions(-)
>
> diff --git a/fs/exec.c b/fs/exec.c
> index 2d94552..1b63237 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -594,6 +594,11 @@ int setup_arg_pages(struct linux_binprm *bprm,
> #else
> stack_top = arch_align_stack(stack_top);
> stack_top = PAGE_ALIGN(stack_top);
> +
> + if (unlikely(stack_top < mmap_min_addr) ||
this could arguably elicit some warning, or even better, prevent the sysctl from
setting mmap_min_addr too high in the first place. or alternatively, just check
for
mmap_min_addr > stack_top - (vma->vm_end - vma->vm_start)
which i think is more clear as to the intent of the check. and since the next
line already computes stack_shift as vma->vm_end - stack_top, you could do the
whole check afterwards as:
mmap_min_addr > vma->vm_start - stack_shift
which is even simpler and more to the point (you want what is called new_start in
shift_arg_pages to not violate mmap_min_addr). now you just need the int overflow
check, say: vma->vm_start < stack_shift, so the whole thing would become:
if (vma->vm_start < stack_shift || mmap_min_addr > vma->vm_start - stack_shift)
return -EFAULT;
which looks even better if done in shift_arg_pages as i've done it in PaX:
- BUG_ON(new_start > new_end);
+ if (new_start >= new_end || new_start < mmap_min_addr)
+ return -EFAULT;
+ unlikely(vma->vm_end - vma->vm_start >= stack_top - mmap_min_addr))
i think the = case is as valid as any other value close to it ;). also this code
is not a fast path, no need for unlikely i think.
> + return -ENOMEM;
i'd say EFAULT is more consistent, ENOMEM is used for cases when the kernel
couldn't allocate physical memory for something, EFAULT is when there's some
issue with the address space itself (based on the reaction to find_vma or
expand_stack failures).
also what about the BUG_ON in shift_arg_pages? it could go now, right?
personally, i prefer to do this the way i did it in PaX: move up the
shift_arg_pages call a bit and turn the BUG_ON into a proper check.
> stack_shift = vma->vm_end - stack_top;
>
> bprm->p -= stack_shift;
>
^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: [PATCH 1/3] setup_arg_pages: diagnose excessive argument size
@ 2010-09-08 8:29 ` pageexec
0 siblings, 0 replies; 81+ messages in thread
From: pageexec @ 2010-09-08 8:29 UTC (permalink / raw)
To: Linus Torvalds, Andrew Morton, Roland McGrath
Cc: linux-kernel, oss-security, Solar Designer, Kees Cook, Al Viro,
Andrew Morton, Oleg Nesterov, KOSAKI Motohiro, Neil Horman,
linux-fsdevel,
Brad Spengler <spender@grsecurity.net> Eugene Teo
On 7 Sep 2010 at 19:35, Roland McGrath wrote:
> Check that the initial stack is not too large to make it possible
> to map in any executable. We're not checking that the actual
> executable (or intepreter, for binfmt_elf) will fit. So those
> mappings might clobber part of the initial stack mapping. But
> that is just userland lossage that userland made happen, not a
> kernel problem.
>
> Signed-off-by: Roland McGrath <roland@redhat.com>
> ---
> fs/exec.c | 5 +++++
> 1 files changed, 5 insertions(+), 0 deletions(-)
>
> diff --git a/fs/exec.c b/fs/exec.c
> index 2d94552..1b63237 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -594,6 +594,11 @@ int setup_arg_pages(struct linux_binprm *bprm,
> #else
> stack_top = arch_align_stack(stack_top);
> stack_top = PAGE_ALIGN(stack_top);
> +
> + if (unlikely(stack_top < mmap_min_addr) ||
this could arguably elicit some warning, or even better, prevent the sysctl from
setting mmap_min_addr too high in the first place. or alternatively, just check
for
mmap_min_addr > stack_top - (vma->vm_end - vma->vm_start)
which i think is more clear as to the intent of the check. and since the next
line already computes stack_shift as vma->vm_end - stack_top, you could do the
whole check afterwards as:
mmap_min_addr > vma->vm_start - stack_shift
which is even simpler and more to the point (you want what is called new_start in
shift_arg_pages to not violate mmap_min_addr). now you just need the int overflow
check, say: vma->vm_start < stack_shift, so the whole thing would become:
if (vma->vm_start < stack_shift || mmap_min_addr > vma->vm_start - stack_shift)
return -EFAULT;
which looks even better if done in shift_arg_pages as i've done it in PaX:
- BUG_ON(new_start > new_end);
+ if (new_start >= new_end || new_start < mmap_min_addr)
+ return -EFAULT;
+ unlikely(vma->vm_end - vma->vm_start >= stack_top - mmap_min_addr))
i think the = case is as valid as any other value close to it ;). also this code
is not a fast path, no need for unlikely i think.
> + return -ENOMEM;
i'd say EFAULT is more consistent, ENOMEM is used for cases when the kernel
couldn't allocate physical memory for something, EFAULT is when there's some
issue with the address space itself (based on the reaction to find_vma or
expand_stack failures).
also what about the BUG_ON in shift_arg_pages? it could go now, right?
personally, i prefer to do this the way i did it in PaX: move up the
shift_arg_pages call a bit and turn the BUG_ON into a proper check.
> stack_shift = vma->vm_end - stack_top;
>
> bprm->p -= stack_shift;
>
^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: [PATCH 1/3] setup_arg_pages: diagnose excessive argument size
@ 2010-09-10 8:59 ` Roland McGrath
0 siblings, 0 replies; 81+ messages in thread
From: Roland McGrath @ 2010-09-10 8:59 UTC (permalink / raw)
To: pageexec
Cc: Linus Torvalds, Andrew Morton, linux-kernel, oss-security,
Solar Designer, Kees Cook, Al Viro, Oleg Nesterov,
KOSAKI Motohiro, Neil Horman, linux-fsdevel,
Brad Spengler <spender@grsecurity.net> Eugene Teo
> > + if (unlikely(stack_top < mmap_min_addr) ||
>
> this could arguably elicit some warning, or even better, prevent the
> sysctl from setting mmap_min_addr too high in the first place.
This code has local checks to ensure that things don't fail worse later.
Those are good to have regardless. If you'd like to add some constraints
on setting mmap_min_addr, that's certainly fine too. But it's no reason
not to have this simple and clear check here.
> or alternatively, just check for
>
> mmap_min_addr > stack_top - (vma->vm_end - vma->vm_start)
IMHO that is far less clear as to what the intent of the check is than what
I wrote. It's especially subtle that this check allows overflow and then
you check for that later, rather than the formulation I gave where no
overflow is possible and it's immediately obvious why.
> which looks even better if done in shift_arg_pages as i've done it in PaX:
My change to setup_arg_pages is consistent with the existing
CONFIG_STACK_GROWSUP code. IMHO simple fixes should go in first
and other restructuring of the code can be done later.
> > + return -ENOMEM;
>
> i'd say EFAULT is more consistent, ENOMEM is used for cases when the kernel
> couldn't allocate physical memory for something, EFAULT is when there's some
> issue with the address space itself (based on the reaction to find_vma or
> expand_stack failures).
I disagree. IMHO, the only proper use of EFAULT is for the passing of a
bad pointer (including passing data that includes a bad pointer, etc.).
So execve really should only use EFAULT when the pointers passed in the
system call are bad. I used ENOMEM purely for the reason that the
existing CONFIG_STACK_GROWSUP code I paralleled uses it. But it
certainly does make more sense than EFAULT does. ENOMEM is used for
cases where you couldn't get the address space you wanted (mmap), which
is a fairly close analogue to this case.
But the one error code that is actually correct for the case is E2BIG.
Obviously that's really the right thing for Argument list too long cases.
The choice of error code here is fairly academic. Any failure of
setup_arg_pages always leads to sending ourselves a SIGKILL before
returning the error code to percolate back to execve. So it's only the
error code seen in syscall tracing before the process dies. It's not
"user-visible" in the usual sense (i.e. POSIX doesn't care what code we
use here).
> also what about the BUG_ON in shift_arg_pages? it could go now, right?
It is now impossible by the logic of the arithmetic, yes. But it is
another local sanity check asserting the assumptions of the code in that
function. So there is no reason to
> personally, i prefer to do this the way i did it in PaX: move up the
> shift_arg_pages call a bit and turn the BUG_ON into a proper check.
I'm more comfortable with a fix that doesn't change any other aspect of the
behavior. Probably moving the call to shift_arg_pages around is fine. But
IMHO that belongs in a separate set of cleanup patches, and not in the
isolated fix for the BUG_ON hit.
Thanks,
Roland
^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: [PATCH 1/3] setup_arg_pages: diagnose excessive argument size
@ 2010-09-10 8:59 ` Roland McGrath
0 siblings, 0 replies; 81+ messages in thread
From: Roland McGrath @ 2010-09-10 8:59 UTC (permalink / raw)
To: pageexec-Y8qEzhMunLyT9ig0jae3mg
Cc: Linus Torvalds, Andrew Morton,
linux-kernel-u79uwXL29TY76Z2rM5mHXA,
oss-security-ZwoEplunGu1jrUoiu81ncdBPR1lH4CV8, Solar Designer,
Kees Cook, Al Viro, Oleg Nesterov, KOSAKI Motohiro, Neil Horman,
linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
Brad Spengler
<spender-JNS0hek0TMl4qEwOxq4T+Q@public.gmane.org> Eugene
Teo
> > + if (unlikely(stack_top < mmap_min_addr) ||
>
> this could arguably elicit some warning, or even better, prevent the
> sysctl from setting mmap_min_addr too high in the first place.
This code has local checks to ensure that things don't fail worse later.
Those are good to have regardless. If you'd like to add some constraints
on setting mmap_min_addr, that's certainly fine too. But it's no reason
not to have this simple and clear check here.
> or alternatively, just check for
>
> mmap_min_addr > stack_top - (vma->vm_end - vma->vm_start)
IMHO that is far less clear as to what the intent of the check is than what
I wrote. It's especially subtle that this check allows overflow and then
you check for that later, rather than the formulation I gave where no
overflow is possible and it's immediately obvious why.
> which looks even better if done in shift_arg_pages as i've done it in PaX:
My change to setup_arg_pages is consistent with the existing
CONFIG_STACK_GROWSUP code. IMHO simple fixes should go in first
and other restructuring of the code can be done later.
> > + return -ENOMEM;
>
> i'd say EFAULT is more consistent, ENOMEM is used for cases when the kernel
> couldn't allocate physical memory for something, EFAULT is when there's some
> issue with the address space itself (based on the reaction to find_vma or
> expand_stack failures).
I disagree. IMHO, the only proper use of EFAULT is for the passing of a
bad pointer (including passing data that includes a bad pointer, etc.).
So execve really should only use EFAULT when the pointers passed in the
system call are bad. I used ENOMEM purely for the reason that the
existing CONFIG_STACK_GROWSUP code I paralleled uses it. But it
certainly does make more sense than EFAULT does. ENOMEM is used for
cases where you couldn't get the address space you wanted (mmap), which
is a fairly close analogue to this case.
But the one error code that is actually correct for the case is E2BIG.
Obviously that's really the right thing for Argument list too long cases.
The choice of error code here is fairly academic. Any failure of
setup_arg_pages always leads to sending ourselves a SIGKILL before
returning the error code to percolate back to execve. So it's only the
error code seen in syscall tracing before the process dies. It's not
"user-visible" in the usual sense (i.e. POSIX doesn't care what code we
use here).
> also what about the BUG_ON in shift_arg_pages? it could go now, right?
It is now impossible by the logic of the arithmetic, yes. But it is
another local sanity check asserting the assumptions of the code in that
function. So there is no reason to
> personally, i prefer to do this the way i did it in PaX: move up the
> shift_arg_pages call a bit and turn the BUG_ON into a proper check.
I'm more comfortable with a fix that doesn't change any other aspect of the
behavior. Probably moving the call to shift_arg_pages around is fine. But
IMHO that belongs in a separate set of cleanup patches, and not in the
isolated fix for the BUG_ON hit.
Thanks,
Roland
^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: [PATCH 1/3] setup_arg_pages: diagnose excessive argument size
@ 2010-09-11 13:30 ` pageexec-Y8qEzhMunLyT9ig0jae3mg
0 siblings, 0 replies; 81+ messages in thread
From: pageexec @ 2010-09-11 13:30 UTC (permalink / raw)
To: Roland McGrath
Cc: Linus Torvalds, Andrew Morton, linux-kernel, oss-security,
Solar Designer, Kees Cook, Al Viro, Oleg Nesterov,
KOSAKI Motohiro, Neil Horman, linux-fsdevel,
Brad Spengler <spender@grsecurity.net> Eugene Teo
On 10 Sep 2010 at 1:59, Roland McGrath wrote:
> > > + if (unlikely(stack_top < mmap_min_addr) ||
> >
> > this could arguably elicit some warning, or even better, prevent the
> > sysctl from setting mmap_min_addr too high in the first place.
>
> This code has local checks to ensure that things don't fail worse later.
> Those are good to have regardless. If you'd like to add some constraints
> on setting mmap_min_addr, that's certainly fine too. But it's no reason
> not to have this simple and clear check here.
it's somewhat confusing as it does not immediately relate to the problem
you're addressing. let's take a step back and see what issues we have here:
1. the would-be stack vma shifts down so much that it becomes completely
invalid (due to the int wraparound on its start address).
2. the would be stack vma shifts down to violate mmap_min_addr.
it's obvious that before ensuring 2, 1 must pass. the check for 1 does not
at all involve mmap_min_addr, whereas both of your checks do. even worse, the
check for 2 doesn't depend on the relation between stack_top and mmap_min_addr
per se (userland has no influence over them), so as i said, it's all confusing.
> > or alternatively, just check for
> >
> > mmap_min_addr > stack_top - (vma->vm_end - vma->vm_start)
>
> IMHO that is far less clear as to what the intent of the check is than what
> I wrote.
why is that? it's obvious that vm_end-vm_size=vm_size and the check simply
ensures that problem 2 above doesn't occur whereas in neither of your checks
can one see the clear intent for ensuring 2. your 2nd check basically ensures
that vm_size fits in the usable address space bounded by [mmap_min_addr, stack_top]
but it doesn't immediately tell the user why that size check is useful (especially
since we're not concerned with sizes per se, but addresses, it's just a corollary
in this particular case that a size check can replace an address check, but that
doesn't mean it'll help any future reader see the original intent). so while
the math is ok in your checks (modulo the = case), it's a kind of premature
optimization that hides the problems you're addressing.
> It's especially subtle that this check allows overflow
> and then you check for that later,
no, actually that particular check was for addressing one particular
problem (to show you how to formulate it that makes intent clear), as a
train of thought if you like. the final proposed version (which has been
in PaX) has the int overflow check first and then the mmap_min_addr check.
> rather than the formulation I gave
> where no overflow is possible and it's immediately obvious why.
you're only focused on the math and let the human intent slip by, that's
the issue i have with your checks (and you didn't get the = corner case
right btw, it should be allowed, nor refused if you want to be pedantic).
really, in security clarity of intent and its manifestation in code takes
precedence over optimization, let the compiler do the latter job.
> > which looks even better if done in shift_arg_pages as i've done it in PaX:
>
> My change to setup_arg_pages is consistent with the existing
> CONFIG_STACK_GROWSUP code.
the GROWSUP code doesn't use unlikely (very confusing to see it in this
context since you're not addressing any hot path/performance issues), nor
does it care about mmap_min_addr (whereas both of your checks do which
depending on the generated asm, may even be racy when mmap_min_addr changes)
nor does it check for any int overflows. about the only similarity i can
see is its computation of 'vm_size' and an associated size check (properly
commented unlike your code) but it's normal there, whereas the problems you're
addressing are fundamentally vma address related, not vma size related.
> IMHO simple fixes should go in first
> and other restructuring of the code can be done later.
yours is not a simple fix as it doesn't make it clear what two problems
you're trying to address with your checks.
> > > + return -ENOMEM;
> >
> > i'd say EFAULT is more consistent, ENOMEM is used for cases when the kernel
> > couldn't allocate physical memory for something, EFAULT is when there's some
> > issue with the address space itself (based on the reaction to find_vma or
> > expand_stack failures).
>
> I disagree.
well, in that case the EFAULT return codes in this function need some
cleanup too, you know, in the name of consistency ;).
> > also what about the BUG_ON in shift_arg_pages? it could go now, right?
>
> It is now impossible by the logic of the arithmetic, yes. But it is
> another local sanity check asserting the assumptions of the code in that
> function. So there is no reason to
there is no reason for it to be a BUG_ON (never was in fact), the kernel
can easily recover from the detected condition.
cheers,
PaX Team
PS: i thought it was common courtesy and customary to credit people in the
commit who found/reported/fixed problems.
^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: [PATCH 1/3] setup_arg_pages: diagnose excessive argument size
@ 2010-09-11 13:30 ` pageexec-Y8qEzhMunLyT9ig0jae3mg
0 siblings, 0 replies; 81+ messages in thread
From: pageexec-Y8qEzhMunLyT9ig0jae3mg @ 2010-09-11 13:30 UTC (permalink / raw)
To: Roland McGrath
Cc: Linus Torvalds, Andrew Morton,
linux-kernel-u79uwXL29TY76Z2rM5mHXA,
oss-security-ZwoEplunGu1jrUoiu81ncdBPR1lH4CV8, Solar Designer,
Kees Cook, Al Viro, Oleg Nesterov, KOSAKI Motohiro, Neil Horman,
linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
Brad Spengler
<spender-JNS0hek0TMl4qEwOxq4T+Q@public.gmane.org> Eugene
Teo
On 10 Sep 2010 at 1:59, Roland McGrath wrote:
> > > + if (unlikely(stack_top < mmap_min_addr) ||
> >
> > this could arguably elicit some warning, or even better, prevent the
> > sysctl from setting mmap_min_addr too high in the first place.
>
> This code has local checks to ensure that things don't fail worse later.
> Those are good to have regardless. If you'd like to add some constraints
> on setting mmap_min_addr, that's certainly fine too. But it's no reason
> not to have this simple and clear check here.
it's somewhat confusing as it does not immediately relate to the problem
you're addressing. let's take a step back and see what issues we have here:
1. the would-be stack vma shifts down so much that it becomes completely
invalid (due to the int wraparound on its start address).
2. the would be stack vma shifts down to violate mmap_min_addr.
it's obvious that before ensuring 2, 1 must pass. the check for 1 does not
at all involve mmap_min_addr, whereas both of your checks do. even worse, the
check for 2 doesn't depend on the relation between stack_top and mmap_min_addr
per se (userland has no influence over them), so as i said, it's all confusing.
> > or alternatively, just check for
> >
> > mmap_min_addr > stack_top - (vma->vm_end - vma->vm_start)
>
> IMHO that is far less clear as to what the intent of the check is than what
> I wrote.
why is that? it's obvious that vm_end-vm_size=vm_size and the check simply
ensures that problem 2 above doesn't occur whereas in neither of your checks
can one see the clear intent for ensuring 2. your 2nd check basically ensures
that vm_size fits in the usable address space bounded by [mmap_min_addr, stack_top]
but it doesn't immediately tell the user why that size check is useful (especially
since we're not concerned with sizes per se, but addresses, it's just a corollary
in this particular case that a size check can replace an address check, but that
doesn't mean it'll help any future reader see the original intent). so while
the math is ok in your checks (modulo the = case), it's a kind of premature
optimization that hides the problems you're addressing.
> It's especially subtle that this check allows overflow
> and then you check for that later,
no, actually that particular check was for addressing one particular
problem (to show you how to formulate it that makes intent clear), as a
train of thought if you like. the final proposed version (which has been
in PaX) has the int overflow check first and then the mmap_min_addr check.
> rather than the formulation I gave
> where no overflow is possible and it's immediately obvious why.
you're only focused on the math and let the human intent slip by, that's
the issue i have with your checks (and you didn't get the = corner case
right btw, it should be allowed, nor refused if you want to be pedantic).
really, in security clarity of intent and its manifestation in code takes
precedence over optimization, let the compiler do the latter job.
> > which looks even better if done in shift_arg_pages as i've done it in PaX:
>
> My change to setup_arg_pages is consistent with the existing
> CONFIG_STACK_GROWSUP code.
the GROWSUP code doesn't use unlikely (very confusing to see it in this
context since you're not addressing any hot path/performance issues), nor
does it care about mmap_min_addr (whereas both of your checks do which
depending on the generated asm, may even be racy when mmap_min_addr changes)
nor does it check for any int overflows. about the only similarity i can
see is its computation of 'vm_size' and an associated size check (properly
commented unlike your code) but it's normal there, whereas the problems you're
addressing are fundamentally vma address related, not vma size related.
> IMHO simple fixes should go in first
> and other restructuring of the code can be done later.
yours is not a simple fix as it doesn't make it clear what two problems
you're trying to address with your checks.
> > > + return -ENOMEM;
> >
> > i'd say EFAULT is more consistent, ENOMEM is used for cases when the kernel
> > couldn't allocate physical memory for something, EFAULT is when there's some
> > issue with the address space itself (based on the reaction to find_vma or
> > expand_stack failures).
>
> I disagree.
well, in that case the EFAULT return codes in this function need some
cleanup too, you know, in the name of consistency ;).
> > also what about the BUG_ON in shift_arg_pages? it could go now, right?
>
> It is now impossible by the logic of the arithmetic, yes. But it is
> another local sanity check asserting the assumptions of the code in that
> function. So there is no reason to
there is no reason for it to be a BUG_ON (never was in fact), the kernel
can easily recover from the detected condition.
cheers,
PaX Team
PS: i thought it was common courtesy and customary to credit people in the
commit who found/reported/fixed problems.
^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: [PATCH 1/3] setup_arg_pages: diagnose excessive argument size
2010-09-11 13:30 ` pageexec-Y8qEzhMunLyT9ig0jae3mg
(?)
@ 2010-09-14 19:33 ` Roland McGrath
2010-09-14 22:35 ` pageexec-Y8qEzhMunLyT9ig0jae3mg
-1 siblings, 1 reply; 81+ messages in thread
From: Roland McGrath @ 2010-09-14 19:33 UTC (permalink / raw)
To: pageexec
Cc: Linus Torvalds, Andrew Morton, linux-kernel, oss-security,
Solar Designer, Kees Cook, Al Viro, Oleg Nesterov,
KOSAKI Motohiro, Neil Horman, linux-fsdevel, Brad Spengler,
Eugene Teo
Readability and obviousness are in the eye of the beholder.
Linus has merged my patches, so this crash is now fixed.
As I said before, you should by all means send further clean-up
patches if you think they are improvements.
I have no special interest in this area. I gave my opinions and
advice because I was CC'd, and then submitted some patches myself
because I was asked to. I can't really keep track of all the
formalities. The CC list on my submissions represented the set
of people I knew to be involved in the discussion. I'm sorry if
you felt slighted. I just pay attention to the code, not the
personalities. The most reliable way to make sure your name is
associated with a change is to submit patches yourself in a form
that Linus wants to merge.
Thanks,
Roland
^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: [PATCH 1/3] setup_arg_pages: diagnose excessive argument size
@ 2010-09-14 22:35 ` pageexec-Y8qEzhMunLyT9ig0jae3mg
0 siblings, 0 replies; 81+ messages in thread
From: pageexec @ 2010-09-14 22:35 UTC (permalink / raw)
To: Roland McGrath
Cc: Linus Torvalds, Andrew Morton, linux-kernel, oss-security,
Solar Designer, Kees Cook, Al Viro, Oleg Nesterov,
KOSAKI Motohiro, Neil Horman, linux-fsdevel, Brad Spengler,
Eugene Teo
On 14 Sep 2010 at 12:33, Roland McGrath wrote:
> I have no special interest in this area. I gave my opinions and
> advice because I was CC'd, and then submitted some patches myself
> because I was asked to. I can't really keep track of all the
> formalities. The CC list on my submissions represented the set
> of people I knew to be involved in the discussion. I'm sorry if
> you felt slighted. I just pay attention to the code, not the
> personalities.
i was referring to Brad, since he reported this (and the wireless
heap infoleak which he also wasn't credited for). anyway, we won't be
submitting any further vulnerability reports, so no need to worry about
any 'formalities' in the future.
> The most reliable way to make sure your name is
> associated with a change is to submit patches yourself in a form
> that Linus wants to merge.
his conditions are unacceptable and therefore it's not going to happen.
^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: [PATCH 1/3] setup_arg_pages: diagnose excessive argument size
@ 2010-09-14 22:35 ` pageexec-Y8qEzhMunLyT9ig0jae3mg
0 siblings, 0 replies; 81+ messages in thread
From: pageexec-Y8qEzhMunLyT9ig0jae3mg @ 2010-09-14 22:35 UTC (permalink / raw)
To: Roland McGrath
Cc: Linus Torvalds, Andrew Morton,
linux-kernel-u79uwXL29TY76Z2rM5mHXA,
oss-security-ZwoEplunGu1jrUoiu81ncdBPR1lH4CV8, Solar Designer,
Kees Cook, Al Viro, Oleg Nesterov, KOSAKI Motohiro, Neil Horman,
linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Brad Spengler, Eugene Teo
On 14 Sep 2010 at 12:33, Roland McGrath wrote:
> I have no special interest in this area. I gave my opinions and
> advice because I was CC'd, and then submitted some patches myself
> because I was asked to. I can't really keep track of all the
> formalities. The CC list on my submissions represented the set
> of people I knew to be involved in the discussion. I'm sorry if
> you felt slighted. I just pay attention to the code, not the
> personalities.
i was referring to Brad, since he reported this (and the wireless
heap infoleak which he also wasn't credited for). anyway, we won't be
submitting any further vulnerability reports, so no need to worry about
any 'formalities' in the future.
> The most reliable way to make sure your name is
> associated with a change is to submit patches yourself in a form
> that Linus wants to merge.
his conditions are unacceptable and therefore it's not going to happen.
^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: [PATCH 1/3] setup_arg_pages: diagnose excessive argument size
2010-09-08 2:35 ` Roland McGrath
@ 2010-09-08 11:57 ` Brad Spengler
-1 siblings, 0 replies; 81+ messages in thread
From: Brad Spengler @ 2010-09-08 11:57 UTC (permalink / raw)
To: Roland McGrath
Cc: Linus Torvalds, Andrew Morton, linux-kernel, oss-security,
Solar Designer, Kees Cook, Al Viro, Oleg Nesterov,
KOSAKI Motohiro, Neil Horman, linux-fsdevel, pageexec,
Brad Spengler <spender@grsecurity.net> Eugene Teo
[-- Attachment #1: Type: text/plain, Size: 2168 bytes --]
I still don't think this addresses the whole problem. Without question,
the rlimit / 4 check is bogus. If nobody agrees with the intent of that
check, then it should be removed, but I think the better solution is to
fix the check so that it matches its original intent: let the initial
stack setup be up to 1/Xth of the min(rlimit, TASK_SIZE dependent upon
personality), which allows space for additional stack setup in the ELF
loader and then further growth once the process is live. If that
amount is overstepped, then the exec will return an error to the calling
process instead of being terminated.
It might be useful to consult with the people who introduced/approved
the check in the first place, as they seemed to have reasons for
implementing it.
-Brad
On Tue, Sep 07, 2010 at 07:35:49PM -0700, Roland McGrath wrote:
> The CONFIG_STACK_GROWSDOWN variant of setup_arg_pages() does not
> check the size of the argument/environment area on the stack.
> When it is unworkably large, shift_arg_pages() hits its BUG_ON.
> This is exploitable with a very large RLIMIT_STACK limit, to
> create a crash pretty easily.
>
> Check that the initial stack is not too large to make it possible
> to map in any executable. We're not checking that the actual
> executable (or intepreter, for binfmt_elf) will fit. So those
> mappings might clobber part of the initial stack mapping. But
> that is just userland lossage that userland made happen, not a
> kernel problem.
>
> Signed-off-by: Roland McGrath <roland@redhat.com>
> ---
> fs/exec.c | 5 +++++
> 1 files changed, 5 insertions(+), 0 deletions(-)
>
> diff --git a/fs/exec.c b/fs/exec.c
> index 2d94552..1b63237 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -594,6 +594,11 @@ int setup_arg_pages(struct linux_binprm *bprm,
> #else
> stack_top = arch_align_stack(stack_top);
> stack_top = PAGE_ALIGN(stack_top);
> +
> + if (unlikely(stack_top < mmap_min_addr) ||
> + unlikely(vma->vm_end - vma->vm_start >= stack_top - mmap_min_addr))
> + return -ENOMEM;
> +
> stack_shift = vma->vm_end - stack_top;
>
> bprm->p -= stack_shift;
[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 197 bytes --]
^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: [PATCH 1/3] setup_arg_pages: diagnose excessive argument size
@ 2010-09-08 11:57 ` Brad Spengler
0 siblings, 0 replies; 81+ messages in thread
From: Brad Spengler @ 2010-09-08 11:57 UTC (permalink / raw)
To: Roland McGrath
Cc: Linus Torvalds, Andrew Morton, linux-kernel, oss-security,
Solar Designer, Kees Cook, Al Viro, Oleg Nesterov,
KOSAKI Motohiro, Neil Horman, linux-fsdevel, pageexec,
Brad Spengler <spender@grsecurity.net> Eugene Teo
[-- Attachment #1: Type: text/plain, Size: 2168 bytes --]
I still don't think this addresses the whole problem. Without question,
the rlimit / 4 check is bogus. If nobody agrees with the intent of that
check, then it should be removed, but I think the better solution is to
fix the check so that it matches its original intent: let the initial
stack setup be up to 1/Xth of the min(rlimit, TASK_SIZE dependent upon
personality), which allows space for additional stack setup in the ELF
loader and then further growth once the process is live. If that
amount is overstepped, then the exec will return an error to the calling
process instead of being terminated.
It might be useful to consult with the people who introduced/approved
the check in the first place, as they seemed to have reasons for
implementing it.
-Brad
On Tue, Sep 07, 2010 at 07:35:49PM -0700, Roland McGrath wrote:
> The CONFIG_STACK_GROWSDOWN variant of setup_arg_pages() does not
> check the size of the argument/environment area on the stack.
> When it is unworkably large, shift_arg_pages() hits its BUG_ON.
> This is exploitable with a very large RLIMIT_STACK limit, to
> create a crash pretty easily.
>
> Check that the initial stack is not too large to make it possible
> to map in any executable. We're not checking that the actual
> executable (or intepreter, for binfmt_elf) will fit. So those
> mappings might clobber part of the initial stack mapping. But
> that is just userland lossage that userland made happen, not a
> kernel problem.
>
> Signed-off-by: Roland McGrath <roland@redhat.com>
> ---
> fs/exec.c | 5 +++++
> 1 files changed, 5 insertions(+), 0 deletions(-)
>
> diff --git a/fs/exec.c b/fs/exec.c
> index 2d94552..1b63237 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -594,6 +594,11 @@ int setup_arg_pages(struct linux_binprm *bprm,
> #else
> stack_top = arch_align_stack(stack_top);
> stack_top = PAGE_ALIGN(stack_top);
> +
> + if (unlikely(stack_top < mmap_min_addr) ||
> + unlikely(vma->vm_end - vma->vm_start >= stack_top - mmap_min_addr))
> + return -ENOMEM;
> +
> stack_shift = vma->vm_end - stack_top;
>
> bprm->p -= stack_shift;
[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 197 bytes --]
^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: [PATCH 1/3] setup_arg_pages: diagnose excessive argument size
2010-09-08 11:57 ` Brad Spengler
@ 2010-09-09 5:31 ` KOSAKI Motohiro
-1 siblings, 0 replies; 81+ messages in thread
From: KOSAKI Motohiro @ 2010-09-09 5:31 UTC (permalink / raw)
To: Brad Spengler
Cc: kosaki.motohiro, Roland McGrath, Linus Torvalds, Andrew Morton,
linux-kernel, oss-security, Solar Designer, Kees Cook, Al Viro,
Oleg Nesterov, Neil Horman, linux-fsdevel, pageexec,
Brad Spengler <spender@grsecurity.net> Eugene Teo
> I still don't think this addresses the whole problem. Without question,
> the rlimit / 4 check is bogus. If nobody agrees with the intent of that
> check, then it should be removed, but I think the better solution is to
> fix the check so that it matches its original intent: let the initial
> stack setup be up to 1/Xth of the min(rlimit, TASK_SIZE dependent upon
> personality), which allows space for additional stack setup in the ELF
> loader and then further growth once the process is live. If that
> amount is overstepped, then the exec will return an error to the calling
> process instead of being terminated.
>
> It might be useful to consult with the people who introduced/approved
> the check in the first place, as they seemed to have reasons for
> implementing it.
Brad, sorry, I have bad news. glibc sysconf(_SC_ARG_MAX) is implemented
by hard coded RLIMIT_STACK/4 heuristics. That said, at least _now_, we
can't change this even though you disliked. That said, we can't break
userland even though userland library is very crazy.
I don't dislike your "1/Xth of the min(rlimit, TASK_SIZE dependent upon
> personality)" idea. however I think You and Roland haven't agreed this
point yet. he seems to want "unlimited" works as "unlimited". then, now
I don't make such patch. Instead, I would propose to insert
__vm_enough_memory() check in execve() pass. It prevent almost argv attack.
^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: [PATCH 1/3] setup_arg_pages: diagnose excessive argument size
@ 2010-09-09 5:31 ` KOSAKI Motohiro
0 siblings, 0 replies; 81+ messages in thread
From: KOSAKI Motohiro @ 2010-09-09 5:31 UTC (permalink / raw)
To: Brad Spengler
Cc: kosaki.motohiro, Roland McGrath, Linus Torvalds, Andrew Morton,
linux-kernel, oss-security, Solar Designer, Kees Cook, Al Viro,
Oleg Nesterov, Neil Horman, linux-fsdevel, pageexec,
Brad Spengler <spender@grsecurity.net> Eugene Teo
> I still don't think this addresses the whole problem. Without question,
> the rlimit / 4 check is bogus. If nobody agrees with the intent of that
> check, then it should be removed, but I think the better solution is to
> fix the check so that it matches its original intent: let the initial
> stack setup be up to 1/Xth of the min(rlimit, TASK_SIZE dependent upon
> personality), which allows space for additional stack setup in the ELF
> loader and then further growth once the process is live. If that
> amount is overstepped, then the exec will return an error to the calling
> process instead of being terminated.
>
> It might be useful to consult with the people who introduced/approved
> the check in the first place, as they seemed to have reasons for
> implementing it.
Brad, sorry, I have bad news. glibc sysconf(_SC_ARG_MAX) is implemented
by hard coded RLIMIT_STACK/4 heuristics. That said, at least _now_, we
can't change this even though you disliked. That said, we can't break
userland even though userland library is very crazy.
I don't dislike your "1/Xth of the min(rlimit, TASK_SIZE dependent upon
> personality)" idea. however I think You and Roland haven't agreed this
point yet. he seems to want "unlimited" works as "unlimited". then, now
I don't make such patch. Instead, I would propose to insert
__vm_enough_memory() check in execve() pass. It prevent almost argv attack.
^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: [PATCH 1/3] setup_arg_pages: diagnose excessive argument size
@ 2010-09-10 9:25 ` Roland McGrath
0 siblings, 0 replies; 81+ messages in thread
From: Roland McGrath @ 2010-09-10 9:25 UTC (permalink / raw)
To: KOSAKI Motohiro
Cc: Brad Spengler, Linus Torvalds, Andrew Morton, linux-kernel,
oss-security, Solar Designer, Kees Cook, Al Viro, Oleg Nesterov,
Neil Horman, linux-fsdevel, pageexec, Brad Spengler, Eugene Teo
> Brad, sorry, I have bad news. glibc sysconf(_SC_ARG_MAX) is implemented
> by hard coded RLIMIT_STACK/4 heuristics. That said, at least _now_, we
> can't change this even though you disliked. That said, we can't break
> userland even though userland library is very crazy.
I'm sorry you think it's "very crazy" to implement the required
functionality in the only way available. POSIX requires that execve
fail with E2BIG when the ARG_MAX limit is exceeded. sysconf has to
return the correct actual limit that execve will enforce so that a
conforming application knows how much it can safely attempt to use.
Since the kernel uses the hard-coded RLIMIT_STACK/4 heuristic and does
not expose the true manifest limit any other way, sysconf has to
parallel the kernel's calculation.
Thanks,
Roland
^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: [PATCH 1/3] setup_arg_pages: diagnose excessive argument size
@ 2010-09-10 9:25 ` Roland McGrath
0 siblings, 0 replies; 81+ messages in thread
From: Roland McGrath @ 2010-09-10 9:25 UTC (permalink / raw)
To: KOSAKI Motohiro
Cc: Brad Spengler, Linus Torvalds, Andrew Morton,
linux-kernel-u79uwXL29TY76Z2rM5mHXA,
oss-security-ZwoEplunGu1jrUoiu81ncdBPR1lH4CV8, Solar Designer,
Kees Cook, Al Viro, Oleg Nesterov, Neil Horman,
linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
pageexec-Y8qEzhMunLyT9ig0jae3mg, Brad Spengler, Eugene Teo
> Brad, sorry, I have bad news. glibc sysconf(_SC_ARG_MAX) is implemented
> by hard coded RLIMIT_STACK/4 heuristics. That said, at least _now_, we
> can't change this even though you disliked. That said, we can't break
> userland even though userland library is very crazy.
I'm sorry you think it's "very crazy" to implement the required
functionality in the only way available. POSIX requires that execve
fail with E2BIG when the ARG_MAX limit is exceeded. sysconf has to
return the correct actual limit that execve will enforce so that a
conforming application knows how much it can safely attempt to use.
Since the kernel uses the hard-coded RLIMIT_STACK/4 heuristic and does
not expose the true manifest limit any other way, sysconf has to
parallel the kernel's calculation.
Thanks,
Roland
^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: [PATCH 1/3] setup_arg_pages: diagnose excessive argument size
2010-09-10 9:25 ` Roland McGrath
(?)
@ 2010-09-10 9:43 ` KOSAKI Motohiro
-1 siblings, 0 replies; 81+ messages in thread
From: KOSAKI Motohiro @ 2010-09-10 9:43 UTC (permalink / raw)
To: Roland McGrath
Cc: kosaki.motohiro, Brad Spengler, Linus Torvalds, Andrew Morton,
linux-kernel, oss-security, Solar Designer, Kees Cook, Al Viro,
Oleg Nesterov, Neil Horman, linux-fsdevel, pageexec, Eugene Teo
> > Brad, sorry, I have bad news. glibc sysconf(_SC_ARG_MAX) is implemented
> > by hard coded RLIMIT_STACK/4 heuristics. That said, at least _now_, we
> > can't change this even though you disliked. That said, we can't break
> > userland even though userland library is very crazy.
>
> I'm sorry you think it's "very crazy" to implement the required
> functionality in the only way available. POSIX requires that execve
> fail with E2BIG when the ARG_MAX limit is exceeded. sysconf has to
> return the correct actual limit that execve will enforce so that a
> conforming application knows how much it can safely attempt to use.
> Since the kernel uses the hard-coded RLIMIT_STACK/4 heuristic and does
> not expose the true manifest limit any other way, sysconf has to
> parallel the kernel's calculation.
Hmm...
Probably my poor english leaded to misunderstood. I didn't intent glibc
is very crazy. I only intended to "even if userland is crazy, I disagree
to break userland".
And yes, we obviously need to expose ARG_MAX limit to libc. a duplicated
heuristic code easily makes confusion and mistake. nobody want such
fragile state. however, it's a bit offtopic. anyway.
Thanks.
^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: [PATCH 1/3] setup_arg_pages: diagnose excessive argument size
@ 2010-09-11 13:39 ` pageexec-Y8qEzhMunLyT9ig0jae3mg
0 siblings, 0 replies; 81+ messages in thread
From: pageexec @ 2010-09-11 13:39 UTC (permalink / raw)
To: KOSAKI Motohiro, Roland McGrath
Cc: Brad Spengler, Linus Torvalds, Andrew Morton, linux-kernel,
oss-security, Solar Designer, Kees Cook, Al Viro, Oleg Nesterov,
Neil Horman, linux-fsdevel, Eugene Teo
On 10 Sep 2010 at 2:25, Roland McGrath wrote:
> > Brad, sorry, I have bad news. glibc sysconf(_SC_ARG_MAX) is implemented
> > by hard coded RLIMIT_STACK/4 heuristics. That said, at least _now_, we
> > can't change this even though you disliked. That said, we can't break
> > userland even though userland library is very crazy.
>
> I'm sorry you think it's "very crazy" to implement the required
> functionality in the only way available. POSIX requires that execve
> fail with E2BIG when the ARG_MAX limit is exceeded. sysconf has to
> return the correct actual limit that execve will enforce so that a
> conforming application knows how much it can safely attempt to use.
> Since the kernel uses the hard-coded RLIMIT_STACK/4 heuristic and does
> not expose the true manifest limit any other way, sysconf has to
> parallel the kernel's calculation.
no it doesn't have to, similarly to how it doesn't have to hardcode
_SC_PAGESIZE either, AT_PAGESZ tells userland what it needs to know
and i think AT_ARGMAX could exist just as well.
^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: [PATCH 1/3] setup_arg_pages: diagnose excessive argument size
@ 2010-09-11 13:39 ` pageexec-Y8qEzhMunLyT9ig0jae3mg
0 siblings, 0 replies; 81+ messages in thread
From: pageexec-Y8qEzhMunLyT9ig0jae3mg @ 2010-09-11 13:39 UTC (permalink / raw)
To: KOSAKI Motohiro, Roland McGrath
Cc: Brad Spengler, Linus Torvalds, Andrew Morton,
linux-kernel-u79uwXL29TY76Z2rM5mHXA,
oss-security-ZwoEplunGu1jrUoiu81ncdBPR1lH4CV8, Solar Designer,
Kees Cook, Al Viro, Oleg Nesterov, Neil Horman,
linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Eugene Teo
On 10 Sep 2010 at 2:25, Roland McGrath wrote:
> > Brad, sorry, I have bad news. glibc sysconf(_SC_ARG_MAX) is implemented
> > by hard coded RLIMIT_STACK/4 heuristics. That said, at least _now_, we
> > can't change this even though you disliked. That said, we can't break
> > userland even though userland library is very crazy.
>
> I'm sorry you think it's "very crazy" to implement the required
> functionality in the only way available. POSIX requires that execve
> fail with E2BIG when the ARG_MAX limit is exceeded. sysconf has to
> return the correct actual limit that execve will enforce so that a
> conforming application knows how much it can safely attempt to use.
> Since the kernel uses the hard-coded RLIMIT_STACK/4 heuristic and does
> not expose the true manifest limit any other way, sysconf has to
> parallel the kernel's calculation.
no it doesn't have to, similarly to how it doesn't have to hardcode
_SC_PAGESIZE either, AT_PAGESZ tells userland what it needs to know
and i think AT_ARGMAX could exist just as well.
^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: [PATCH 1/3] setup_arg_pages: diagnose excessive argument size
@ 2010-09-14 18:51 ` Roland McGrath
0 siblings, 0 replies; 81+ messages in thread
From: Roland McGrath @ 2010-09-14 18:51 UTC (permalink / raw)
To: pageexec
Cc: KOSAKI Motohiro, Brad Spengler, Linus Torvalds, Andrew Morton,
linux-kernel, oss-security, Solar Designer, Kees Cook, Al Viro,
Oleg Nesterov, Neil Horman, linux-fsdevel, Eugene Teo
> no it doesn't have to, similarly to how it doesn't have to hardcode
> _SC_PAGESIZE either, AT_PAGESZ tells userland what it needs to know
> and i think AT_ARGMAX could exist just as well.
I was referring to the ways available to userland heretofore. Certainly,
the kernel could add new ways and then userland could do different things
(with new kernels).
auxv in particular is not a mechanism that could fit for this. The actual
limit depends on rlimits of the calling process, and rlimits can change
during the life of the program. auxv is only appropriate for things that
are known at the time of the exec and won't change thereafter.
Thanks,
Roland
^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: [PATCH 1/3] setup_arg_pages: diagnose excessive argument size
@ 2010-09-14 18:51 ` Roland McGrath
0 siblings, 0 replies; 81+ messages in thread
From: Roland McGrath @ 2010-09-14 18:51 UTC (permalink / raw)
To: pageexec-Y8qEzhMunLyT9ig0jae3mg
Cc: KOSAKI Motohiro, Brad Spengler, Linus Torvalds, Andrew Morton,
linux-kernel-u79uwXL29TY76Z2rM5mHXA,
oss-security-ZwoEplunGu1jrUoiu81ncdBPR1lH4CV8, Solar Designer,
Kees Cook, Al Viro, Oleg Nesterov, Neil Horman,
linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Eugene Teo
> no it doesn't have to, similarly to how it doesn't have to hardcode
> _SC_PAGESIZE either, AT_PAGESZ tells userland what it needs to know
> and i think AT_ARGMAX could exist just as well.
I was referring to the ways available to userland heretofore. Certainly,
the kernel could add new ways and then userland could do different things
(with new kernels).
auxv in particular is not a mechanism that could fit for this. The actual
limit depends on rlimits of the calling process, and rlimits can change
during the life of the program. auxv is only appropriate for things that
are known at the time of the exec and won't change thereafter.
Thanks,
Roland
^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: [PATCH 1/3] setup_arg_pages: diagnose excessive argument size
@ 2010-09-14 20:28 ` pageexec-Y8qEzhMunLyT9ig0jae3mg
0 siblings, 0 replies; 81+ messages in thread
From: pageexec @ 2010-09-14 20:28 UTC (permalink / raw)
To: Roland McGrath
Cc: KOSAKI Motohiro, Brad Spengler, Linus Torvalds, Andrew Morton,
linux-kernel, oss-security, Solar Designer, Kees Cook, Al Viro,
Oleg Nesterov, Neil Horman, linux-fsdevel, Eugene Teo
On 14 Sep 2010 at 11:51, Roland McGrath wrote:
> > no it doesn't have to, similarly to how it doesn't have to hardcode
> > _SC_PAGESIZE either, AT_PAGESZ tells userland what it needs to know
> > and i think AT_ARGMAX could exist just as well.
>
> I was referring to the ways available to userland heretofore. Certainly,
> the kernel could add new ways and then userland could do different things
> (with new kernels).
>
> auxv in particular is not a mechanism that could fit for this. The actual
> limit depends on rlimits of the calling process, and rlimits can change
> during the life of the program.
obviously an AT_ARGMAX computed at execve time would be based on the rlimits
as well and if later userland changed the rlimits, it'd be userland's problem,
not that of the kernel (or the kernel could refuse a change that would violate
its earlier promise).
> auxv is only appropriate for things that
> are known at the time of the exec and won't change thereafter.
you mean stuff like AT_EUID et al.? ;)
^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: [PATCH 1/3] setup_arg_pages: diagnose excessive argument size
@ 2010-09-14 20:28 ` pageexec-Y8qEzhMunLyT9ig0jae3mg
0 siblings, 0 replies; 81+ messages in thread
From: pageexec-Y8qEzhMunLyT9ig0jae3mg @ 2010-09-14 20:28 UTC (permalink / raw)
To: Roland McGrath
Cc: KOSAKI Motohiro, Brad Spengler, Linus Torvalds, Andrew Morton,
linux-kernel-u79uwXL29TY76Z2rM5mHXA,
oss-security-ZwoEplunGu1jrUoiu81ncdBPR1lH4CV8, Solar Designer,
Kees Cook, Al Viro, Oleg Nesterov, Neil Horman,
linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Eugene Teo
On 14 Sep 2010 at 11:51, Roland McGrath wrote:
> > no it doesn't have to, similarly to how it doesn't have to hardcode
> > _SC_PAGESIZE either, AT_PAGESZ tells userland what it needs to know
> > and i think AT_ARGMAX could exist just as well.
>
> I was referring to the ways available to userland heretofore. Certainly,
> the kernel could add new ways and then userland could do different things
> (with new kernels).
>
> auxv in particular is not a mechanism that could fit for this. The actual
> limit depends on rlimits of the calling process, and rlimits can change
> during the life of the program.
obviously an AT_ARGMAX computed at execve time would be based on the rlimits
as well and if later userland changed the rlimits, it'd be userland's problem,
not that of the kernel (or the kernel could refuse a change that would violate
its earlier promise).
> auxv is only appropriate for things that
> are known at the time of the exec and won't change thereafter.
you mean stuff like AT_EUID et al.? ;)
^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: [PATCH 1/3] setup_arg_pages: diagnose excessive argument size
@ 2010-09-14 21:16 ` Roland McGrath
0 siblings, 0 replies; 81+ messages in thread
From: Roland McGrath @ 2010-09-14 21:16 UTC (permalink / raw)
To: pageexec
Cc: KOSAKI Motohiro, Brad Spengler, Linus Torvalds, Andrew Morton,
linux-kernel, oss-security, Solar Designer, Kees Cook, Al Viro,
Oleg Nesterov, Neil Horman, linux-fsdevel, Eugene Teo
> obviously an AT_ARGMAX computed at execve time would be based on the rlimits
> as well and if later userland changed the rlimits, it'd be userland's problem,
> not that of the kernel (or the kernel could refuse a change that would violate
> its earlier promise).
This would thoroughly defeat the purpose of adding the thing. The
only reason to have a new thing is so that userland does not have to
mirror the kernel's policy (as it attempts to do now, with the 1/4
calculation). If the new thing is not something that userland can
use consistently so as not to have to know what the kernel's actual
policy is, then I don't see the point of it at all.
> > auxv is only appropriate for things that
> > are known at the time of the exec and won't change thereafter.
>
> you mean stuff like AT_EUID et al.? ;)
The information that these give is about the conditions at startup.
That's what they mean to userland, and userland only uses them to know
the situation before it has made any calls. The definition of AT_EUID
is "effective user ID at program startup", and that fact does not
change. You proposed AT_ARGMAX for a purpose that requires knowing
the current information in the process at the time it might attempt an
execve call, not at startup. It is not an equivalent case.
Thanks,
Roland
^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: [PATCH 1/3] setup_arg_pages: diagnose excessive argument size
@ 2010-09-14 21:16 ` Roland McGrath
0 siblings, 0 replies; 81+ messages in thread
From: Roland McGrath @ 2010-09-14 21:16 UTC (permalink / raw)
To: pageexec-Y8qEzhMunLyT9ig0jae3mg
Cc: KOSAKI Motohiro, Brad Spengler, Linus Torvalds, Andrew Morton,
linux-kernel-u79uwXL29TY76Z2rM5mHXA,
oss-security-ZwoEplunGu1jrUoiu81ncdBPR1lH4CV8, Solar Designer,
Kees Cook, Al Viro, Oleg Nesterov, Neil Horman,
linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Eugene Teo
> obviously an AT_ARGMAX computed at execve time would be based on the rlimits
> as well and if later userland changed the rlimits, it'd be userland's problem,
> not that of the kernel (or the kernel could refuse a change that would violate
> its earlier promise).
This would thoroughly defeat the purpose of adding the thing. The
only reason to have a new thing is so that userland does not have to
mirror the kernel's policy (as it attempts to do now, with the 1/4
calculation). If the new thing is not something that userland can
use consistently so as not to have to know what the kernel's actual
policy is, then I don't see the point of it at all.
> > auxv is only appropriate for things that
> > are known at the time of the exec and won't change thereafter.
>
> you mean stuff like AT_EUID et al.? ;)
The information that these give is about the conditions at startup.
That's what they mean to userland, and userland only uses them to know
the situation before it has made any calls. The definition of AT_EUID
is "effective user ID at program startup", and that fact does not
change. You proposed AT_ARGMAX for a purpose that requires knowing
the current information in the process at the time it might attempt an
execve call, not at startup. It is not an equivalent case.
Thanks,
Roland
^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: [PATCH 1/3] setup_arg_pages: diagnose excessive argument size
@ 2010-09-14 22:27 ` pageexec-Y8qEzhMunLyT9ig0jae3mg
0 siblings, 0 replies; 81+ messages in thread
From: pageexec @ 2010-09-14 22:27 UTC (permalink / raw)
To: Roland McGrath
Cc: KOSAKI Motohiro, Brad Spengler, Linus Torvalds, Andrew Morton,
linux-kernel, oss-security, Solar Designer, Kees Cook, Al Viro,
Oleg Nesterov, Neil Horman, linux-fsdevel, Eugene Teo
On 14 Sep 2010 at 14:16, Roland McGrath wrote:
> > obviously an AT_ARGMAX computed at execve time would be based on the rlimits
> > as well and if later userland changed the rlimits, it'd be userland's problem,
> > not that of the kernel (or the kernel could refuse a change that would violate
> > its earlier promise).
>
> This would thoroughly defeat the purpose of adding the thing. The
> only reason to have a new thing is so that userland does not have to
> mirror the kernel's policy (as it attempts to do now, with the 1/4
> calculation). If the new thing is not something that userland can
> use consistently so as not to have to know what the kernel's actual
> policy is, then I don't see the point of it at all.
userland could never rely on the kernel's policy at all since get_arg_page
could have failed for more reasons than overstepping the currently hardcoded
ARG_MAX check in there. so what AT_ARGMAX would buy us is to allow the kernel
policy to change over time, but it's never been about guarantees, whether
POSIX wants such a thing or not.
> > > auxv is only appropriate for things that
> > > are known at the time of the exec and won't change thereafter.
> >
> > you mean stuff like AT_EUID et al.? ;)
>
> The information that these give is about the conditions at startup.
> That's what they mean to userland, and userland only uses them to know
> the situation before it has made any calls. The definition of AT_EUID
> is "effective user ID at program startup", and that fact does not
> change.
just for my own curiosity, where does this definition come from?
^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: [PATCH 1/3] setup_arg_pages: diagnose excessive argument size
@ 2010-09-14 22:27 ` pageexec-Y8qEzhMunLyT9ig0jae3mg
0 siblings, 0 replies; 81+ messages in thread
From: pageexec-Y8qEzhMunLyT9ig0jae3mg @ 2010-09-14 22:27 UTC (permalink / raw)
To: Roland McGrath
Cc: KOSAKI Motohiro, Brad Spengler, Linus Torvalds, Andrew Morton,
linux-kernel-u79uwXL29TY76Z2rM5mHXA,
oss-security-ZwoEplunGu1jrUoiu81ncdBPR1lH4CV8, Solar Designer,
Kees Cook, Al Viro, Oleg Nesterov, Neil Horman,
linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Eugene Teo
On 14 Sep 2010 at 14:16, Roland McGrath wrote:
> > obviously an AT_ARGMAX computed at execve time would be based on the rlimits
> > as well and if later userland changed the rlimits, it'd be userland's problem,
> > not that of the kernel (or the kernel could refuse a change that would violate
> > its earlier promise).
>
> This would thoroughly defeat the purpose of adding the thing. The
> only reason to have a new thing is so that userland does not have to
> mirror the kernel's policy (as it attempts to do now, with the 1/4
> calculation). If the new thing is not something that userland can
> use consistently so as not to have to know what the kernel's actual
> policy is, then I don't see the point of it at all.
userland could never rely on the kernel's policy at all since get_arg_page
could have failed for more reasons than overstepping the currently hardcoded
ARG_MAX check in there. so what AT_ARGMAX would buy us is to allow the kernel
policy to change over time, but it's never been about guarantees, whether
POSIX wants such a thing or not.
> > > auxv is only appropriate for things that
> > > are known at the time of the exec and won't change thereafter.
> >
> > you mean stuff like AT_EUID et al.? ;)
>
> The information that these give is about the conditions at startup.
> That's what they mean to userland, and userland only uses them to know
> the situation before it has made any calls. The definition of AT_EUID
> is "effective user ID at program startup", and that fact does not
> change.
just for my own curiosity, where does this definition come from?
^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: [PATCH 1/3] setup_arg_pages: diagnose excessive argument size
@ 2010-09-14 23:04 ` Roland McGrath
0 siblings, 0 replies; 81+ messages in thread
From: Roland McGrath @ 2010-09-14 23:04 UTC (permalink / raw)
To: pageexec
Cc: KOSAKI Motohiro, Brad Spengler, Linus Torvalds, Andrew Morton,
linux-kernel, oss-security, Solar Designer, Kees Cook, Al Viro,
Oleg Nesterov, Neil Horman, linux-fsdevel, Eugene Teo
> userland could never rely on the kernel's policy at all since get_arg_page
> could have failed for more reasons than overstepping the currently hardcoded
> ARG_MAX check in there.
I don't see how it could fail except for OOM cases where get_user_pages()
failed rather than blocking. Is that what you mean?
> so what AT_ARGMAX would buy us is to allow the kernel
> policy to change over time, but it's never been about guarantees, whether
> POSIX wants such a thing or not.
I understand the motivation for an explicit mechanism for the kernel to
tell userland its limit. Since the kernel policy today depends on
something that can change between execs, AT_ARGMAX is inadequate for
that purpose for today's policy, let alone any future different policy.
> > The information that these give is about the conditions at startup.
> > That's what they mean to userland, and userland only uses them to know
> > the situation before it has made any calls. The definition of AT_EUID
> > is "effective user ID at program startup", and that fact does not
> > change.
>
> just for my own curiosity, where does this definition come from?
You mean documentation? I'm not really sure if there is any for that.
But it's the inherent definition of auxv that all its information can
only be about the conditions at program startup.
Thanks,
Roland
^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: [PATCH 1/3] setup_arg_pages: diagnose excessive argument size
@ 2010-09-14 23:04 ` Roland McGrath
0 siblings, 0 replies; 81+ messages in thread
From: Roland McGrath @ 2010-09-14 23:04 UTC (permalink / raw)
To: pageexec-Y8qEzhMunLyT9ig0jae3mg
Cc: KOSAKI Motohiro, Brad Spengler, Linus Torvalds, Andrew Morton,
linux-kernel-u79uwXL29TY76Z2rM5mHXA,
oss-security-ZwoEplunGu1jrUoiu81ncdBPR1lH4CV8, Solar Designer,
Kees Cook, Al Viro, Oleg Nesterov, Neil Horman,
linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Eugene Teo
> userland could never rely on the kernel's policy at all since get_arg_page
> could have failed for more reasons than overstepping the currently hardcoded
> ARG_MAX check in there.
I don't see how it could fail except for OOM cases where get_user_pages()
failed rather than blocking. Is that what you mean?
> so what AT_ARGMAX would buy us is to allow the kernel
> policy to change over time, but it's never been about guarantees, whether
> POSIX wants such a thing or not.
I understand the motivation for an explicit mechanism for the kernel to
tell userland its limit. Since the kernel policy today depends on
something that can change between execs, AT_ARGMAX is inadequate for
that purpose for today's policy, let alone any future different policy.
> > The information that these give is about the conditions at startup.
> > That's what they mean to userland, and userland only uses them to know
> > the situation before it has made any calls. The definition of AT_EUID
> > is "effective user ID at program startup", and that fact does not
> > change.
>
> just for my own curiosity, where does this definition come from?
You mean documentation? I'm not really sure if there is any for that.
But it's the inherent definition of auxv that all its information can
only be about the conditions at program startup.
Thanks,
Roland
^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: [PATCH 1/3] setup_arg_pages: diagnose excessive argument size
@ 2010-09-15 9:27 ` pageexec-Y8qEzhMunLyT9ig0jae3mg
0 siblings, 0 replies; 81+ messages in thread
From: pageexec @ 2010-09-15 9:27 UTC (permalink / raw)
To: Roland McGrath
Cc: KOSAKI Motohiro, Brad Spengler, Linus Torvalds, Andrew Morton,
linux-kernel, oss-security, Solar Designer, Kees Cook, Al Viro,
Oleg Nesterov, Neil Horman, linux-fsdevel, Eugene Teo
On 14 Sep 2010 at 16:04, Roland McGrath wrote:
> > userland could never rely on the kernel's policy at all since get_arg_page
> > could have failed for more reasons than overstepping the currently hardcoded
> > ARG_MAX check in there.
>
> I don't see how it could fail except for OOM cases where get_user_pages()
> failed rather than blocking. Is that what you mean?
yes but it's not only OOM (ENOMEM from some allocation), but it can be also
EPERM from LSM (if mmap_min_addr is set too high) or EFAULT from get_user_pages
(e.g., if VM_FAULT_HWPOISON was returned for a requested page).
^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: [PATCH 1/3] setup_arg_pages: diagnose excessive argument size
@ 2010-09-15 9:27 ` pageexec-Y8qEzhMunLyT9ig0jae3mg
0 siblings, 0 replies; 81+ messages in thread
From: pageexec-Y8qEzhMunLyT9ig0jae3mg @ 2010-09-15 9:27 UTC (permalink / raw)
To: Roland McGrath
Cc: KOSAKI Motohiro, Brad Spengler, Linus Torvalds, Andrew Morton,
linux-kernel-u79uwXL29TY76Z2rM5mHXA,
oss-security-ZwoEplunGu1jrUoiu81ncdBPR1lH4CV8, Solar Designer,
Kees Cook, Al Viro, Oleg Nesterov, Neil Horman,
linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Eugene Teo
On 14 Sep 2010 at 16:04, Roland McGrath wrote:
> > userland could never rely on the kernel's policy at all since get_arg_page
> > could have failed for more reasons than overstepping the currently hardcoded
> > ARG_MAX check in there.
>
> I don't see how it could fail except for OOM cases where get_user_pages()
> failed rather than blocking. Is that what you mean?
yes but it's not only OOM (ENOMEM from some allocation), but it can be also
EPERM from LSM (if mmap_min_addr is set too high) or EFAULT from get_user_pages
(e.g., if VM_FAULT_HWPOISON was returned for a requested page).
^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: [PATCH 1/3] setup_arg_pages: diagnose excessive argument size
@ 2010-09-10 9:18 ` Roland McGrath
0 siblings, 0 replies; 81+ messages in thread
From: Roland McGrath @ 2010-09-10 9:18 UTC (permalink / raw)
To: Brad Spengler
Cc: Linus Torvalds, Andrew Morton, linux-kernel, oss-security,
Solar Designer, Kees Cook, Al Viro, Oleg Nesterov,
KOSAKI Motohiro, Neil Horman, linux-fsdevel, pageexec,
Brad Spengler <spender@grsecurity.net> Eugene Teo
> I still don't think this addresses the whole problem.
You're responding to one of three patches, and I said to begin with that
all together they only address part of the problem (not the OOM part).
So the import of this remark is somewhat mysterious.
> Without question, the rlimit / 4 check is bogus.
I question that assertion. For a non-RLIM_INFINITY limit, there is nothing
in particular wrong with it. The kernel is free to pick its upper bound
for ARG_MAX by whatever method. I don't see anything much to object to
about the rlimit/4 method. It has no useful effect for RLIM_INFINITY and
IMHO should not try to impose any limit in that case. But that's the only
thing I see a reason to change.
> If nobody agrees with the intent of that check, then it should be
> removed, but I think the better solution is to fix the check so that it
> matches its original intent: let the initial stack setup be up to 1/Xth
> of the min(rlimit, TASK_SIZE dependent upon personality), which allows
> space for additional stack setup in the ELF loader and then further
> growth once the process is live.
I see no reason to suspect this was the "original intent". It seems most
likely to me that the original intent was 1/4th the RLIMIT_STACK size, and
just nobody thought about what that meant when RLIMIT_STACK was RLIM_INFINITY.
> If that amount is overstepped, then the exec will return an error to the
> calling process instead of being terminated.
That's what happens now when RLIMIT_STACK is smaller, and that's what
people really care about. What you suggest would require some more
significant changes to the exec code path, touching all the binfmt modules
(though probably only binfmt_elf matters).
In the current structure of the code, the arch-dependent SET_PERSONALITY
macro used by {compat_,}binfmt_elf is the only place that knows what arch
bits to set for the new address space size. This is itself destructive,
but also runs after flush_old_exec (the point of no return). So you'd have
to reorganize things significantly, or add an entirely new arch macro tied
into struct binfmt somehow, or something like that.
> It might be useful to consult with the people who introduced/approved
> the check in the first place, as they seemed to have reasons for
> implementing it.
This was done in commit b6a2fea by Ollie Wild <aaw@google.com>:
mm: variable length argument support
It was part of going from a fixed maximum to no fixed maximum.
The log includes:
[a.p.zijlstra@chello.nl: limit stack size]
So perhaps it was Peter who devised the rlimit/4 idea.
Thanks,
Roland
^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: [PATCH 1/3] setup_arg_pages: diagnose excessive argument size
@ 2010-09-10 9:18 ` Roland McGrath
0 siblings, 0 replies; 81+ messages in thread
From: Roland McGrath @ 2010-09-10 9:18 UTC (permalink / raw)
To: Brad Spengler
Cc: Linus Torvalds, Andrew Morton,
linux-kernel-u79uwXL29TY76Z2rM5mHXA,
oss-security-ZwoEplunGu1jrUoiu81ncdBPR1lH4CV8, Solar Designer,
Kees Cook, Al Viro, Oleg Nesterov, KOSAKI Motohiro, Neil Horman,
linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
pageexec-Y8qEzhMunLyT9ig0jae3mg,
Brad Spengler
<spender-JNS0hek0TMl4qEwOxq4T+Q@public.gmane.org> Eugene
Teo
> I still don't think this addresses the whole problem.
You're responding to one of three patches, and I said to begin with that
all together they only address part of the problem (not the OOM part).
So the import of this remark is somewhat mysterious.
> Without question, the rlimit / 4 check is bogus.
I question that assertion. For a non-RLIM_INFINITY limit, there is nothing
in particular wrong with it. The kernel is free to pick its upper bound
for ARG_MAX by whatever method. I don't see anything much to object to
about the rlimit/4 method. It has no useful effect for RLIM_INFINITY and
IMHO should not try to impose any limit in that case. But that's the only
thing I see a reason to change.
> If nobody agrees with the intent of that check, then it should be
> removed, but I think the better solution is to fix the check so that it
> matches its original intent: let the initial stack setup be up to 1/Xth
> of the min(rlimit, TASK_SIZE dependent upon personality), which allows
> space for additional stack setup in the ELF loader and then further
> growth once the process is live.
I see no reason to suspect this was the "original intent". It seems most
likely to me that the original intent was 1/4th the RLIMIT_STACK size, and
just nobody thought about what that meant when RLIMIT_STACK was RLIM_INFINITY.
> If that amount is overstepped, then the exec will return an error to the
> calling process instead of being terminated.
That's what happens now when RLIMIT_STACK is smaller, and that's what
people really care about. What you suggest would require some more
significant changes to the exec code path, touching all the binfmt modules
(though probably only binfmt_elf matters).
In the current structure of the code, the arch-dependent SET_PERSONALITY
macro used by {compat_,}binfmt_elf is the only place that knows what arch
bits to set for the new address space size. This is itself destructive,
but also runs after flush_old_exec (the point of no return). So you'd have
to reorganize things significantly, or add an entirely new arch macro tied
into struct binfmt somehow, or something like that.
> It might be useful to consult with the people who introduced/approved
> the check in the first place, as they seemed to have reasons for
> implementing it.
This was done in commit b6a2fea by Ollie Wild <aaw-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>:
mm: variable length argument support
It was part of going from a fixed maximum to no fixed maximum.
The log includes:
[a.p.zijlstra-/NLkJaSkS4VmR6Xm/wNWPw@public.gmane.org: limit stack size]
So perhaps it was Peter who devised the rlimit/4 idea.
Thanks,
Roland
^ permalink raw reply [flat|nested] 81+ messages in thread
* [PATCH 2/3] execve: improve interactivity with large arguments
2010-09-08 2:34 ` Roland McGrath
@ 2010-09-08 2:36 ` Roland McGrath
-1 siblings, 0 replies; 81+ messages in thread
From: Roland McGrath @ 2010-09-08 2:36 UTC (permalink / raw)
To: Linus Torvalds, Andrew Morton
Cc: linux-kernel, oss-security, Solar Designer, Kees Cook, Al Viro,
Andrew Morton, Oleg Nesterov, KOSAKI Motohiro, Neil Horman,
linux-fsdevel, pageexec,
Brad Spengler <spender@grsecurity.net> Eugene Teo
This adds a preemption point during the copying of the argument and
environment strings for execve, in copy_strings(). There is already
a preemption point in the count() loop, so this doesn't add any new
points in the abstract sense.
When the total argument+environment strings are very large, the time
spent copying them can be much more than a normal user time slice.
So this change improves the interactivity of the rest of the system
when one process is doing an execve with very large arguments.
Signed-off-by: Roland McGrath <roland@redhat.com>
---
fs/exec.c | 2 ++
1 files changed, 2 insertions(+), 0 deletions(-)
diff --git a/fs/exec.c b/fs/exec.c
index 1b63237..6f2d777 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -419,6 +419,8 @@ static int copy_strings(int argc, const char __user *const __user *argv,
while (len > 0) {
int offset, bytes_to_copy;
+ cond_resched();
+
offset = pos % PAGE_SIZE;
if (offset == 0)
offset = PAGE_SIZE;
^ permalink raw reply related [flat|nested] 81+ messages in thread
* [PATCH 2/3] execve: improve interactivity with large arguments
@ 2010-09-08 2:36 ` Roland McGrath
0 siblings, 0 replies; 81+ messages in thread
From: Roland McGrath @ 2010-09-08 2:36 UTC (permalink / raw)
To: Linus Torvalds, Andrew Morton
Cc: linux-kernel, oss-security, Solar Designer, Kees Cook, Al Viro,
Andrew Morton, Oleg Nesterov, KOSAKI Motohiro, Neil Horman,
linux-fsdevel, pageexec,
Brad Spengler <spender@grsecurity.net> Eugene Teo
This adds a preemption point during the copying of the argument and
environment strings for execve, in copy_strings(). There is already
a preemption point in the count() loop, so this doesn't add any new
points in the abstract sense.
When the total argument+environment strings are very large, the time
spent copying them can be much more than a normal user time slice.
So this change improves the interactivity of the rest of the system
when one process is doing an execve with very large arguments.
Signed-off-by: Roland McGrath <roland@redhat.com>
---
fs/exec.c | 2 ++
1 files changed, 2 insertions(+), 0 deletions(-)
diff --git a/fs/exec.c b/fs/exec.c
index 1b63237..6f2d777 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -419,6 +419,8 @@ static int copy_strings(int argc, const char __user *const __user *argv,
while (len > 0) {
int offset, bytes_to_copy;
+ cond_resched();
+
offset = pos % PAGE_SIZE;
if (offset == 0)
offset = PAGE_SIZE;
^ permalink raw reply related [flat|nested] 81+ messages in thread
* [PATCH 3/3] execve: make responsive to SIGKILL with large arguments
2010-09-08 2:34 ` Roland McGrath
@ 2010-09-08 2:37 ` Roland McGrath
-1 siblings, 0 replies; 81+ messages in thread
From: Roland McGrath @ 2010-09-08 2:37 UTC (permalink / raw)
To: Linus Torvalds, Andrew Morton
Cc: linux-kernel, oss-security, Solar Designer, Kees Cook, Al Viro,
Andrew Morton, Oleg Nesterov, KOSAKI Motohiro, Neil Horman,
linux-fsdevel, pageexec,
Brad Spengler <spender@grsecurity.net> Eugene Teo
An execve with a very large total of argument/environment strings
can take a really long time in the execve system call. It runs
uninterruptibly to count and copy all the strings. This change
makes it abort the exec quickly if sent a SIGKILL.
Note that this is the conservative change, to interrupt only for
SIGKILL, by using fatal_signal_pending(). It would be perfectly
correct semantics to let any signal interrupt the string-copying in
execve, i.e. use signal_pending() instead of fatal_signal_pending().
We'll save that change for later, since it could have user-visible
consequences, such as having a timer set too quickly make it so that
an execve can never complete, though it always happened to work before.
Signed-off-by: Roland McGrath <roland@redhat.com>
---
fs/exec.c | 7 +++++++
1 files changed, 7 insertions(+), 0 deletions(-)
diff --git a/fs/exec.c b/fs/exec.c
index 6f2d777..828dd24 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -376,6 +376,9 @@ static int count(const char __user * const __user * argv, int max)
argv++;
if (i++ >= max)
return -E2BIG;
+
+ if (fatal_signal_pending(current))
+ return -ERESTARTNOHAND;
cond_resched();
}
}
@@ -419,6 +422,10 @@ static int copy_strings(int argc, const char __user *const __user *argv,
while (len > 0) {
int offset, bytes_to_copy;
+ if (fatal_signal_pending(current)) {
+ ret = -ERESTARTNOHAND;
+ goto out;
+ }
cond_resched();
offset = pos % PAGE_SIZE;
^ permalink raw reply related [flat|nested] 81+ messages in thread
* [PATCH 3/3] execve: make responsive to SIGKILL with large arguments
@ 2010-09-08 2:37 ` Roland McGrath
0 siblings, 0 replies; 81+ messages in thread
From: Roland McGrath @ 2010-09-08 2:37 UTC (permalink / raw)
To: Linus Torvalds, Andrew Morton
Cc: linux-kernel, oss-security, Solar Designer, Kees Cook, Al Viro,
Andrew Morton, Oleg Nesterov, KOSAKI Motohiro, Neil Horman,
linux-fsdevel, pageexec,
Brad Spengler <spender@grsecurity.net> Eugene Teo
An execve with a very large total of argument/environment strings
can take a really long time in the execve system call. It runs
uninterruptibly to count and copy all the strings. This change
makes it abort the exec quickly if sent a SIGKILL.
Note that this is the conservative change, to interrupt only for
SIGKILL, by using fatal_signal_pending(). It would be perfectly
correct semantics to let any signal interrupt the string-copying in
execve, i.e. use signal_pending() instead of fatal_signal_pending().
We'll save that change for later, since it could have user-visible
consequences, such as having a timer set too quickly make it so that
an execve can never complete, though it always happened to work before.
Signed-off-by: Roland McGrath <roland@redhat.com>
---
fs/exec.c | 7 +++++++
1 files changed, 7 insertions(+), 0 deletions(-)
diff --git a/fs/exec.c b/fs/exec.c
index 6f2d777..828dd24 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -376,6 +376,9 @@ static int count(const char __user * const __user * argv, int max)
argv++;
if (i++ >= max)
return -E2BIG;
+
+ if (fatal_signal_pending(current))
+ return -ERESTARTNOHAND;
cond_resched();
}
}
@@ -419,6 +422,10 @@ static int copy_strings(int argc, const char __user *const __user *argv,
while (len > 0) {
int offset, bytes_to_copy;
+ if (fatal_signal_pending(current)) {
+ ret = -ERESTARTNOHAND;
+ goto out;
+ }
cond_resched();
offset = pos % PAGE_SIZE;
^ permalink raw reply related [flat|nested] 81+ messages in thread
* Re: [PATCH 0/3] execve argument-copying fixes
2010-09-08 2:34 ` Roland McGrath
@ 2010-09-08 3:00 ` KOSAKI Motohiro
-1 siblings, 0 replies; 81+ messages in thread
From: KOSAKI Motohiro @ 2010-09-08 3:00 UTC (permalink / raw)
To: Roland McGrath
Cc: kosaki.motohiro, Linus Torvalds, Andrew Morton, linux-kernel,
oss-security, Solar Designer, Kees Cook, Al Viro, Oleg Nesterov,
Neil Horman, linux-fsdevel, pageexec,
Brad Spengler <spender@grsecurity.net> Eugene Teo
> This is my take on parts of the execve large arguments copying issues
> that Kees posted about, and Brad and others have been discussing.
> I've only looked at the narrow area of the argument copying code
> itself. I think these are good and necessary fixes. But I'm not
> addressing the whole OOM killer/mm accounting issue, which also needs
> to be fixed (and I have the impression others are already looking into that).
>
> The following changes since commit d56557af19867edb8c0e96f8e26399698a08857f:
>
> Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jbarnes/pci-2.6 (2010-09-07 16:00:17 -0700)
>
> are available in the git repository at:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/frob/linux-2.6-roland.git topic/exec-fixes
>
> Roland McGrath (3):
> setup_arg_pages: diagnose excessive argument size
> execve: improve interactivity with large arguments
> execve: make responsive to SIGKILL with large arguments
>
> fs/exec.c | 14 ++++++++++++++
> 1 files changed, 14 insertions(+), 0 deletions(-)
All of changes looks nice to me :)
Thanks.
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: [PATCH 0/3] execve argument-copying fixes
@ 2010-09-08 3:00 ` KOSAKI Motohiro
0 siblings, 0 replies; 81+ messages in thread
From: KOSAKI Motohiro @ 2010-09-08 3:00 UTC (permalink / raw)
To: Roland McGrath
Cc: kosaki.motohiro, Linus Torvalds, Andrew Morton, linux-kernel,
oss-security, Solar Designer, Kees Cook, Al Viro, Oleg Nesterov,
Neil Horman, linux-fsdevel, pageexec,
Brad Spengler <spender@grsecurity.net> Eugene Teo
> This is my take on parts of the execve large arguments copying issues
> that Kees posted about, and Brad and others have been discussing.
> I've only looked at the narrow area of the argument copying code
> itself. I think these are good and necessary fixes. But I'm not
> addressing the whole OOM killer/mm accounting issue, which also needs
> to be fixed (and I have the impression others are already looking into that).
>
> The following changes since commit d56557af19867edb8c0e96f8e26399698a08857f:
>
> Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jbarnes/pci-2.6 (2010-09-07 16:00:17 -0700)
>
> are available in the git repository at:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/frob/linux-2.6-roland.git topic/exec-fixes
>
> Roland McGrath (3):
> setup_arg_pages: diagnose excessive argument size
> execve: improve interactivity with large arguments
> execve: make responsive to SIGKILL with large arguments
>
> fs/exec.c | 14 ++++++++++++++
> 1 files changed, 14 insertions(+), 0 deletions(-)
All of changes looks nice to me :)
Thanks.
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
^ permalink raw reply [flat|nested] 81+ messages in thread
* [PATCH 0/2] execve memory exhaust of argument-copying fixes
2010-09-08 2:34 ` Roland McGrath
@ 2010-09-09 5:01 ` KOSAKI Motohiro
-1 siblings, 0 replies; 81+ messages in thread
From: KOSAKI Motohiro @ 2010-09-09 5:01 UTC (permalink / raw)
To: Roland McGrath
Cc: kosaki.motohiro, Linus Torvalds, Andrew Morton, linux-kernel,
oss-security, Solar Designer, Kees Cook, Al Viro, Oleg Nesterov,
Neil Horman, linux-fsdevel, pageexec,
Brad Spengler <spender@grsecurity.net>,
Eugene Teo, KAMEZAWA Hiroyuki
> This is my take on parts of the execve large arguments copying issues
> that Kees posted about, and Brad and others have been discussing.
> I've only looked at the narrow area of the argument copying code
> itself. I think these are good and necessary fixes. But I'm not
> addressing the whole OOM killer/mm accounting issue, which also needs
> to be fixed (and I have the impression others are already looking into that).
Now, we have two OOM-Killer/mm acounting problem.
1) OOM-killer doesn't track nascent mm and It may kill innocent task
2) When execve argument-copying, our __vm_enough_memory() doesn't
protect any wrong plenty argument. then, execve() invoke OOM instead
return failure value when larger argument than system memory.
The patch series addressed this two issue.
^ permalink raw reply [flat|nested] 81+ messages in thread
* [PATCH 0/2] execve memory exhaust of argument-copying fixes
@ 2010-09-09 5:01 ` KOSAKI Motohiro
0 siblings, 0 replies; 81+ messages in thread
From: KOSAKI Motohiro @ 2010-09-09 5:01 UTC (permalink / raw)
To: Roland McGrath
Cc: kosaki.motohiro, Linus Torvalds, Andrew Morton, linux-kernel,
oss-security, Solar Designer, Kees Cook, Al Viro, Oleg Nesterov,
Neil Horman, linux-fsdevel, pageexec,
Brad Spengler <spender@grsecurity.net>,
Eugene Teo, KAMEZAWA Hiroyuki
> This is my take on parts of the execve large arguments copying issues
> that Kees posted about, and Brad and others have been discussing.
> I've only looked at the narrow area of the argument copying code
> itself. I think these are good and necessary fixes. But I'm not
> addressing the whole OOM killer/mm accounting issue, which also needs
> to be fixed (and I have the impression others are already looking into that).
Now, we have two OOM-Killer/mm acounting problem.
1) OOM-killer doesn't track nascent mm and It may kill innocent task
2) When execve argument-copying, our __vm_enough_memory() doesn't
protect any wrong plenty argument. then, execve() invoke OOM instead
return failure value when larger argument than system memory.
The patch series addressed this two issue.
^ permalink raw reply [flat|nested] 81+ messages in thread
* [PATCH 1/2] oom: don't ignore rss in nascent mm
2010-09-09 5:01 ` KOSAKI Motohiro
@ 2010-09-09 5:03 ` KOSAKI Motohiro
-1 siblings, 0 replies; 81+ messages in thread
From: KOSAKI Motohiro @ 2010-09-09 5:03 UTC (permalink / raw)
To: KOSAKI Motohiro
Cc: kosaki.motohiro, Roland McGrath, Linus Torvalds, Andrew Morton,
linux-kernel, oss-security, Solar Designer, Kees Cook, Al Viro,
Oleg Nesterov, Neil Horman, linux-fsdevel, pageexec,
Brad Spengler <spender@grsecurity.net>,
Eugene Teo, KAMEZAWA Hiroyuki
This patch was made on top "oom: remove totalpage normalization from oom_badness()" patch.
===============================
Execve() makes new mm struct and setup stack and push argv vector,
Unfortunately this nascent mm is not pointed any tasks, then
OOM-killer can't detect this memory usage. therefore OOM-killer
may kill incorrect task.
Thus, this patch added task->in_exec_mm member and track
nascent mm usage.
Cc: pageexec@freemail.hu
Cc: Roland McGrath <roland@redhat.com>
Cc: Solar Designer <solar@openwall.com>
Cc: Brad Spengler <spender@grsecurity.net>
Cc: Eugene Teo <eteo@redhat.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
fs/compat.c | 4 +++-
fs/exec.c | 14 +++++++++++++-
include/linux/binfmts.h | 1 +
include/linux/sched.h | 1 +
mm/oom_kill.c | 37 +++++++++++++++++++++++++++++--------
5 files changed, 47 insertions(+), 10 deletions(-)
diff --git a/fs/compat.c b/fs/compat.c
index 718c706..b631120 100644
--- a/fs/compat.c
+++ b/fs/compat.c
@@ -1567,8 +1567,10 @@ int compat_do_execve(char * filename,
return retval;
out:
- if (bprm->mm)
+ if (bprm->mm) {
+ set_exec_mm(NULL);
mmput(bprm->mm);
+ }
out_file:
if (bprm->file) {
diff --git a/fs/exec.c b/fs/exec.c
index 2d94552..b41834c 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -347,6 +347,8 @@ int bprm_mm_init(struct linux_binprm *bprm)
if (err)
goto err;
+ set_exec_mm(bprm->mm);
+
return 0;
err:
@@ -983,6 +985,7 @@ int flush_old_exec(struct linux_binprm * bprm)
goto out;
bprm->mm = NULL; /* We're using it now */
+ set_exec_mm(NULL);
current->flags &= ~PF_RANDOMIZE;
flush_thread();
@@ -1314,6 +1317,13 @@ int search_binary_handler(struct linux_binprm *bprm,struct pt_regs *regs)
EXPORT_SYMBOL(search_binary_handler);
+void set_exec_mm(struct mm_struct *mm)
+{
+ task_lock(current);
+ current->in_exec_mm = mm;
+ task_unlock(current);
+}
+
/*
* sys_execve() executes a new program.
*/
@@ -1402,8 +1412,10 @@ int do_execve(const char * filename,
return retval;
out:
- if (bprm->mm)
+ if (bprm->mm) {
+ set_exec_mm(NULL);
mmput (bprm->mm);
+ }
out_file:
if (bprm->file) {
diff --git a/include/linux/binfmts.h b/include/linux/binfmts.h
index a065612..2fde1ba 100644
--- a/include/linux/binfmts.h
+++ b/include/linux/binfmts.h
@@ -133,6 +133,7 @@ extern void install_exec_creds(struct linux_binprm *bprm);
extern void do_coredump(long signr, int exit_code, struct pt_regs *regs);
extern void set_binfmt(struct linux_binfmt *new);
extern void free_bprm(struct linux_binprm *);
+extern void set_exec_mm(struct mm_struct *mm);
#endif /* __KERNEL__ */
#endif /* _LINUX_BINFMTS_H */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 5e61d60..bb5bf3d 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1226,6 +1226,7 @@ struct task_struct {
int pdeath_signal; /* The signal sent when the parent dies */
/* ??? */
unsigned int personality;
+ struct mm_struct *in_exec_mm;
unsigned did_exec:1;
unsigned in_execve:1; /* Tell the LSMs that the process is doing an
* execve */
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index c1beda0..7d38435 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -120,6 +120,33 @@ struct task_struct *find_lock_task_mm(struct task_struct *p)
return NULL;
}
+/*
+ * The baseline for the badness score is the proportion of RAM that each
+ * task's rss and swap space use.
+ */
+static unsigned long oom_rss_swap_usage(struct task_struct *p)
+{
+ struct task_struct *t = p;
+ int mm_accounted = 0;
+ unsigned long points = 0;
+
+ do {
+ task_lock(t);
+ if (!mm_accounted && t->mm) {
+ points += get_mm_rss(t->mm);
+ points += get_mm_counter(t->mm, MM_SWAPENTS);
+ mm_accounted = 1;
+ }
+ if (t->in_exec_mm) {
+ points += get_mm_rss(t->in_exec_mm);
+ points += get_mm_counter(t->in_exec_mm, MM_SWAPENTS);
+ }
+ task_unlock(t);
+ } while_each_thread(p, t);
+
+ return points;
+}
+
/* return true if the task is not adequate as candidate victim task. */
static bool oom_unkillable_task(struct task_struct *p, struct mem_cgroup *mem,
const nodemask_t *nodemask)
@@ -169,16 +196,10 @@ unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *mem,
if (p->flags & PF_OOM_ORIGIN)
return ULONG_MAX;
- p = find_lock_task_mm(p);
- if (!p)
+ points = oom_rss_swap_usage(p);
+ if (!points)
return 0;
- /*
- * The baseline for the badness score is the proportion of RAM that each
- * task's rss and swap space use.
- */
- points = (get_mm_rss(p->mm) + get_mm_counter(p->mm, MM_SWAPENTS));
- task_unlock(p);
/*
* Root processes get 3% bonus, just like the __vm_enough_memory()
--
1.6.5.2
^ permalink raw reply related [flat|nested] 81+ messages in thread
* [PATCH 1/2] oom: don't ignore rss in nascent mm
@ 2010-09-09 5:03 ` KOSAKI Motohiro
0 siblings, 0 replies; 81+ messages in thread
From: KOSAKI Motohiro @ 2010-09-09 5:03 UTC (permalink / raw)
To: KOSAKI Motohiro
Cc: kosaki.motohiro, Roland McGrath, Linus Torvalds, Andrew Morton,
linux-kernel, oss-security, Solar Designer, Kees Cook, Al Viro,
Oleg Nesterov, Neil Horman, linux-fsdevel, pageexec,
Brad Spengler <spender@grsecurity.net>,
Eugene Teo, KAMEZAWA Hiroyuki
This patch was made on top "oom: remove totalpage normalization from oom_badness()" patch.
===============================
Execve() makes new mm struct and setup stack and push argv vector,
Unfortunately this nascent mm is not pointed any tasks, then
OOM-killer can't detect this memory usage. therefore OOM-killer
may kill incorrect task.
Thus, this patch added task->in_exec_mm member and track
nascent mm usage.
Cc: pageexec@freemail.hu
Cc: Roland McGrath <roland@redhat.com>
Cc: Solar Designer <solar@openwall.com>
Cc: Brad Spengler <spender@grsecurity.net>
Cc: Eugene Teo <eteo@redhat.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
fs/compat.c | 4 +++-
fs/exec.c | 14 +++++++++++++-
include/linux/binfmts.h | 1 +
include/linux/sched.h | 1 +
mm/oom_kill.c | 37 +++++++++++++++++++++++++++++--------
5 files changed, 47 insertions(+), 10 deletions(-)
diff --git a/fs/compat.c b/fs/compat.c
index 718c706..b631120 100644
--- a/fs/compat.c
+++ b/fs/compat.c
@@ -1567,8 +1567,10 @@ int compat_do_execve(char * filename,
return retval;
out:
- if (bprm->mm)
+ if (bprm->mm) {
+ set_exec_mm(NULL);
mmput(bprm->mm);
+ }
out_file:
if (bprm->file) {
diff --git a/fs/exec.c b/fs/exec.c
index 2d94552..b41834c 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -347,6 +347,8 @@ int bprm_mm_init(struct linux_binprm *bprm)
if (err)
goto err;
+ set_exec_mm(bprm->mm);
+
return 0;
err:
@@ -983,6 +985,7 @@ int flush_old_exec(struct linux_binprm * bprm)
goto out;
bprm->mm = NULL; /* We're using it now */
+ set_exec_mm(NULL);
current->flags &= ~PF_RANDOMIZE;
flush_thread();
@@ -1314,6 +1317,13 @@ int search_binary_handler(struct linux_binprm *bprm,struct pt_regs *regs)
EXPORT_SYMBOL(search_binary_handler);
+void set_exec_mm(struct mm_struct *mm)
+{
+ task_lock(current);
+ current->in_exec_mm = mm;
+ task_unlock(current);
+}
+
/*
* sys_execve() executes a new program.
*/
@@ -1402,8 +1412,10 @@ int do_execve(const char * filename,
return retval;
out:
- if (bprm->mm)
+ if (bprm->mm) {
+ set_exec_mm(NULL);
mmput (bprm->mm);
+ }
out_file:
if (bprm->file) {
diff --git a/include/linux/binfmts.h b/include/linux/binfmts.h
index a065612..2fde1ba 100644
--- a/include/linux/binfmts.h
+++ b/include/linux/binfmts.h
@@ -133,6 +133,7 @@ extern void install_exec_creds(struct linux_binprm *bprm);
extern void do_coredump(long signr, int exit_code, struct pt_regs *regs);
extern void set_binfmt(struct linux_binfmt *new);
extern void free_bprm(struct linux_binprm *);
+extern void set_exec_mm(struct mm_struct *mm);
#endif /* __KERNEL__ */
#endif /* _LINUX_BINFMTS_H */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 5e61d60..bb5bf3d 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1226,6 +1226,7 @@ struct task_struct {
int pdeath_signal; /* The signal sent when the parent dies */
/* ??? */
unsigned int personality;
+ struct mm_struct *in_exec_mm;
unsigned did_exec:1;
unsigned in_execve:1; /* Tell the LSMs that the process is doing an
* execve */
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index c1beda0..7d38435 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -120,6 +120,33 @@ struct task_struct *find_lock_task_mm(struct task_struct *p)
return NULL;
}
+/*
+ * The baseline for the badness score is the proportion of RAM that each
+ * task's rss and swap space use.
+ */
+static unsigned long oom_rss_swap_usage(struct task_struct *p)
+{
+ struct task_struct *t = p;
+ int mm_accounted = 0;
+ unsigned long points = 0;
+
+ do {
+ task_lock(t);
+ if (!mm_accounted && t->mm) {
+ points += get_mm_rss(t->mm);
+ points += get_mm_counter(t->mm, MM_SWAPENTS);
+ mm_accounted = 1;
+ }
+ if (t->in_exec_mm) {
+ points += get_mm_rss(t->in_exec_mm);
+ points += get_mm_counter(t->in_exec_mm, MM_SWAPENTS);
+ }
+ task_unlock(t);
+ } while_each_thread(p, t);
+
+ return points;
+}
+
/* return true if the task is not adequate as candidate victim task. */
static bool oom_unkillable_task(struct task_struct *p, struct mem_cgroup *mem,
const nodemask_t *nodemask)
@@ -169,16 +196,10 @@ unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *mem,
if (p->flags & PF_OOM_ORIGIN)
return ULONG_MAX;
- p = find_lock_task_mm(p);
- if (!p)
+ points = oom_rss_swap_usage(p);
+ if (!points)
return 0;
- /*
- * The baseline for the badness score is the proportion of RAM that each
- * task's rss and swap space use.
- */
- points = (get_mm_rss(p->mm) + get_mm_counter(p->mm, MM_SWAPENTS));
- task_unlock(p);
/*
* Root processes get 3% bonus, just like the __vm_enough_memory()
--
1.6.5.2
^ permalink raw reply related [flat|nested] 81+ messages in thread
* Re: [PATCH 1/2] oom: don't ignore rss in nascent mm
2010-09-09 5:03 ` KOSAKI Motohiro
@ 2010-09-09 22:05 ` Oleg Nesterov
-1 siblings, 0 replies; 81+ messages in thread
From: Oleg Nesterov @ 2010-09-09 22:05 UTC (permalink / raw)
To: KOSAKI Motohiro
Cc: Roland McGrath, Linus Torvalds, Andrew Morton, linux-kernel,
oss-security, Solar Designer, Kees Cook, Al Viro, Neil Horman,
linux-fsdevel, pageexec,
Brad Spengler <spender@grsecurity.net>,
Eugene Teo, KAMEZAWA Hiroyuki
On 09/09, KOSAKI Motohiro wrote:
>
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1226,6 +1226,7 @@ struct task_struct {
> int pdeath_signal; /* The signal sent when the parent dies */
> /* ??? */
> unsigned int personality;
> + struct mm_struct *in_exec_mm;
Oh, yes, this has to be per-thread (unlike ->mm, btw).
I wonder if it makes sense to move ->cred_guard_mutex from task_struct
to signal_struct and thus make multiple-threads-inside-exec impossible.
Only one thread can win anyway.
Oleg.
^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: [PATCH 1/2] oom: don't ignore rss in nascent mm
@ 2010-09-09 22:05 ` Oleg Nesterov
0 siblings, 0 replies; 81+ messages in thread
From: Oleg Nesterov @ 2010-09-09 22:05 UTC (permalink / raw)
To: KOSAKI Motohiro
Cc: Roland McGrath, Linus Torvalds, Andrew Morton, linux-kernel,
oss-security, Solar Designer, Kees Cook, Al Viro, Neil Horman,
linux-fsdevel, pageexec,
Brad Spengler <spender@grsecurity.net>,
Eugene Teo, KAMEZAWA Hiroyuki
On 09/09, KOSAKI Motohiro wrote:
>
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1226,6 +1226,7 @@ struct task_struct {
> int pdeath_signal; /* The signal sent when the parent dies */
> /* ??? */
> unsigned int personality;
> + struct mm_struct *in_exec_mm;
Oh, yes, this has to be per-thread (unlike ->mm, btw).
I wonder if it makes sense to move ->cred_guard_mutex from task_struct
to signal_struct and thus make multiple-threads-inside-exec impossible.
Only one thread can win anyway.
Oleg.
^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: [PATCH 1/2] oom: don't ignore rss in nascent mm
@ 2010-09-10 9:39 ` Roland McGrath
0 siblings, 0 replies; 81+ messages in thread
From: Roland McGrath @ 2010-09-10 9:39 UTC (permalink / raw)
To: Oleg Nesterov
Cc: KOSAKI Motohiro, Linus Torvalds, Andrew Morton, linux-kernel,
oss-security, Solar Designer, Kees Cook, Al Viro, Neil Horman,
linux-fsdevel, pageexec, Brad Spengler, Eugene Teo,
KAMEZAWA Hiroyuki
> I wonder if it makes sense to move ->cred_guard_mutex from task_struct
> to signal_struct and thus make multiple-threads-inside-exec impossible.
> Only one thread can win anyway.
That probably makes sense. Note that cred_guard_mutex is also overloaded
for ptrace_attach, so this would add some more serialization of attaches to
threads in the same group. But as long as actual attachment serializes on
tasklist_lock anyway, it doesn't make a material difference. (Even without
that, it would presumably be the same debugger attaching serially to
threads in the same group, so it wouldn't degrade anything in practice.)
Thanks,
Roland
^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: [PATCH 1/2] oom: don't ignore rss in nascent mm
@ 2010-09-10 9:39 ` Roland McGrath
0 siblings, 0 replies; 81+ messages in thread
From: Roland McGrath @ 2010-09-10 9:39 UTC (permalink / raw)
To: Oleg Nesterov
Cc: KOSAKI Motohiro, Linus Torvalds, Andrew Morton,
linux-kernel-u79uwXL29TY76Z2rM5mHXA,
oss-security-ZwoEplunGu1jrUoiu81ncdBPR1lH4CV8, Solar Designer,
Kees Cook, Al Viro, Neil Horman,
linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
pageexec-Y8qEzhMunLyT9ig0jae3mg, Brad Spengler, Eugene Teo,
KAMEZAWA Hiroyuki
> I wonder if it makes sense to move ->cred_guard_mutex from task_struct
> to signal_struct and thus make multiple-threads-inside-exec impossible.
> Only one thread can win anyway.
That probably makes sense. Note that cred_guard_mutex is also overloaded
for ptrace_attach, so this would add some more serialization of attaches to
threads in the same group. But as long as actual attachment serializes on
tasklist_lock anyway, it doesn't make a material difference. (Even without
that, it would presumably be the same debugger attaching serially to
threads in the same group, so it wouldn't degrade anything in practice.)
Thanks,
Roland
^ permalink raw reply [flat|nested] 81+ messages in thread
* [PATCH] move cred_guard_mutex from task_struct to signal_struct
2010-09-09 22:05 ` Oleg Nesterov
@ 2010-09-10 9:57 ` KOSAKI Motohiro
-1 siblings, 0 replies; 81+ messages in thread
From: KOSAKI Motohiro @ 2010-09-10 9:57 UTC (permalink / raw)
To: Oleg Nesterov
Cc: kosaki.motohiro, Roland McGrath, Linus Torvalds, Andrew Morton,
linux-kernel, oss-security, Solar Designer, Kees Cook, Al Viro,
Neil Horman, linux-fsdevel, pageexec,
Brad Spengler <spender@grsecurity.net>,
Eugene Teo, KAMEZAWA Hiroyuki
> On 09/09, KOSAKI Motohiro wrote:
> >
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -1226,6 +1226,7 @@ struct task_struct {
> > int pdeath_signal; /* The signal sent when the parent dies */
> > /* ??? */
> > unsigned int personality;
> > + struct mm_struct *in_exec_mm;
>
> Oh, yes, this has to be per-thread (unlike ->mm, btw).
>
> I wonder if it makes sense to move ->cred_guard_mutex from task_struct
> to signal_struct and thus make multiple-threads-inside-exec impossible.
> Only one thread can win anyway.
Interesting idea, really.
I've thought
1) moving cread_guard_mutex itself
- no increase execve overhead
-> very good
- it also prevent parallel ptrace
-> probably no problem. nobody want parallel debugging
2) move in_exec_mm to signal_struct too
-> very hard. oom-killer can use very few lock because it's called
from various place. now both ->mm and ->in_exec_mm are protected
task_lock() and it help to avoid messy.
So, I've done only (1). I think this restriction effectivly prevent
some theorical execve() resouce smashing attack.
======================================================================
Subject: [PATCH] move cred_guard_mutex from task_struct to signal_struct
Oleg Nesterov pointed out we have to prevent multiple-threads-inside-exec
itself and we can reuse ->cred_guard_mutex for it. Yes, concurrent execve()
has no worth.
Let's move ->cred_guard_mutex from task_struct to signal_struct. It
naturally prevent multiple-threads-inside-exec.
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
fs/exec.c | 8 ++++----
fs/proc/base.c | 8 ++++----
include/linux/init_task.h | 4 ++--
include/linux/sched.h | 7 ++++---
kernel/cred.c | 2 --
kernel/fork.c | 2 ++
kernel/ptrace.c | 4 ++--
7 files changed, 18 insertions(+), 17 deletions(-)
diff --git a/fs/exec.c b/fs/exec.c
index 81d0d06..052ed54 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1091,14 +1091,14 @@ EXPORT_SYMBOL(setup_new_exec);
*/
int prepare_bprm_creds(struct linux_binprm *bprm)
{
- if (mutex_lock_interruptible(¤t->cred_guard_mutex))
+ if (mutex_lock_interruptible(¤t->signal->cred_guard_mutex))
return -ERESTARTNOINTR;
bprm->cred = prepare_exec_creds();
if (likely(bprm->cred))
return 0;
- mutex_unlock(¤t->cred_guard_mutex);
+ mutex_unlock(¤t->signal->cred_guard_mutex);
return -ENOMEM;
}
@@ -1106,7 +1106,7 @@ void free_bprm(struct linux_binprm *bprm)
{
free_arg_pages(bprm);
if (bprm->cred) {
- mutex_unlock(¤t->cred_guard_mutex);
+ mutex_unlock(¤t->signal->cred_guard_mutex);
abort_creds(bprm->cred);
}
kfree(bprm);
@@ -1127,7 +1127,7 @@ void install_exec_creds(struct linux_binprm *bprm)
* credentials; any time after this it may be unlocked.
*/
security_bprm_committed_creds(bprm);
- mutex_unlock(¤t->cred_guard_mutex);
+ mutex_unlock(¤t->signal->cred_guard_mutex);
}
EXPORT_SYMBOL(install_exec_creds);
diff --git a/fs/proc/base.c b/fs/proc/base.c
index 55a16f2..fd97c8f 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -226,7 +226,7 @@ struct mm_struct *mm_for_maps(struct task_struct *task)
{
struct mm_struct *mm;
- if (mutex_lock_killable(&task->cred_guard_mutex))
+ if (mutex_lock_killable(&task->signal->cred_guard_mutex))
return NULL;
mm = get_task_mm(task);
@@ -235,7 +235,7 @@ struct mm_struct *mm_for_maps(struct task_struct *task)
mmput(mm);
mm = NULL;
}
- mutex_unlock(&task->cred_guard_mutex);
+ mutex_unlock(&task->signal->cred_guard_mutex);
return mm;
}
@@ -2273,14 +2273,14 @@ static ssize_t proc_pid_attr_write(struct file * file, const char __user * buf,
goto out_free;
/* Guard against adverse ptrace interaction */
- length = mutex_lock_interruptible(&task->cred_guard_mutex);
+ length = mutex_lock_interruptible(&task->signal->cred_guard_mutex);
if (length < 0)
goto out_free;
length = security_setprocattr(task,
(char*)file->f_path.dentry->d_name.name,
(void*)page, count);
- mutex_unlock(&task->cred_guard_mutex);
+ mutex_unlock(&task->signal->cred_guard_mutex);
out_free:
free_page((unsigned long) page);
out:
diff --git a/include/linux/init_task.h b/include/linux/init_task.h
index 1f43fa5..ff3cc33 100644
--- a/include/linux/init_task.h
+++ b/include/linux/init_task.h
@@ -29,6 +29,8 @@ extern struct fs_struct init_fs;
.running = 0, \
.lock = __SPIN_LOCK_UNLOCKED(sig.cputimer.lock), \
}, \
+ .cred_guard_mutex = \
+ __MUTEX_INITIALIZER(sig.cred_guard_mutex), \
}
extern struct nsproxy init_nsproxy;
@@ -139,8 +141,6 @@ extern struct cred init_cred;
.group_leader = &tsk, \
.real_cred = &init_cred, \
.cred = &init_cred, \
- .cred_guard_mutex = \
- __MUTEX_INITIALIZER(tsk.cred_guard_mutex), \
.comm = "swapper", \
.thread = INIT_THREAD, \
.fs = &init_fs, \
diff --git a/include/linux/sched.h b/include/linux/sched.h
index bb5bf3d..a7e7c2a 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -623,6 +623,10 @@ struct signal_struct {
int oom_adj; /* OOM kill score adjustment (bit shift) */
long oom_score_adj; /* OOM kill score adjustment */
+
+ struct mutex cred_guard_mutex; /* guard against foreign influences on
+ * credential calculations
+ * (notably. ptrace) */
};
/* Context switch must be unlocked if interrupts are to be enabled */
@@ -1293,9 +1297,6 @@ struct task_struct {
* credentials (COW) */
const struct cred *cred; /* effective (overridable) subjective task
* credentials (COW) */
- struct mutex cred_guard_mutex; /* guard against foreign influences on
- * credential calculations
- * (notably. ptrace) */
struct cred *replacement_session_keyring; /* for KEYCTL_SESSION_TO_PARENT */
char comm[TASK_COMM_LEN]; /* executable name excluding path
diff --git a/kernel/cred.c b/kernel/cred.c
index 9a3e226..a81833d 100644
--- a/kernel/cred.c
+++ b/kernel/cred.c
@@ -384,8 +384,6 @@ int copy_creds(struct task_struct *p, unsigned long clone_flags)
struct cred *new;
int ret;
- mutex_init(&p->cred_guard_mutex);
-
if (
#ifdef CONFIG_KEYS
!p->cred->thread_keyring &&
diff --git a/kernel/fork.c b/kernel/fork.c
index b7e9d60..4c0d3ea 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -904,6 +904,8 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
sig->oom_adj = current->signal->oom_adj;
sig->oom_score_adj = current->signal->oom_score_adj;
+ mutex_init(&sig->cred_guard_mutex);
+
return 0;
}
diff --git a/kernel/ptrace.c b/kernel/ptrace.c
index f34d798..ac5013a 100644
--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -181,7 +181,7 @@ int ptrace_attach(struct task_struct *task)
* under ptrace.
*/
retval = -ERESTARTNOINTR;
- if (mutex_lock_interruptible(&task->cred_guard_mutex))
+ if (mutex_lock_interruptible(&task->signal->cred_guard_mutex))
goto out;
task_lock(task);
@@ -208,7 +208,7 @@ int ptrace_attach(struct task_struct *task)
unlock_tasklist:
write_unlock_irq(&tasklist_lock);
unlock_creds:
- mutex_unlock(&task->cred_guard_mutex);
+ mutex_unlock(&task->signal->cred_guard_mutex);
out:
return retval;
}
--
1.6.5.2
^ permalink raw reply related [flat|nested] 81+ messages in thread
* [PATCH] move cred_guard_mutex from task_struct to signal_struct
@ 2010-09-10 9:57 ` KOSAKI Motohiro
0 siblings, 0 replies; 81+ messages in thread
From: KOSAKI Motohiro @ 2010-09-10 9:57 UTC (permalink / raw)
To: Oleg Nesterov
Cc: kosaki.motohiro, Roland McGrath, Linus Torvalds, Andrew Morton,
linux-kernel, oss-security, Solar Designer, Kees Cook, Al Viro,
Neil Horman, linux-fsdevel, pageexec,
Brad Spengler <spender@grsecurity.net>,
Eugene Teo, KAMEZAWA Hiroyuki
> On 09/09, KOSAKI Motohiro wrote:
> >
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -1226,6 +1226,7 @@ struct task_struct {
> > int pdeath_signal; /* The signal sent when the parent dies */
> > /* ??? */
> > unsigned int personality;
> > + struct mm_struct *in_exec_mm;
>
> Oh, yes, this has to be per-thread (unlike ->mm, btw).
>
> I wonder if it makes sense to move ->cred_guard_mutex from task_struct
> to signal_struct and thus make multiple-threads-inside-exec impossible.
> Only one thread can win anyway.
Interesting idea, really.
I've thought
1) moving cread_guard_mutex itself
- no increase execve overhead
-> very good
- it also prevent parallel ptrace
-> probably no problem. nobody want parallel debugging
2) move in_exec_mm to signal_struct too
-> very hard. oom-killer can use very few lock because it's called
from various place. now both ->mm and ->in_exec_mm are protected
task_lock() and it help to avoid messy.
So, I've done only (1). I think this restriction effectivly prevent
some theorical execve() resouce smashing attack.
======================================================================
Subject: [PATCH] move cred_guard_mutex from task_struct to signal_struct
Oleg Nesterov pointed out we have to prevent multiple-threads-inside-exec
itself and we can reuse ->cred_guard_mutex for it. Yes, concurrent execve()
has no worth.
Let's move ->cred_guard_mutex from task_struct to signal_struct. It
naturally prevent multiple-threads-inside-exec.
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
fs/exec.c | 8 ++++----
fs/proc/base.c | 8 ++++----
include/linux/init_task.h | 4 ++--
include/linux/sched.h | 7 ++++---
kernel/cred.c | 2 --
kernel/fork.c | 2 ++
kernel/ptrace.c | 4 ++--
7 files changed, 18 insertions(+), 17 deletions(-)
diff --git a/fs/exec.c b/fs/exec.c
index 81d0d06..052ed54 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1091,14 +1091,14 @@ EXPORT_SYMBOL(setup_new_exec);
*/
int prepare_bprm_creds(struct linux_binprm *bprm)
{
- if (mutex_lock_interruptible(¤t->cred_guard_mutex))
+ if (mutex_lock_interruptible(¤t->signal->cred_guard_mutex))
return -ERESTARTNOINTR;
bprm->cred = prepare_exec_creds();
if (likely(bprm->cred))
return 0;
- mutex_unlock(¤t->cred_guard_mutex);
+ mutex_unlock(¤t->signal->cred_guard_mutex);
return -ENOMEM;
}
@@ -1106,7 +1106,7 @@ void free_bprm(struct linux_binprm *bprm)
{
free_arg_pages(bprm);
if (bprm->cred) {
- mutex_unlock(¤t->cred_guard_mutex);
+ mutex_unlock(¤t->signal->cred_guard_mutex);
abort_creds(bprm->cred);
}
kfree(bprm);
@@ -1127,7 +1127,7 @@ void install_exec_creds(struct linux_binprm *bprm)
* credentials; any time after this it may be unlocked.
*/
security_bprm_committed_creds(bprm);
- mutex_unlock(¤t->cred_guard_mutex);
+ mutex_unlock(¤t->signal->cred_guard_mutex);
}
EXPORT_SYMBOL(install_exec_creds);
diff --git a/fs/proc/base.c b/fs/proc/base.c
index 55a16f2..fd97c8f 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -226,7 +226,7 @@ struct mm_struct *mm_for_maps(struct task_struct *task)
{
struct mm_struct *mm;
- if (mutex_lock_killable(&task->cred_guard_mutex))
+ if (mutex_lock_killable(&task->signal->cred_guard_mutex))
return NULL;
mm = get_task_mm(task);
@@ -235,7 +235,7 @@ struct mm_struct *mm_for_maps(struct task_struct *task)
mmput(mm);
mm = NULL;
}
- mutex_unlock(&task->cred_guard_mutex);
+ mutex_unlock(&task->signal->cred_guard_mutex);
return mm;
}
@@ -2273,14 +2273,14 @@ static ssize_t proc_pid_attr_write(struct file * file, const char __user * buf,
goto out_free;
/* Guard against adverse ptrace interaction */
- length = mutex_lock_interruptible(&task->cred_guard_mutex);
+ length = mutex_lock_interruptible(&task->signal->cred_guard_mutex);
if (length < 0)
goto out_free;
length = security_setprocattr(task,
(char*)file->f_path.dentry->d_name.name,
(void*)page, count);
- mutex_unlock(&task->cred_guard_mutex);
+ mutex_unlock(&task->signal->cred_guard_mutex);
out_free:
free_page((unsigned long) page);
out:
diff --git a/include/linux/init_task.h b/include/linux/init_task.h
index 1f43fa5..ff3cc33 100644
--- a/include/linux/init_task.h
+++ b/include/linux/init_task.h
@@ -29,6 +29,8 @@ extern struct fs_struct init_fs;
.running = 0, \
.lock = __SPIN_LOCK_UNLOCKED(sig.cputimer.lock), \
}, \
+ .cred_guard_mutex = \
+ __MUTEX_INITIALIZER(sig.cred_guard_mutex), \
}
extern struct nsproxy init_nsproxy;
@@ -139,8 +141,6 @@ extern struct cred init_cred;
.group_leader = &tsk, \
.real_cred = &init_cred, \
.cred = &init_cred, \
- .cred_guard_mutex = \
- __MUTEX_INITIALIZER(tsk.cred_guard_mutex), \
.comm = "swapper", \
.thread = INIT_THREAD, \
.fs = &init_fs, \
diff --git a/include/linux/sched.h b/include/linux/sched.h
index bb5bf3d..a7e7c2a 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -623,6 +623,10 @@ struct signal_struct {
int oom_adj; /* OOM kill score adjustment (bit shift) */
long oom_score_adj; /* OOM kill score adjustment */
+
+ struct mutex cred_guard_mutex; /* guard against foreign influences on
+ * credential calculations
+ * (notably. ptrace) */
};
/* Context switch must be unlocked if interrupts are to be enabled */
@@ -1293,9 +1297,6 @@ struct task_struct {
* credentials (COW) */
const struct cred *cred; /* effective (overridable) subjective task
* credentials (COW) */
- struct mutex cred_guard_mutex; /* guard against foreign influences on
- * credential calculations
- * (notably. ptrace) */
struct cred *replacement_session_keyring; /* for KEYCTL_SESSION_TO_PARENT */
char comm[TASK_COMM_LEN]; /* executable name excluding path
diff --git a/kernel/cred.c b/kernel/cred.c
index 9a3e226..a81833d 100644
--- a/kernel/cred.c
+++ b/kernel/cred.c
@@ -384,8 +384,6 @@ int copy_creds(struct task_struct *p, unsigned long clone_flags)
struct cred *new;
int ret;
- mutex_init(&p->cred_guard_mutex);
-
if (
#ifdef CONFIG_KEYS
!p->cred->thread_keyring &&
diff --git a/kernel/fork.c b/kernel/fork.c
index b7e9d60..4c0d3ea 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -904,6 +904,8 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
sig->oom_adj = current->signal->oom_adj;
sig->oom_score_adj = current->signal->oom_score_adj;
+ mutex_init(&sig->cred_guard_mutex);
+
return 0;
}
diff --git a/kernel/ptrace.c b/kernel/ptrace.c
index f34d798..ac5013a 100644
--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -181,7 +181,7 @@ int ptrace_attach(struct task_struct *task)
* under ptrace.
*/
retval = -ERESTARTNOINTR;
- if (mutex_lock_interruptible(&task->cred_guard_mutex))
+ if (mutex_lock_interruptible(&task->signal->cred_guard_mutex))
goto out;
task_lock(task);
@@ -208,7 +208,7 @@ int ptrace_attach(struct task_struct *task)
unlock_tasklist:
write_unlock_irq(&tasklist_lock);
unlock_creds:
- mutex_unlock(&task->cred_guard_mutex);
+ mutex_unlock(&task->signal->cred_guard_mutex);
out:
return retval;
}
--
1.6.5.2
^ permalink raw reply related [flat|nested] 81+ messages in thread
* Re: [PATCH] move cred_guard_mutex from task_struct to signal_struct
2010-09-10 9:57 ` KOSAKI Motohiro
@ 2010-09-10 17:24 ` Oleg Nesterov
-1 siblings, 0 replies; 81+ messages in thread
From: Oleg Nesterov @ 2010-09-10 17:24 UTC (permalink / raw)
To: KOSAKI Motohiro
Cc: Roland McGrath, Linus Torvalds, Andrew Morton, linux-kernel,
oss-security, Solar Designer, Kees Cook, Al Viro, Neil Horman,
linux-fsdevel, pageexec,
Brad Spengler <spender@grsecurity.net>,
Eugene Teo, KAMEZAWA Hiroyuki
On 09/10, KOSAKI Motohiro wrote:
>
> 1) moving cread_guard_mutex itself
> - no increase execve overhead
> -> very good
> - it also prevent parallel ptrace
No, it doesn't. Only PTRACE_ATTACH needs this mutex, and as Roland
pointed out it also needs write_lock(tasklist) which is worse. So
this change doesn't make any practical harm for ptrace.
> 2) move in_exec_mm to signal_struct too
> -> very hard. oom-killer can use very few lock because it's called
> from various place. now both ->mm and ->in_exec_mm are protected
> task_lock() and it help to avoid messy.
Yes. But, if ->in_exec_mm is only used by oom_badness(), then I think
you can use task_lock(tsk->group_leader). oom_badness() needs tasklist
anyway, this means it can't race with de_thread() changing the leader.
But up to you.
Another very minor nit (but again, up to you). Perhaps exec_mmap()
could clear ->in_exec_mm (in task_struct or signal_struct, this doesnt
matter), it takes task_lock(current) anyway (and at this point current
is always the group leader).
> Let's move ->cred_guard_mutex from task_struct to signal_struct. It
> naturally prevent multiple-threads-inside-exec.
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
This is very minor, but perhaps you can also fix a couple of comments
which mention task->cred_guard_mutex,
fs/exec.c:1109 the caller must hold current->cred_guard_mutex
kernel/cred.c:328 The caller must hold current->cred_guard_mutex
include/linux/tracehook.h:153 @task->cred_guard_mutex
Oleg.
^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: [PATCH] move cred_guard_mutex from task_struct to signal_struct
@ 2010-09-10 17:24 ` Oleg Nesterov
0 siblings, 0 replies; 81+ messages in thread
From: Oleg Nesterov @ 2010-09-10 17:24 UTC (permalink / raw)
To: KOSAKI Motohiro
Cc: Roland McGrath, Linus Torvalds, Andrew Morton, linux-kernel,
oss-security, Solar Designer, Kees Cook, Al Viro, Neil Horman,
linux-fsdevel, pageexec,
Brad Spengler <spender@grsecurity.net>,
Eugene Teo, KAMEZAWA Hiroyuki
On 09/10, KOSAKI Motohiro wrote:
>
> 1) moving cread_guard_mutex itself
> - no increase execve overhead
> -> very good
> - it also prevent parallel ptrace
No, it doesn't. Only PTRACE_ATTACH needs this mutex, and as Roland
pointed out it also needs write_lock(tasklist) which is worse. So
this change doesn't make any practical harm for ptrace.
> 2) move in_exec_mm to signal_struct too
> -> very hard. oom-killer can use very few lock because it's called
> from various place. now both ->mm and ->in_exec_mm are protected
> task_lock() and it help to avoid messy.
Yes. But, if ->in_exec_mm is only used by oom_badness(), then I think
you can use task_lock(tsk->group_leader). oom_badness() needs tasklist
anyway, this means it can't race with de_thread() changing the leader.
But up to you.
Another very minor nit (but again, up to you). Perhaps exec_mmap()
could clear ->in_exec_mm (in task_struct or signal_struct, this doesnt
matter), it takes task_lock(current) anyway (and at this point current
is always the group leader).
> Let's move ->cred_guard_mutex from task_struct to signal_struct. It
> naturally prevent multiple-threads-inside-exec.
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
This is very minor, but perhaps you can also fix a couple of comments
which mention task->cred_guard_mutex,
fs/exec.c:1109 the caller must hold current->cred_guard_mutex
kernel/cred.c:328 The caller must hold current->cred_guard_mutex
include/linux/tracehook.h:153 @task->cred_guard_mutex
Oleg.
^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: [PATCH] move cred_guard_mutex from task_struct to signal_struct
2010-09-10 17:24 ` Oleg Nesterov
@ 2010-09-16 5:51 ` KOSAKI Motohiro
-1 siblings, 0 replies; 81+ messages in thread
From: KOSAKI Motohiro @ 2010-09-16 5:51 UTC (permalink / raw)
To: Oleg Nesterov
Cc: kosaki.motohiro, Roland McGrath, Linus Torvalds, Andrew Morton,
linux-kernel, oss-security, Solar Designer, Kees Cook, Al Viro,
Neil Horman, linux-fsdevel, pageexec,
Brad Spengler <spender@grsecurity.net>,
Eugene Teo, KAMEZAWA Hiroyuki
> On 09/10, KOSAKI Motohiro wrote:
> >
> > 1) moving cread_guard_mutex itself
> > - no increase execve overhead
> > -> very good
> > - it also prevent parallel ptrace
>
> No, it doesn't. Only PTRACE_ATTACH needs this mutex, and as Roland
> pointed out it also needs write_lock(tasklist) which is worse. So
> this change doesn't make any practical harm for ptrace.
I see, thanks.
>
> > 2) move in_exec_mm to signal_struct too
> > -> very hard. oom-killer can use very few lock because it's called
> > from various place. now both ->mm and ->in_exec_mm are protected
> > task_lock() and it help to avoid messy.
>
> Yes. But, if ->in_exec_mm is only used by oom_badness(), then I think
> you can use task_lock(tsk->group_leader). oom_badness() needs tasklist
> anyway, this means it can't race with de_thread() changing the leader.
> But up to you.
Good idea. will fix.
>
> Another very minor nit (but again, up to you). Perhaps exec_mmap()
> could clear ->in_exec_mm (in task_struct or signal_struct, this doesnt
> matter), it takes task_lock(current) anyway (and at this point current
> is always the group leader).
Thanks. will fix.
>
> > Let's move ->cred_guard_mutex from task_struct to signal_struct. It
> > naturally prevent multiple-threads-inside-exec.
>
> Reviewed-by: Oleg Nesterov <oleg@redhat.com>
>
>
> This is very minor, but perhaps you can also fix a couple of comments
> which mention task->cred_guard_mutex,
>
> fs/exec.c:1109 the caller must hold current->cred_guard_mutex
> kernel/cred.c:328 The caller must hold current->cred_guard_mutex
> include/linux/tracehook.h:153 @task->cred_guard_mutex
Will fix, of cource.
^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: [PATCH] move cred_guard_mutex from task_struct to signal_struct
@ 2010-09-16 5:51 ` KOSAKI Motohiro
0 siblings, 0 replies; 81+ messages in thread
From: KOSAKI Motohiro @ 2010-09-16 5:51 UTC (permalink / raw)
To: Oleg Nesterov
Cc: kosaki.motohiro, Roland McGrath, Linus Torvalds, Andrew Morton,
linux-kernel, oss-security, Solar Designer, Kees Cook, Al Viro,
Neil Horman, linux-fsdevel, pageexec,
Brad Spengler <spender@grsecurity.net>,
Eugene Teo, KAMEZAWA Hiroyuki
> On 09/10, KOSAKI Motohiro wrote:
> >
> > 1) moving cread_guard_mutex itself
> > - no increase execve overhead
> > -> very good
> > - it also prevent parallel ptrace
>
> No, it doesn't. Only PTRACE_ATTACH needs this mutex, and as Roland
> pointed out it also needs write_lock(tasklist) which is worse. So
> this change doesn't make any practical harm for ptrace.
I see, thanks.
>
> > 2) move in_exec_mm to signal_struct too
> > -> very hard. oom-killer can use very few lock because it's called
> > from various place. now both ->mm and ->in_exec_mm are protected
> > task_lock() and it help to avoid messy.
>
> Yes. But, if ->in_exec_mm is only used by oom_badness(), then I think
> you can use task_lock(tsk->group_leader). oom_badness() needs tasklist
> anyway, this means it can't race with de_thread() changing the leader.
> But up to you.
Good idea. will fix.
>
> Another very minor nit (but again, up to you). Perhaps exec_mmap()
> could clear ->in_exec_mm (in task_struct or signal_struct, this doesnt
> matter), it takes task_lock(current) anyway (and at this point current
> is always the group leader).
Thanks. will fix.
>
> > Let's move ->cred_guard_mutex from task_struct to signal_struct. It
> > naturally prevent multiple-threads-inside-exec.
>
> Reviewed-by: Oleg Nesterov <oleg@redhat.com>
>
>
> This is very minor, but perhaps you can also fix a couple of comments
> which mention task->cred_guard_mutex,
>
> fs/exec.c:1109 the caller must hold current->cred_guard_mutex
> kernel/cred.c:328 The caller must hold current->cred_guard_mutex
> include/linux/tracehook.h:153 @task->cred_guard_mutex
Will fix, of cource.
^ permalink raw reply [flat|nested] 81+ messages in thread
* [PATCH 2/2] execve: check the VM has enough memory at first
2010-09-09 5:01 ` KOSAKI Motohiro
@ 2010-09-09 5:04 ` KOSAKI Motohiro
-1 siblings, 0 replies; 81+ messages in thread
From: KOSAKI Motohiro @ 2010-09-09 5:04 UTC (permalink / raw)
To: KOSAKI Motohiro
Cc: kosaki.motohiro, Roland McGrath, Linus Torvalds, Andrew Morton,
linux-kernel, oss-security, Solar Designer, Kees Cook, Al Viro,
Oleg Nesterov, Neil Horman, linux-fsdevel, pageexec,
Brad Spengler <spender@grsecurity.net>,
Eugene Teo, KAMEZAWA Hiroyuki
Now, Argv size of execve are limited by STACK_LIMIT/4. In other
words, If we are setting 'ulimit -s unlimited', we've lost any
guard of argv size.
More unfortunately, current argv setup logic is bypassing the VM
overcommitment check unintentionally. because current overcommitment
check don't care gradually increased stack.
therefore, wrong argument of execve() easily makes OOM instead
execve() failure. that's bad.
After this patch, execve() expand stack at first and receive to
check vm_enough_memory() properly. then, too long argument of
execve() than the machine memory return EFAULT properly.
Note, almost all user are using OVERCOMMIT_GUESS overcommit mode
(because it's default). It provide only guess. It doesn't works
perfectly on some case. However usually execve() failure is better
than OOM-killer and swap-storm compbination.
Cc: pageexec@freemail.hu
Cc: Roland McGrath <roland@redhat.com>
Cc: Solar Designer <solar@openwall.com>
Cc: Brad Spengler <spender@grsecurity.net>
Cc: Eugene Teo <eteo@redhat.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
fs/compat.c | 38 +++++++++++++++++++++++++++++++-------
fs/exec.c | 38 +++++++++++++++++++++++++++++++-------
2 files changed, 62 insertions(+), 14 deletions(-)
diff --git a/fs/compat.c b/fs/compat.c
index b631120..ff32573 100644
--- a/fs/compat.c
+++ b/fs/compat.c
@@ -1394,10 +1394,39 @@ static int compat_copy_strings(int argc, compat_uptr_t __user *argv,
char *kaddr = NULL;
unsigned long kpos = 0;
int ret;
+ compat_uptr_t str;
+ int len;
+ int i;
+ unsigned long total_len = 0;
+
+ for (i = 0; i < argc; i++) {
+ ret = -EFAULT;
+ if (get_user(str, argv+i))
+ goto out;
+ len = strnlen_user(compat_ptr(str), MAX_ARG_STRLEN);
+ if (!len)
+ goto out;
+
+ ret = -E2BIG;
+ if (len > MAX_ARG_STRLEN)
+ goto out;
+
+ total_len += len;
+ if (total_len > bprm->p)
+ goto out;
+ }
+
+ /*
+ * Firstly, we try to expand stack. It also invoke
+ * security_vm_enough_memory() and get failure when we don't
+ * have enough space. It help to avoid stack smashing by plenty argv.
+ */
+ ret = get_user_pages(current, bprm->mm, bprm->p - total_len,
+ 1, 1, 1, NULL, NULL);
+ if (ret < 0)
+ goto out;
while (argc-- > 0) {
- compat_uptr_t str;
- int len;
unsigned long pos;
if (get_user(str, argv+argc) ||
@@ -1406,11 +1435,6 @@ static int compat_copy_strings(int argc, compat_uptr_t __user *argv,
goto out;
}
- if (len > MAX_ARG_STRLEN) {
- ret = -E2BIG;
- goto out;
- }
-
/* We're going to work our way backwords. */
pos = bprm->p;
str += len;
diff --git a/fs/exec.c b/fs/exec.c
index b41834c..ef8b9dc 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -396,10 +396,39 @@ static int copy_strings(int argc, const char __user *const __user *argv,
char *kaddr = NULL;
unsigned long kpos = 0;
int ret;
+ int i;
+ unsigned long total_len = 0;
+ const char __user *str;
+ int len;
+
+ for (i = 0; i < argc; i++) {
+ ret = -EFAULT;
+ if (get_user(str, argv+i))
+ goto out;
+ len = strnlen_user(str, MAX_ARG_STRLEN);
+ if (!len)
+ goto out;
+
+ ret = -E2BIG;
+ if (!valid_arg_len(bprm, len))
+ goto out;
+
+ total_len += len;
+ if (total_len > bprm->p)
+ goto out;
+ }
+
+ /*
+ * Firstly, we try to expand stack. It also invoke
+ * security_vm_enough_memory() and get failure when we don't
+ * have enough space. It help to avoid stack smashing by plenty argv.
+ */
+ ret = get_user_pages(current, bprm->mm, bprm->p - total_len,
+ 1, 1, 1, NULL, NULL);
+ if (ret < 0)
+ goto out;
while (argc-- > 0) {
- const char __user *str;
- int len;
unsigned long pos;
if (get_user(str, argv+argc) ||
@@ -408,11 +437,6 @@ static int copy_strings(int argc, const char __user *const __user *argv,
goto out;
}
- if (!valid_arg_len(bprm, len)) {
- ret = -E2BIG;
- goto out;
- }
-
/* We're going to work our way backwords. */
pos = bprm->p;
str += len;
--
1.6.5.2
^ permalink raw reply related [flat|nested] 81+ messages in thread
* [PATCH 2/2] execve: check the VM has enough memory at first
@ 2010-09-09 5:04 ` KOSAKI Motohiro
0 siblings, 0 replies; 81+ messages in thread
From: KOSAKI Motohiro @ 2010-09-09 5:04 UTC (permalink / raw)
To: KOSAKI Motohiro
Cc: kosaki.motohiro, Roland McGrath, Linus Torvalds, Andrew Morton,
linux-kernel, oss-security, Solar Designer, Kees Cook, Al Viro,
Oleg Nesterov, Neil Horman, linux-fsdevel, pageexec,
Brad Spengler <spender@grsecurity.net>,
Eugene Teo, KAMEZAWA Hiroyuki
Now, Argv size of execve are limited by STACK_LIMIT/4. In other
words, If we are setting 'ulimit -s unlimited', we've lost any
guard of argv size.
More unfortunately, current argv setup logic is bypassing the VM
overcommitment check unintentionally. because current overcommitment
check don't care gradually increased stack.
therefore, wrong argument of execve() easily makes OOM instead
execve() failure. that's bad.
After this patch, execve() expand stack at first and receive to
check vm_enough_memory() properly. then, too long argument of
execve() than the machine memory return EFAULT properly.
Note, almost all user are using OVERCOMMIT_GUESS overcommit mode
(because it's default). It provide only guess. It doesn't works
perfectly on some case. However usually execve() failure is better
than OOM-killer and swap-storm compbination.
Cc: pageexec@freemail.hu
Cc: Roland McGrath <roland@redhat.com>
Cc: Solar Designer <solar@openwall.com>
Cc: Brad Spengler <spender@grsecurity.net>
Cc: Eugene Teo <eteo@redhat.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
fs/compat.c | 38 +++++++++++++++++++++++++++++++-------
fs/exec.c | 38 +++++++++++++++++++++++++++++++-------
2 files changed, 62 insertions(+), 14 deletions(-)
diff --git a/fs/compat.c b/fs/compat.c
index b631120..ff32573 100644
--- a/fs/compat.c
+++ b/fs/compat.c
@@ -1394,10 +1394,39 @@ static int compat_copy_strings(int argc, compat_uptr_t __user *argv,
char *kaddr = NULL;
unsigned long kpos = 0;
int ret;
+ compat_uptr_t str;
+ int len;
+ int i;
+ unsigned long total_len = 0;
+
+ for (i = 0; i < argc; i++) {
+ ret = -EFAULT;
+ if (get_user(str, argv+i))
+ goto out;
+ len = strnlen_user(compat_ptr(str), MAX_ARG_STRLEN);
+ if (!len)
+ goto out;
+
+ ret = -E2BIG;
+ if (len > MAX_ARG_STRLEN)
+ goto out;
+
+ total_len += len;
+ if (total_len > bprm->p)
+ goto out;
+ }
+
+ /*
+ * Firstly, we try to expand stack. It also invoke
+ * security_vm_enough_memory() and get failure when we don't
+ * have enough space. It help to avoid stack smashing by plenty argv.
+ */
+ ret = get_user_pages(current, bprm->mm, bprm->p - total_len,
+ 1, 1, 1, NULL, NULL);
+ if (ret < 0)
+ goto out;
while (argc-- > 0) {
- compat_uptr_t str;
- int len;
unsigned long pos;
if (get_user(str, argv+argc) ||
@@ -1406,11 +1435,6 @@ static int compat_copy_strings(int argc, compat_uptr_t __user *argv,
goto out;
}
- if (len > MAX_ARG_STRLEN) {
- ret = -E2BIG;
- goto out;
- }
-
/* We're going to work our way backwords. */
pos = bprm->p;
str += len;
diff --git a/fs/exec.c b/fs/exec.c
index b41834c..ef8b9dc 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -396,10 +396,39 @@ static int copy_strings(int argc, const char __user *const __user *argv,
char *kaddr = NULL;
unsigned long kpos = 0;
int ret;
+ int i;
+ unsigned long total_len = 0;
+ const char __user *str;
+ int len;
+
+ for (i = 0; i < argc; i++) {
+ ret = -EFAULT;
+ if (get_user(str, argv+i))
+ goto out;
+ len = strnlen_user(str, MAX_ARG_STRLEN);
+ if (!len)
+ goto out;
+
+ ret = -E2BIG;
+ if (!valid_arg_len(bprm, len))
+ goto out;
+
+ total_len += len;
+ if (total_len > bprm->p)
+ goto out;
+ }
+
+ /*
+ * Firstly, we try to expand stack. It also invoke
+ * security_vm_enough_memory() and get failure when we don't
+ * have enough space. It help to avoid stack smashing by plenty argv.
+ */
+ ret = get_user_pages(current, bprm->mm, bprm->p - total_len,
+ 1, 1, 1, NULL, NULL);
+ if (ret < 0)
+ goto out;
while (argc-- > 0) {
- const char __user *str;
- int len;
unsigned long pos;
if (get_user(str, argv+argc) ||
@@ -408,11 +437,6 @@ static int copy_strings(int argc, const char __user *const __user *argv,
goto out;
}
- if (!valid_arg_len(bprm, len)) {
- ret = -E2BIG;
- goto out;
- }
-
/* We're going to work our way backwords. */
pos = bprm->p;
str += len;
--
1.6.5.2
^ permalink raw reply related [flat|nested] 81+ messages in thread
* Re: [PATCH 2/2] execve: check the VM has enough memory at first
2010-09-09 5:04 ` KOSAKI Motohiro
@ 2010-09-10 15:06 ` Linus Torvalds
-1 siblings, 0 replies; 81+ messages in thread
From: Linus Torvalds @ 2010-09-10 15:06 UTC (permalink / raw)
To: KOSAKI Motohiro
Cc: Roland McGrath, Andrew Morton, linux-kernel, oss-security,
Solar Designer, Kees Cook, Al Viro, Oleg Nesterov, Neil Horman,
linux-fsdevel, pageexec,
Brad Spengler <spender@grsecurity.net>,
Eugene Teo, KAMEZAWA Hiroyuki
On Wed, Sep 8, 2010 at 10:04 PM, KOSAKI Motohiro
<kosaki.motohiro@jp.fujitsu.com> wrote:
>
> After this patch, execve() expand stack at first and receive to
> check vm_enough_memory() properly. then, too long argument of
> execve() than the machine memory return EFAULT properly.
This is horrible. We don't want to walk the arguments one more time
just for this. Let's just improve the checks that we do as we go
along.
Linus
^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: [PATCH 2/2] execve: check the VM has enough memory at first
@ 2010-09-10 15:06 ` Linus Torvalds
0 siblings, 0 replies; 81+ messages in thread
From: Linus Torvalds @ 2010-09-10 15:06 UTC (permalink / raw)
To: KOSAKI Motohiro
Cc: Roland McGrath, Andrew Morton, linux-kernel, oss-security,
Solar Designer, Kees Cook, Al Viro, Oleg Nesterov, Neil Horman,
linux-fsdevel, pageexec,
Brad Spengler <spender@grsecurity.net>,
Eugene Teo, KAMEZAWA Hiroyuki
On Wed, Sep 8, 2010 at 10:04 PM, KOSAKI Motohiro
<kosaki.motohiro@jp.fujitsu.com> wrote:
>
> After this patch, execve() expand stack at first and receive to
> check vm_enough_memory() properly. then, too long argument of
> execve() than the machine memory return EFAULT properly.
This is horrible. We don't want to walk the arguments one more time
just for this. Let's just improve the checks that we do as we go
along.
Linus
^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: [PATCH 2/2] execve: check the VM has enough memory at first
2010-09-10 15:06 ` Linus Torvalds
@ 2010-09-14 1:52 ` KOSAKI Motohiro
-1 siblings, 0 replies; 81+ messages in thread
From: KOSAKI Motohiro @ 2010-09-14 1:52 UTC (permalink / raw)
To: Linus Torvalds
Cc: kosaki.motohiro, Roland McGrath, Andrew Morton, linux-kernel,
oss-security, Solar Designer, Kees Cook, Al Viro, Oleg Nesterov,
Neil Horman, linux-fsdevel, pageexec,
Brad Spengler <spender@grsecurity.net>,
Eugene Teo, KAMEZAWA Hiroyuki
> On Wed, Sep 8, 2010 at 10:04 PM, KOSAKI Motohiro
> <kosaki.motohiro@jp.fujitsu.com> wrote:
> >
> > After this patch, execve() expand stack at first and receive to
> > check vm_enough_memory() properly. then, too long argument of
> > execve() than the machine memory return EFAULT properly.
>
> This is horrible. We don't want to walk the arguments one more time
> just for this. Let's just improve the checks that we do as we go
> along.
>
> Linus
Okey. I'll consider new way in this night.
^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: [PATCH 2/2] execve: check the VM has enough memory at first
@ 2010-09-14 1:52 ` KOSAKI Motohiro
0 siblings, 0 replies; 81+ messages in thread
From: KOSAKI Motohiro @ 2010-09-14 1:52 UTC (permalink / raw)
To: Linus Torvalds
Cc: kosaki.motohiro, Roland McGrath, Andrew Morton, linux-kernel,
oss-security, Solar Designer, Kees Cook, Al Viro, Oleg Nesterov,
Neil Horman, linux-fsdevel, pageexec,
Brad Spengler <spender@grsecurity.net>,
Eugene Teo, KAMEZAWA Hiroyuki
> On Wed, Sep 8, 2010 at 10:04 PM, KOSAKI Motohiro
> <kosaki.motohiro@jp.fujitsu.com> wrote:
> >
> > After this patch, execve() expand stack at first and receive to
> > check vm_enough_memory() properly. then, too long argument of
> > execve() than the machine memory return EFAULT properly.
>
> This is horrible. We don't want to walk the arguments one more time
> just for this. Let's just improve the checks that we do as we go
> along.
>
> Linus
Okey. I'll consider new way in this night.
^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: [PATCH 2/2] execve: check the VM has enough memory at first
2010-09-14 1:52 ` KOSAKI Motohiro
@ 2010-09-16 5:51 ` KOSAKI Motohiro
-1 siblings, 0 replies; 81+ messages in thread
From: KOSAKI Motohiro @ 2010-09-16 5:51 UTC (permalink / raw)
To: Linus Torvalds
Cc: kosaki.motohiro, Roland McGrath, Andrew Morton, linux-kernel,
oss-security, Solar Designer, Kees Cook, Al Viro, Oleg Nesterov,
Neil Horman, linux-fsdevel, pageexec,
Brad Spengler <spender@grsecurity.net>,
Eugene Teo, KAMEZAWA Hiroyuki
> > On Wed, Sep 8, 2010 at 10:04 PM, KOSAKI Motohiro
> > <kosaki.motohiro@jp.fujitsu.com> wrote:
> > >
> > > After this patch, execve() expand stack at first and receive to
> > > check vm_enough_memory() properly. then, too long argument of
> > > execve() than the machine memory return EFAULT properly.
> >
> > This is horrible. We don't want to walk the arguments one more time
> > just for this. Let's just improve the checks that we do as we go
> > along.
> >
> > Linus
>
> Okey. I'll consider new way in this night.
After while thinking, I decided to just drop this idea. because
1) If one pass check is must, we can't reuse vm-overcommit check.
2) Glibc has the duplicated hueristic, then we can't change it nor
introduce new hard limit. (Sh*t)
3) This is not must fix, it only mitigate a pain when accidental large
argv case. Only OOM fixes enough care intended attack case.
4) distro can change default of rlim_max of RLIMIT_STACK. It protect
from RLIM_INFINITY smash.
Briefly says, to introduce new limit has bad benefit/risk balance. Sadly.
^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: [PATCH 2/2] execve: check the VM has enough memory at first
@ 2010-09-16 5:51 ` KOSAKI Motohiro
0 siblings, 0 replies; 81+ messages in thread
From: KOSAKI Motohiro @ 2010-09-16 5:51 UTC (permalink / raw)
To: Linus Torvalds
Cc: kosaki.motohiro, Roland McGrath, Andrew Morton, linux-kernel,
oss-security, Solar Designer, Kees Cook, Al Viro, Oleg Nesterov,
Neil Horman, linux-fsdevel, pageexec,
Brad Spengler <spender@grsecurity.net>,
Eugene Teo, KAMEZAWA Hiroyuki
> > On Wed, Sep 8, 2010 at 10:04 PM, KOSAKI Motohiro
> > <kosaki.motohiro@jp.fujitsu.com> wrote:
> > >
> > > After this patch, execve() expand stack at first and receive to
> > > check vm_enough_memory() properly. then, too long argument of
> > > execve() than the machine memory return EFAULT properly.
> >
> > This is horrible. We don't want to walk the arguments one more time
> > just for this. Let's just improve the checks that we do as we go
> > along.
> >
> > Linus
>
> Okey. I'll consider new way in this night.
After while thinking, I decided to just drop this idea. because
1) If one pass check is must, we can't reuse vm-overcommit check.
2) Glibc has the duplicated hueristic, then we can't change it nor
introduce new hard limit. (Sh*t)
3) This is not must fix, it only mitigate a pain when accidental large
argv case. Only OOM fixes enough care intended attack case.
4) distro can change default of rlim_max of RLIMIT_STACK. It protect
from RLIM_INFINITY smash.
Briefly says, to introduce new limit has bad benefit/risk balance. Sadly.
^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: [PATCH 2/2] execve: check the VM has enough memory at first
2010-09-16 5:51 ` KOSAKI Motohiro
@ 2010-09-16 15:01 ` Linus Torvalds
-1 siblings, 0 replies; 81+ messages in thread
From: Linus Torvalds @ 2010-09-16 15:01 UTC (permalink / raw)
To: KOSAKI Motohiro
Cc: Roland McGrath, Andrew Morton, linux-kernel, oss-security,
Solar Designer, Kees Cook, Al Viro, Oleg Nesterov, Neil Horman,
linux-fsdevel, pageexec,
Brad Spengler <spender@grsecurity.net>,
Eugene Teo, KAMEZAWA Hiroyuki
2010/9/15 KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>:
>
> Briefly says, to introduce new limit has bad benefit/risk balance. Sadly.
Well, I mostly agree. That said, I do think we could extend the
limiter some ways.
For example, I think the "stack limit / 4" is perfectly sane, but it
would make total sense to perhaps also take into account the AS and
RSS limits.
And I do think that your attempt to use __vm_enough_memory() was good.
It happens to be coded in a way that makes it useless for a one-pass
model, and some of what it does would be too expensive to do up-front
when you can't short-circuit it, but I do think that it would probably
be appropriate to at least try to take the _rough_ code there and use
it as a limit for maximum stack size too.
For example, we could have a function somewhat like
unsigned long max_stack_size(void)
{
unsigned long allowed, used, limit;
switch (sysctl_overcommit_memory) {
case OVERCOMMIT_ALWAYS:
allowed = ULONG_MAX;
break;
case OVERCOMMIT_GUESS:
.. maybe we can come up with some upper bound here too ..
break;
default:
allowed = (totalram_pages - hugetlb_total_pages())
* sysctl_overcommit_ratio / 100;
if (!cap_sys_admin)
allowed -= allowed / 32;
allowed += total_swap_pages;
/* Don't let a single process grow too big:
leave 3% of the size of this process for other processes */
if (mm)
allowed -= mm->total_vm / 32;
/* What is already committed to? */
used = percpu_counter_read_positive(&vm_committed_as);
if (used > allowed)
return 0;
allowed -= used;
break;
}
limit = ACCESS_ONCE(rlim[RLIMIT_STACK].rlim_cur) / 4;
if (allowed > limit)
allowed = limit;
return allowed;
}
which we'd call once at the beginning of the execve(), and then
remember that result and use it instead of the current 'rlimit/4'
value.
Now, admittedly the OVERCOMMIT_GUESS case is the interesting one, and
the one that is hard to write efficiently. But maybe we could make
'nr_free_pages()' cheap enough that doin that whole OVERCOMMIT_GUESS
"approximate free pages" thing from __vm_enough_memory would work out
too?
I dunno. It doesn't look hopeless.
Linus
^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: [PATCH 2/2] execve: check the VM has enough memory at first
@ 2010-09-16 15:01 ` Linus Torvalds
0 siblings, 0 replies; 81+ messages in thread
From: Linus Torvalds @ 2010-09-16 15:01 UTC (permalink / raw)
To: KOSAKI Motohiro
Cc: Roland McGrath, Andrew Morton, linux-kernel, oss-security,
Solar Designer, Kees Cook, Al Viro, Oleg Nesterov, Neil Horman,
linux-fsdevel, pageexec,
Brad Spengler <spender@grsecurity.net>,
Eugene Teo, KAMEZAWA Hiroyuki
2010/9/15 KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>:
>
> Briefly says, to introduce new limit has bad benefit/risk balance. Sadly.
Well, I mostly agree. That said, I do think we could extend the
limiter some ways.
For example, I think the "stack limit / 4" is perfectly sane, but it
would make total sense to perhaps also take into account the AS and
RSS limits.
And I do think that your attempt to use __vm_enough_memory() was good.
It happens to be coded in a way that makes it useless for a one-pass
model, and some of what it does would be too expensive to do up-front
when you can't short-circuit it, but I do think that it would probably
be appropriate to at least try to take the _rough_ code there and use
it as a limit for maximum stack size too.
For example, we could have a function somewhat like
unsigned long max_stack_size(void)
{
unsigned long allowed, used, limit;
switch (sysctl_overcommit_memory) {
case OVERCOMMIT_ALWAYS:
allowed = ULONG_MAX;
break;
case OVERCOMMIT_GUESS:
.. maybe we can come up with some upper bound here too ..
break;
default:
allowed = (totalram_pages - hugetlb_total_pages())
* sysctl_overcommit_ratio / 100;
if (!cap_sys_admin)
allowed -= allowed / 32;
allowed += total_swap_pages;
/* Don't let a single process grow too big:
leave 3% of the size of this process for other processes */
if (mm)
allowed -= mm->total_vm / 32;
/* What is already committed to? */
used = percpu_counter_read_positive(&vm_committed_as);
if (used > allowed)
return 0;
allowed -= used;
break;
}
limit = ACCESS_ONCE(rlim[RLIMIT_STACK].rlim_cur) / 4;
if (allowed > limit)
allowed = limit;
return allowed;
}
which we'd call once at the beginning of the execve(), and then
remember that result and use it instead of the current 'rlimit/4'
value.
Now, admittedly the OVERCOMMIT_GUESS case is the interesting one, and
the one that is hard to write efficiently. But maybe we could make
'nr_free_pages()' cheap enough that doin that whole OVERCOMMIT_GUESS
"approximate free pages" thing from __vm_enough_memory would work out
too?
I dunno. It doesn't look hopeless.
Linus
^ permalink raw reply [flat|nested] 81+ messages in thread