[PATCH] exec argument expansion can inappropriately trigger OOM-killer

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH] exec argument expansion can inappropriately trigger OOM-killer
@ 2010-08-27 22:02 Kees Cook
  2010-08-30  0:19 ` KOSAKI Motohiro
  2010-08-30  0:56 ` Roland McGrath
  0 siblings, 2 replies; 81+ messages in thread
From: Kees Cook @ 2010-08-27 22:02 UTC (permalink / raw)
  To: linux-kernel
  Cc: oss-security, Al Viro, Andrew Morton, Oleg Nesterov,
	KOSAKI Motohiro, Neil Horman, Roland McGrath, linux-fsdevel

Brad Spengler published a local memory-allocation DoS that
evades the OOM-killer (though not the virtual memory RLIMIT):
http://www.grsecurity.net/~spender/64bit_dos.c

The recent changes to create a stack guard page helps slightly to
discourage this attack, but it is not sufficient. Compiling it statically
moves the libraries out of the way, allowing the stack VMA to fill the
entire TASK_SIZE.

There are two issues:
 1) the OOM killer doesn't notice this argv memory explosion
 2) the argv expansion does not check if rlim[RLIMIT_STACK].rlim_cur is -1.

I figure a quick solution for #2 would be the following patch. However,
running multiple copies of this program could result in similar OOM
behavior, so issue #1 still needs a solution.

Reported-by: Brad Spengler <spender@grsecurity.net>
Signed-off-by: Kees Cook <kees.cook@canonical.com>
---
 fs/exec.c |    3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index dab85ec..be40063 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -194,7 +194,8 @@ static struct page *get_arg_page(struct linux_binprm *bprm, unsigned long pos,
 		 *    to work from.
 		 */
 		rlim = current->signal->rlim;
-		if (size > ACCESS_ONCE(rlim[RLIMIT_STACK].rlim_cur) / 4) {
+		if (size > ACCESS_ONCE(rlim[RLIMIT_STACK].rlim_cur) / 4 ||
+		    size > TASK_SIZE / 4) {
 			put_page(page);
 			return NULL;
 		}
-- 
1.7.1

-- 
Kees Cook
Ubuntu Security Team

^ permalink raw reply related	[flat|nested] 81+ messages in thread

* Re: [PATCH] exec argument expansion can inappropriately trigger OOM-killer
  2010-08-27 22:02 [PATCH] exec argument expansion can inappropriately trigger OOM-killer Kees Cook
@ 2010-08-30  0:19 ` KOSAKI Motohiro
  2010-08-30  0:56 ` Roland McGrath
  1 sibling, 0 replies; 81+ messages in thread
From: KOSAKI Motohiro @ 2010-08-30  0:19 UTC (permalink / raw)
  To: Kees Cook
  Cc: kosaki.motohiro, linux-kernel, oss-security, Al Viro,
	Andrew Morton, Oleg Nesterov, Neil Horman, Roland McGrath,
	linux-fsdevel

> Brad Spengler published a local memory-allocation DoS that
> evades the OOM-killer (though not the virtual memory RLIMIT):
> http://www.grsecurity.net/~spender/64bit_dos.c
> 
> The recent changes to create a stack guard page helps slightly to
> discourage this attack, but it is not sufficient. Compiling it statically
> moves the libraries out of the way, allowing the stack VMA to fill the
> entire TASK_SIZE.
> 
> There are two issues:
>  1) the OOM killer doesn't notice this argv memory explosion
>  2) the argv expansion does not check if rlim[RLIMIT_STACK].rlim_cur is -1.
> 
> I figure a quick solution for #2 would be the following patch. However,
> running multiple copies of this program could result in similar OOM
> behavior, so issue #1 still needs a solution.
>
>Reported-by: Brad Spengler <spender@grsecurity.net>
>Signed-off-by: Kees Cook <kees.cook@canonical.com>

Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>


And, I have a patch for #1. Can you please see this? Alternative idea
is to change rss accounting itself.



>From d4e114e5d31b14ebfc399d4b1fb142c7dfce0ca4 Mon Sep 17 00:00:00 2001
From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Date: Thu, 19 Aug 2010 20:40:20 +0900
Subject: [PATCH] oom: don't ignore temporary rss while execve

execve() makes new mm struct and setup stack and argv vector,
Unfortunately this new mm is not pointed any tasks, then oom-kill
can't detect this memory usage. therefore oom-kill may kill incorrect
task.

This patch added in-exec rss treatness to oom.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
 fs/compat.c             |    8 ++++++--
 fs/exec.c               |   19 +++++++++++++++++--
 include/linux/binfmts.h |    1 +
 include/linux/sched.h   |    1 +
 mm/oom_kill.c           |   36 +++++++++++++++++++++++++++---------
 5 files changed, 52 insertions(+), 13 deletions(-)

diff --git a/fs/compat.c b/fs/compat.c
index 718c706..643140c 100644
--- a/fs/compat.c
+++ b/fs/compat.c
@@ -1527,6 +1527,7 @@ int compat_do_execve(char * filename,
 	retval = bprm_mm_init(bprm);
 	if (retval)
 		goto out_file;
+	set_exec_mm(bprm->mm);
 
 	bprm->argc = compat_count(argv, MAX_ARG_STRINGS);
 	if ((retval = bprm->argc) < 0)
@@ -1560,6 +1561,7 @@ int compat_do_execve(char * filename,
 	/* execve succeeded */
 	current->fs->in_exec = 0;
 	current->in_execve = 0;
+	set_exec_mm(NULL);
 	acct_update_integrals(current);
 	free_bprm(bprm);
 	if (displaced)
@@ -1567,8 +1569,10 @@ int compat_do_execve(char * filename,
 	return retval;
 
 out:
-	if (bprm->mm)
-		mmput(bprm->mm);
+	if (current->in_exec_mm) {
+		struct mm_struct *in_exec_mm = set_exec_mm(NULL);
+		mmput (in_exec_mm);
+	}
 
 out_file:
 	if (bprm->file) {
diff --git a/fs/exec.c b/fs/exec.c
index 2d94552..85192e1 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1314,6 +1314,17 @@ int search_binary_handler(struct linux_binprm *bprm,struct pt_regs *regs)
 
 EXPORT_SYMBOL(search_binary_handler);
 
+struct mm_struct* set_exec_mm(struct mm_struct *mm)
+{
+	struct mm_struct *old = current->in_exec_mm;
+
+	task_lock(current);
+	current->in_exec_mm = mm;
+	task_unlock(current);
+
+	return old;
+}
+
 /*
  * sys_execve() executes a new program.
  */
@@ -1361,6 +1372,7 @@ int do_execve(const char * filename,
 	retval = bprm_mm_init(bprm);
 	if (retval)
 		goto out_file;
+	set_exec_mm(bprm->mm);
 
 	bprm->argc = count(argv, MAX_ARG_STRINGS);
 	if ((retval = bprm->argc) < 0)
@@ -1395,6 +1407,7 @@ int do_execve(const char * filename,
 	/* execve succeeded */
 	current->fs->in_exec = 0;
 	current->in_execve = 0;
+	set_exec_mm(NULL);
 	acct_update_integrals(current);
 	free_bprm(bprm);
 	if (displaced)
@@ -1402,8 +1415,10 @@ int do_execve(const char * filename,
 	return retval;
 
 out:
-	if (bprm->mm)
-		mmput (bprm->mm);
+	if (current->in_exec_mm) {
+		struct mm_struct *in_exec_mm = set_exec_mm(NULL);
+		mmput (in_exec_mm);
+	}
 
 out_file:
 	if (bprm->file) {
diff --git a/include/linux/binfmts.h b/include/linux/binfmts.h
index a065612..8cf61eb 100644
--- a/include/linux/binfmts.h
+++ b/include/linux/binfmts.h
@@ -133,6 +133,7 @@ extern void install_exec_creds(struct linux_binprm *bprm);
 extern void do_coredump(long signr, int exit_code, struct pt_regs *regs);
 extern void set_binfmt(struct linux_binfmt *new);
 extern void free_bprm(struct linux_binprm *);
+extern struct mm_struct* set_exec_mm(struct mm_struct *mm);
 
 #endif /* __KERNEL__ */
 #endif /* _LINUX_BINFMTS_H */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 1e2a6db..d413757 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1217,6 +1217,7 @@ struct task_struct {
 	struct plist_node pushable_tasks;
 
 	struct mm_struct *mm, *active_mm;
+	struct mm_struct *in_exec_mm;
 #if defined(SPLIT_RSS_COUNTING)
 	struct task_rss_stat	rss_stat;
 #endif
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 57c05f7..7fc6916 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -120,6 +120,30 @@ struct task_struct *find_lock_task_mm(struct task_struct *p)
 	return NULL;
 }
 
+static unsigned long calculate_rss_swap(struct task_struct *p)
+{
+	struct task_struct *t = p;
+	int mm_accounted = 0;
+	unsigned long points = 0;
+
+	do {
+		task_lock(t);
+		if (!mm_accounted && t->mm) {
+			points += get_mm_rss(t->mm);
+			points += get_mm_counter(t->mm, MM_SWAPENTS);
+			mm_accounted = 1;
+		}
+		if (t->in_exec_mm) {
+			points += get_mm_rss(t->in_exec_mm);
+			points += get_mm_counter(t->in_exec_mm, MM_SWAPENTS);
+		}
+		task_unlock(t);
+	} while_each_thread(p, t);
+
+	return points;
+}
+
+
 /* return true if the task is not adequate as candidate victim task. */
 static bool oom_unkillable_task(struct task_struct *p, struct mem_cgroup *mem,
 			   const nodemask_t *nodemask)
@@ -157,16 +181,11 @@ unsigned int oom_badness(struct task_struct *p, struct mem_cgroup *mem,
 	if (oom_unkillable_task(p, mem, nodemask))
 		return 0;
 
-	p = find_lock_task_mm(p);
-	if (!p)
-		return 0;
-
 	/*
 	 * Shortcut check for OOM_SCORE_ADJ_MIN so the entire heuristic doesn't
 	 * need to be executed for something that cannot be killed.
 	 */
 	if (p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN) {
-		task_unlock(p);
 		return 0;
 	}
 
@@ -175,7 +194,6 @@ unsigned int oom_badness(struct task_struct *p, struct mem_cgroup *mem,
 	 * priority for oom killing.
 	 */
 	if (p->flags & PF_OOM_ORIGIN) {
-		task_unlock(p);
 		return 1000;
 	}
 
@@ -190,9 +208,9 @@ unsigned int oom_badness(struct task_struct *p, struct mem_cgroup *mem,
 	 * The baseline for the badness score is the proportion of RAM that each
 	 * task's rss and swap space use.
 	 */
-	points = (get_mm_rss(p->mm) + get_mm_counter(p->mm, MM_SWAPENTS)) * 1000 /
-			totalpages;
-	task_unlock(p);
+	points = calculate_rss_swap(p) * 1000 / totalpages;
+	if (!points)
+		return 0;
 
 	/*
 	 * Root processes get 3% bonus, just like the __vm_enough_memory()
-- 
1.6.5.2






^ permalink raw reply related	[flat|nested] 81+ messages in thread

* Re: [PATCH] exec argument expansion can inappropriately trigger OOM-killer
  2010-08-27 22:02 [PATCH] exec argument expansion can inappropriately trigger OOM-killer Kees Cook
  2010-08-30  0:19 ` KOSAKI Motohiro
@ 2010-08-30  0:56 ` Roland McGrath
  2010-08-30  3:23     ` Solar Designer
  1 sibling, 1 reply; 81+ messages in thread
From: Roland McGrath @ 2010-08-30  0:56 UTC (permalink / raw)
  To: Kees Cook
  Cc: linux-kernel, oss-security, Al Viro, Andrew Morton,
	Oleg Nesterov, KOSAKI Motohiro, Neil Horman, linux-fsdevel

IMHO unlimited should mean unlimited.  So, on that score, I'd leave this
constraint out and just say whatever deficiencies in the OOM killer (or in
whatever should make a manifestly too-large allocation get ENOMEM) should
just be fixed separately.

But that aside, I'll just consider the intent stated in the comment in
get_arg_page:
		 * Limit to 1/4-th the stack size for the argv+env strings.
		 * This ensures that:
		 *  - the remaining binfmt code will not run out of stack space,
		 *  - the program will have a reasonable amount of stack left
		 *    to work from.
To effect "1/4th the stack size", a cap at TASK_SIZE/4 does make some sense,
since TASK_SIZE is less than RLIM_INFINITY even in the pure 32-bit world,
and that is the true theoretical limit on stack size.

The trouble here, both for that stated intent, and for this "exploit",
is which TASK_SIZE that is on a biarch machine.  In fact, it's the
TASK_SIZE of the process that called execve.  (get_arg_page is called
from copy_strings, from do_execve before search_binary_handler--i.e.,
before anything has looked at the file to decide whether it's going to
be a 32-bit or 64-bit task on exec.)  If it's a 32-bit process exec'ing
a 64-bit program, it's the 32-bit TASK_SIZE (perhaps as little as 3GB).
So that's a limit of 0.75GB on a 64-bit program, which might actually do
just fine with 2 or 3GB.  If it's a 64-bit process exec'ing a 32-bit
program, it's the 64-bit TASK_SIZE (128TB on x86-64).  So that's a limit
of 32TB, which is perhaps not that helpfully less than 2PB minus 1 byte
(RLIM_INFINITY/4) as far as preventing any over-allocation DoS in practice.

So IMHO your change does marginal harm in some cases (32 execs 64)
and makes no appreciable difference to anyone interested in malice
(who can just dodge by exploiting it via 64 execs 64 or 64 execs 32).

If you want to constrain it this way, it's probably simpler just to use
a smaller hard limit for RLIM_STACK at boot time (and hence system-wide).
But it sounds like all you really need is to fix the OOM/allocation
behavior for huge stack allocations.

Thanks,
Roland

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH] exec argument expansion can inappropriately trigger OOM-killer
@ 2010-08-30  3:23     ` Solar Designer
  0 siblings, 0 replies; 81+ messages in thread
From: Solar Designer @ 2010-08-30  3:23 UTC (permalink / raw)
  To: Roland McGrath
  Cc: Kees Cook, linux-kernel, oss-security, Al Viro, Andrew Morton,
	Oleg Nesterov, KOSAKI Motohiro, Neil Horman, linux-fsdevel

On Sun, Aug 29, 2010 at 05:56:48PM -0700, Roland McGrath wrote:
> IMHO unlimited should mean unlimited.

In general, I'd agree, however let's recall that back in 2.2 days we
introduced strnlen_user() and the "max" argument to count() to prevent a
userspace program from making the kernel loop busily for too long.
IIRC, prior to that fix, I was able to cause the kernel to loop for tens
of minutes in a single execve() call on an Alpha with 128 MB RAM, by
using repeated mappings of the same pages (almost 200 GB total).

Now it appears that, besides the issue that started this thread, the
same problem I mentioned above got re-introduced.  We still have
strnlen_user() and the "max" argument to count(), but we no longer have
hard limits for "max".  Someone set MAX_ARG_STRINGS to 0x7FFFFFFF, and
this is just too much.  MAX_ARG_STRLEN is set to 32 pages, and these two
combined allow a userspace program to make the kernel loop for days.

So I think that we should re-introduce some artificial limit(s), maybe
adjustable by root (by the host system's real root only when container
virtualization is involved).  Maybe we should lower MAX_ARG_STRINGS
and/or maybe we should limit the portion of stack space usable for argv
to, say, 0.75 GB (or even less).

> So, on that score, I'd leave this
> constraint out and just say whatever deficiencies in the OOM killer (or in
> whatever should make a manifestly too-large allocation get ENOMEM) should
> just be fixed separately.

I think the "OOM killer problem" should be fixed too.

> But that aside, I'll just consider the intent stated in the comment in
> get_arg_page:
> 		 * Limit to 1/4-th the stack size for the argv+env strings.
> 		 * This ensures that:
> 		 *  - the remaining binfmt code will not run out of stack space,
> 		 *  - the program will have a reasonable amount of stack left
> 		 *    to work from.
> To effect "1/4th the stack size", a cap at TASK_SIZE/4 does make some sense,
> since TASK_SIZE is less than RLIM_INFINITY even in the pure 32-bit world,
> and that is the true theoretical limit on stack size.
> 
> The trouble here, both for that stated intent, and for this "exploit",
> is which TASK_SIZE that is on a biarch machine.  In fact, it's the
> TASK_SIZE of the process that called execve.  (get_arg_page is called
> from copy_strings, from do_execve before search_binary_handler--i.e.,
> before anything has looked at the file to decide whether it's going to
> be a 32-bit or 64-bit task on exec.)  If it's a 32-bit process exec'ing
> a 64-bit program, it's the 32-bit TASK_SIZE (perhaps as little as 3GB).
> So that's a limit of 0.75GB on a 64-bit program, which might actually do
> just fine with 2 or 3GB.  If it's a 64-bit process exec'ing a 32-bit
> program, it's the 64-bit TASK_SIZE (128TB on x86-64).  So that's a limit
> of 32TB, which is perhaps not that helpfully less than 2PB minus 1 byte
> (RLIM_INFINITY/4) as far as preventing any over-allocation DoS in practice.

That's a very good point!

> If you want to constrain it this way, it's probably simpler just to use
> a smaller hard limit for RLIM_STACK at boot time (and hence system-wide).

Right.

> But it sounds like all you really need is to fix the OOM/allocation
> behavior for huge stack allocations.

No, we need both.

Additionally, 64bit_dos.c mentions that "it triggers a BUG() as the
stack tries to expand around the address space when shifted".  Perhaps
limiting the stack size would deal with that, but maybe the "bug" needs
to be patched elsewhere as well.  grsecurity has the following hunk:

@@ -505,7 +520,8 @@ static int shift_arg_pages(struct vm_are
 	unsigned long new_end = old_end - shift;
 	struct mmu_gather *tlb;

-	BUG_ON(new_start > new_end);
+	if (new_start >= new_end || new_start < mmap_min_addr)
+		return -EFAULT;

 	/*
 	 * ensure there are no vmas between where we want to go

which is likely part of a fix (but not the entire fix) for what the
comment in 64bit_dos.c refers to.  However, I was not able to trigger
the BUG_ON() on RHEL5'ish kernels by simply running the 64bit_dos
program (64-bit to 32-bit exec) on two systems, one with 2 GB RAM, the
other with 4 GB.  Of course, I set "ulimit -s unlimited" first.

On the 2 GB system, I saw the OOM killer problem (several other
processes got killed before 64bit_dos was killed as well).  On the 4 GB
system, exec succeeded (after looping in the kernel for a few seconds,
and then the newly started program failed because of its address space
exhaustion).  Maybe the BUG is only triggerable on certain other kernel
versions, or maybe I didn't try hard enough (I certainly did not try
very hard - I did not review the code closely).

Someone could want to look into this aspect as well.

grsecurity's added check against mmap_min_addr (and the reference to it
in the "exploit's" comment) is also very curious.  It can be just a way
to avoid triggering another BUG() elsewhere, or it can be just an extra
hardening measure - but it could also be "for real".  We could want to
double-check that there wasn't an mmap_min_addr bypass possible here.

Overall, there are multiple issues here (maybe up to four?) and multiple
things to review the code for.

Does anyone (with more time than me) want to look into these for real?
(I sure hope so.)

Thanks,

Alexander

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH] exec argument expansion can inappropriately trigger OOM-killer
@ 2010-08-30  3:23     ` Solar Designer
  0 siblings, 0 replies; 81+ messages in thread
From: Solar Designer @ 2010-08-30  3:23 UTC (permalink / raw)
  To: Roland McGrath
  Cc: Kees Cook, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	oss-security-ZwoEplunGu1jrUoiu81ncdBPR1lH4CV8, Al Viro,
	Andrew Morton, Oleg Nesterov, KOSAKI Motohiro, Neil Horman,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA

On Sun, Aug 29, 2010 at 05:56:48PM -0700, Roland McGrath wrote:
> IMHO unlimited should mean unlimited.

In general, I'd agree, however let's recall that back in 2.2 days we
introduced strnlen_user() and the "max" argument to count() to prevent a
userspace program from making the kernel loop busily for too long.
IIRC, prior to that fix, I was able to cause the kernel to loop for tens
of minutes in a single execve() call on an Alpha with 128 MB RAM, by
using repeated mappings of the same pages (almost 200 GB total).

Now it appears that, besides the issue that started this thread, the
same problem I mentioned above got re-introduced.  We still have
strnlen_user() and the "max" argument to count(), but we no longer have
hard limits for "max".  Someone set MAX_ARG_STRINGS to 0x7FFFFFFF, and
this is just too much.  MAX_ARG_STRLEN is set to 32 pages, and these two
combined allow a userspace program to make the kernel loop for days.

So I think that we should re-introduce some artificial limit(s), maybe
adjustable by root (by the host system's real root only when container
virtualization is involved).  Maybe we should lower MAX_ARG_STRINGS
and/or maybe we should limit the portion of stack space usable for argv
to, say, 0.75 GB (or even less).

> So, on that score, I'd leave this
> constraint out and just say whatever deficiencies in the OOM killer (or in
> whatever should make a manifestly too-large allocation get ENOMEM) should
> just be fixed separately.

I think the "OOM killer problem" should be fixed too.

> But that aside, I'll just consider the intent stated in the comment in
> get_arg_page:
> 		 * Limit to 1/4-th the stack size for the argv+env strings.
> 		 * This ensures that:
> 		 *  - the remaining binfmt code will not run out of stack space,
> 		 *  - the program will have a reasonable amount of stack left
> 		 *    to work from.
> To effect "1/4th the stack size", a cap at TASK_SIZE/4 does make some sense,
> since TASK_SIZE is less than RLIM_INFINITY even in the pure 32-bit world,
> and that is the true theoretical limit on stack size.
> 
> The trouble here, both for that stated intent, and for this "exploit",
> is which TASK_SIZE that is on a biarch machine.  In fact, it's the
> TASK_SIZE of the process that called execve.  (get_arg_page is called
> from copy_strings, from do_execve before search_binary_handler--i.e.,
> before anything has looked at the file to decide whether it's going to
> be a 32-bit or 64-bit task on exec.)  If it's a 32-bit process exec'ing
> a 64-bit program, it's the 32-bit TASK_SIZE (perhaps as little as 3GB).
> So that's a limit of 0.75GB on a 64-bit program, which might actually do
> just fine with 2 or 3GB.  If it's a 64-bit process exec'ing a 32-bit
> program, it's the 64-bit TASK_SIZE (128TB on x86-64).  So that's a limit
> of 32TB, which is perhaps not that helpfully less than 2PB minus 1 byte
> (RLIM_INFINITY/4) as far as preventing any over-allocation DoS in practice.

That's a very good point!

> If you want to constrain it this way, it's probably simpler just to use
> a smaller hard limit for RLIM_STACK at boot time (and hence system-wide).

Right.

> But it sounds like all you really need is to fix the OOM/allocation
> behavior for huge stack allocations.

No, we need both.

Additionally, 64bit_dos.c mentions that "it triggers a BUG() as the
stack tries to expand around the address space when shifted".  Perhaps
limiting the stack size would deal with that, but maybe the "bug" needs
to be patched elsewhere as well.  grsecurity has the following hunk:

@@ -505,7 +520,8 @@ static int shift_arg_pages(struct vm_are
 	unsigned long new_end = old_end - shift;
 	struct mmu_gather *tlb;

-	BUG_ON(new_start > new_end);
+	if (new_start >= new_end || new_start < mmap_min_addr)
+		return -EFAULT;

 	/*
 	 * ensure there are no vmas between where we want to go

which is likely part of a fix (but not the entire fix) for what the
comment in 64bit_dos.c refers to.  However, I was not able to trigger
the BUG_ON() on RHEL5'ish kernels by simply running the 64bit_dos
program (64-bit to 32-bit exec) on two systems, one with 2 GB RAM, the
other with 4 GB.  Of course, I set "ulimit -s unlimited" first.

On the 2 GB system, I saw the OOM killer problem (several other
processes got killed before 64bit_dos was killed as well).  On the 4 GB
system, exec succeeded (after looping in the kernel for a few seconds,
and then the newly started program failed because of its address space
exhaustion).  Maybe the BUG is only triggerable on certain other kernel
versions, or maybe I didn't try hard enough (I certainly did not try
very hard - I did not review the code closely).

Someone could want to look into this aspect as well.

grsecurity's added check against mmap_min_addr (and the reference to it
in the "exploit's" comment) is also very curious.  It can be just a way
to avoid triggering another BUG() elsewhere, or it can be just an extra
hardening measure - but it could also be "for real".  We could want to
double-check that there wasn't an mmap_min_addr bypass possible here.

Overall, there are multiple issues here (maybe up to four?) and multiple
things to review the code for.

Does anyone (with more time than me) want to look into these for real?
(I sure hope so.)

Thanks,

Alexander

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH] exec argument expansion can inappropriately trigger OOM-killer
  2010-08-30  3:23     ` Solar Designer
  (?)
@ 2010-08-30 10:06     ` Roland McGrath
  2010-08-30 19:48       ` Solar Designer
  2010-09-08  2:34         ` Roland McGrath
  -1 siblings, 2 replies; 81+ messages in thread
From: Roland McGrath @ 2010-08-30 10:06 UTC (permalink / raw)
  To: Solar Designer
  Cc: Kees Cook, linux-kernel, oss-security, Al Viro, Andrew Morton,
	Oleg Nesterov, KOSAKI Motohiro, Neil Horman, linux-fsdevel

> IIRC, prior to that fix, I was able to cause the kernel to loop for tens
> of minutes in a single execve() call on an Alpha with 128 MB RAM, by
> using repeated mappings of the same pages (almost 200 GB total).

And I say, if your userland process could really allocate another 200GB,
then more power to you, you can do it with an exec too.  If you could do
the same with a userland stack allocation, and spend all that time in
strlen calls and then memcpy, you can do it inside execve too.  If it
takes days, that's what you asked for, and it's your process.  It just
ought to be every bit (or near enough) as preemptible and interruptible
as that normal userland activity ought to be.

So, perhaps we want this (count already has a cond_resched in its loop):

diff --git a/fs/exec.c b/fs/exec.c
index 2d94552..0000000 100644  
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -369,6 +369,9 @@ static int count(const char __user * con
 		for (;;) {
 			const char __user * p;

+			if (signal_pending(current))
+				return -ERESTARTNOINTR;
+
 			if (get_user(p, argv))
 				return -EFAULT;
 			if (!p)
@@ -400,6 +403,10 @@ static int copy_strings(int argc, const 
 		int len;
 		unsigned long pos;

+		if (signal_pending(current))
+			return -ERESTARTNOINTR;
+		cond_resched();
+
 		if (get_user(str, argv+argc) ||
 				!(len = strnlen_user(str, MAX_ARG_STRLEN))) {
 			ret = -EFAULT;

> Now it appears that, besides the issue that started this thread, the
> same problem I mentioned above got re-introduced.  We still have
> strnlen_user() and the "max" argument to count(), but we no longer have
> hard limits for "max".  Someone set MAX_ARG_STRINGS to 0x7FFFFFFF, and
> this is just too much.  MAX_ARG_STRLEN is set to 32 pages, and these two
> combined allow a userspace program to make the kernel loop for days.

I really don't think we need that stuff back.  I think we can get rid of
it and fix the real problems, and be happier overall.

> > But it sounds like all you really need is to fix the OOM/allocation
> > behavior for huge stack allocations.
> 
> No, we need both.

I don't agree.  If all of the implicit allocation done inside execve is
accounted and controlled as well as normal userland allocations so that
the execve fails when userland allocation would fail, then there is no
reason for special-case arbitrary limits.

> Additionally, 64bit_dos.c mentions that "it triggers a BUG() as the
> stack tries to expand around the address space when shifted".  Perhaps
> limiting the stack size would deal with that, but maybe the "bug" needs
> to be patched elsewhere as well.  grsecurity has the following hunk:

That change seems like it might be reasonable, but I haven't really
looked at shift_arg_pages before.  Has someone reported this BUG_ON
failure mode with a reproducer?

> Overall, there are multiple issues here (maybe up to four?) and multiple
> things to review the code for.

Agreed.  But IMHO the missing arbitrary limits on arg/env size are not
among them.  I don't know about shift_arg_pages.  The core fix I think
makes sense is making the nascent mm get accounted to the user process
normally.  Rather than better enabling OOM killing, I think what really
makes sense is for the nascent mm to be marked such that allocations in
it (they'll be from get_arg_page->get_user_pages->handle_mm_fault) just
fail with ENOMEM before it resorts to the OOM killer (or perhaps even to
very aggressive pageout).  That should percolate back to the execve just
failing with ENOMEM, which is nicer than OOM kill even if the OOM killer
actually does pick exactly and only the right target.

Thanks,
Roland

^ permalink raw reply related	[flat|nested] 81+ messages in thread

* Re: [PATCH] exec argument expansion can inappropriately trigger OOM-killer
@ 2010-08-30 17:49       ` Solar Designer
  0 siblings, 0 replies; 81+ messages in thread
From: Solar Designer @ 2010-08-30 17:49 UTC (permalink / raw)
  To: Roland McGrath
  Cc: Kees Cook, linux-kernel, oss-security, Al Viro, Andrew Morton,
	Oleg Nesterov, KOSAKI Motohiro, Neil Horman, linux-fsdevel

On Mon, Aug 30, 2010 at 07:23:31AM +0400, Solar Designer wrote:
> Additionally, 64bit_dos.c mentions that "it triggers a BUG() as the
> stack tries to expand around the address space when shifted".
[...]
> which is likely part of a fix (but not the entire fix) for what the
> comment in 64bit_dos.c refers to.  However, I was not able to trigger
> the BUG_ON() on RHEL5'ish kernels by simply running the 64bit_dos
> program (64-bit to 32-bit exec) on two systems, one with 2 GB RAM, the
> other with 4 GB.  Of course, I set "ulimit -s unlimited" first.

I am finally able to trigger the BUG by replacing "/bin/sh" with
"/bin/false" in 64bit_dos.c, relying on our library-free implementation
of /bin/false on Owl:

owl!solar:~$ objdump -d /bin/false

/bin/false:     file format elf32-i386

Disassembly of section .text:

08048074 <.text>:
 8048074:       b8 01 00 00 00          mov    $0x1,%eax
 8048079:       bb 01 00 00 00          mov    $0x1,%ebx
 804807e:       cd 80                   int    $0x80

owl!solar:~$ file 64bit_dos
64bit_dos: ELF 64-bit LSB executable, AMD x86-64, version 1 (SYSV), for GNU/Linux 2.4.0, statically linked, for GNU/Linux 2.4.0, stripped

(the "exploit" is statically-linked since I brought it from another
machine, I don't think this matters).

After looping in the kernel for about 10 seconds, I got:

Kernel BUG at fs/exec.c:535
[...]
Call Trace:
 [<ffffffff8007c44c>] load_elf32_binary+0x7f9/0x1702
 [<ffffffff800c193f>] expand_stack+0x7f/0xad
 [<ffffffff8003e39b>] search_binary_handler+0x94/0x1e2
 [<ffffffff8003da2f>] do_execve+0x18e/0x1f2
 [<ffffffff800542cc>] sys_execve+0x34/0x51
 [<ffffffff8005e523>] stub_execve+0x67/0xb0
[...]
Code: 0f 0b 68 fe 93 3e 80 c2 17 02 48 8b 7c 24 08 4c 89 fe e8 da 
RIP  [<ffffffff8002dc90>] setup_arg_pages+0x151/0x2d3

The kernel is a revision and custom build of 2.6.18-128.2.1.el5
(whatever I happened to readily have installed on a test system).
I think the problem should be reproducible with current RHEL5 kernels
and likely with latest mainstream kernels as well.

The process is stuck:

solar    28754  2.8 77.8 3142276 3142276 pts/0 D+   10:34   0:13 [false]

(uninterruptible)

3 GB memory is still taken.

Alexander

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH] exec argument expansion can inappropriately trigger OOM-killer
@ 2010-08-30 17:49       ` Solar Designer
  0 siblings, 0 replies; 81+ messages in thread
From: Solar Designer @ 2010-08-30 17:49 UTC (permalink / raw)
  To: Roland McGrath
  Cc: Kees Cook, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	oss-security-ZwoEplunGu1jrUoiu81ncdBPR1lH4CV8, Al Viro,
	Andrew Morton, Oleg Nesterov, KOSAKI Motohiro, Neil Horman,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA

On Mon, Aug 30, 2010 at 07:23:31AM +0400, Solar Designer wrote:
> Additionally, 64bit_dos.c mentions that "it triggers a BUG() as the
> stack tries to expand around the address space when shifted".
[...]
> which is likely part of a fix (but not the entire fix) for what the
> comment in 64bit_dos.c refers to.  However, I was not able to trigger
> the BUG_ON() on RHEL5'ish kernels by simply running the 64bit_dos
> program (64-bit to 32-bit exec) on two systems, one with 2 GB RAM, the
> other with 4 GB.  Of course, I set "ulimit -s unlimited" first.

I am finally able to trigger the BUG by replacing "/bin/sh" with
"/bin/false" in 64bit_dos.c, relying on our library-free implementation
of /bin/false on Owl:

owl!solar:~$ objdump -d /bin/false

/bin/false:     file format elf32-i386

Disassembly of section .text:

08048074 <.text>:
 8048074:       b8 01 00 00 00          mov    $0x1,%eax
 8048079:       bb 01 00 00 00          mov    $0x1,%ebx
 804807e:       cd 80                   int    $0x80

owl!solar:~$ file 64bit_dos
64bit_dos: ELF 64-bit LSB executable, AMD x86-64, version 1 (SYSV), for GNU/Linux 2.4.0, statically linked, for GNU/Linux 2.4.0, stripped

(the "exploit" is statically-linked since I brought it from another
machine, I don't think this matters).

After looping in the kernel for about 10 seconds, I got:

Kernel BUG at fs/exec.c:535
[...]
Call Trace:
 [<ffffffff8007c44c>] load_elf32_binary+0x7f9/0x1702
 [<ffffffff800c193f>] expand_stack+0x7f/0xad
 [<ffffffff8003e39b>] search_binary_handler+0x94/0x1e2
 [<ffffffff8003da2f>] do_execve+0x18e/0x1f2
 [<ffffffff800542cc>] sys_execve+0x34/0x51
 [<ffffffff8005e523>] stub_execve+0x67/0xb0
[...]
Code: 0f 0b 68 fe 93 3e 80 c2 17 02 48 8b 7c 24 08 4c 89 fe e8 da 
RIP  [<ffffffff8002dc90>] setup_arg_pages+0x151/0x2d3

The kernel is a revision and custom build of 2.6.18-128.2.1.el5
(whatever I happened to readily have installed on a test system).
I think the problem should be reproducible with current RHEL5 kernels
and likely with latest mainstream kernels as well.

The process is stuck:

solar    28754  2.8 77.8 3142276 3142276 pts/0 D+   10:34   0:13 [false]

(uninterruptible)

3 GB memory is still taken.

Alexander

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH] exec argument expansion can inappropriately trigger OOM-killer
  2010-08-30 10:06     ` Roland McGrath
@ 2010-08-30 19:48       ` Solar Designer
  2010-08-31  0:40         ` Roland McGrath
  2010-09-08  2:34         ` Roland McGrath
  1 sibling, 1 reply; 81+ messages in thread
From: Solar Designer @ 2010-08-30 19:48 UTC (permalink / raw)
  To: Roland McGrath
  Cc: Kees Cook, linux-kernel, oss-security, Al Viro, Andrew Morton,
	Oleg Nesterov, KOSAKI Motohiro, Neil Horman, linux-fsdevel

On Mon, Aug 30, 2010 at 03:06:16AM -0700, Roland McGrath wrote:
> And I say, if your userland process could really allocate another 200GB,
> then more power to you, you can do it with an exec too.  If you could do
> the same with a userland stack allocation, and spend all that time in
> strlen calls and then memcpy, you can do it inside execve too.  If it
> takes days, that's what you asked for, and it's your process.  It just
> ought to be every bit (or near enough) as preemptible and interruptible
> as that normal userland activity ought to be.

This makes sense to me.  However, introducing a new preemption point
may violate assumptions under which the code was written and reviewed
in the past.  In the worst case, we'd introduce/expose race conditions
allowing for privilege escalation.

> So, perhaps we want this (count already has a cond_resched in its loop):

Good point re: count() already having this (I think it did not in 2.2).

> @@ -400,6 +403,10 @@ static int copy_strings(int argc, const 
>  		int len;
>  		unsigned long pos;
>  
> +		if (signal_pending(current))
> +			return -ERESTARTNOINTR;
> +		cond_resched();

So, in current kernels, you're making it possible for more kinds of
things to change after prepare_binprm() but before
search_binary_handler().  We'd need to check for possible implications
of this.

I must admit I am not familiar with what additional kinds of things may
change when execution is preempted.  This made a significant difference
in some much older kernels (many years ago), but now that the kernel
makes a lot less use of locking most things may be changed by another
CPU even without preemption.  So does anyone have a list of what
additional risks we're exposed to, if any, when we allow preemption in
current kernels?

> Has someone reported this BUG_ON failure mode with a reproducer?

64bit_dos.c was supposed to be the reproducer, and I managed to get it
to work (as I've documented in another message earlier today).  The
prerequisites appeared to be (some of these might be specific to my
tests, though):

- 64-bit kernel with 32-bit userland support (e.g., CONFIG_IA32_EMULATION);
- 64-bit build of 64bit_dos.c;
- 32-bit build of the target program;
- no dynamic linking in the target program;
- "ulimit -s unlimited" before running the reproducer program;
- over 3 GB of RAM in the system.

> [...]  Rather than better enabling OOM killing, I think what really
> makes sense is for the nascent mm to be marked such that allocations in
> it (they'll be from get_arg_page->get_user_pages->handle_mm_fault) just
> fail with ENOMEM before it resorts to the OOM killer (or perhaps even to
> very aggressive pageout).  That should percolate back to the execve just
> failing with ENOMEM, which is nicer than OOM kill even if the OOM killer
> actually does pick exactly and only the right target.

I agree.

Thanks,

Alexander

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH] exec argument expansion can inappropriately trigger OOM-killer
@ 2010-08-30 22:08         ` Brad Spengler
  0 siblings, 0 replies; 81+ messages in thread
From: Brad Spengler @ 2010-08-30 22:08 UTC (permalink / raw)
  To: Solar Designer
  Cc: Roland McGrath, Kees Cook, linux-kernel, oss-security, Al Viro,
	Andrew Morton, Oleg Nesterov, KOSAKI Motohiro, Neil Horman,
	linux-fsdevel, pageexec

[-- Attachment #1: Type: text/plain, Size: 3745 bytes --]

Hi guys,

I see you're having fun with my code ;)  Just wanted to remind you that 
I do exist; I reported this in December 2009 to Ted Tso and again 
recently (forwarded the same email from 2009) to Kees Cook and James 
Morris.  So even though nobody's actually emailed me about the issue(s), 
I am available to answer any questions.  Just CC me on the email as I'm 
not subscribed to the list.

Anyway, I did actually research the bug(s) involved quite a bit around 
the time I reported it, so hopefully some of the below will help.

The bug seems to have been introduced in 2.6.23, see:
http://thread.gmane.org/gmane.linux.ports.hppa/752
http://www.spinics.net/lists/linux-arch/msg01584.html
http://www.mail-archive.com/linux-kernel@vger.kernel.org/msg170491.html
though I'm guessing the functionality was also backported to major 
distros

The check using the current stack limit as a byte value (including when 
it's RLIMIT_INFINITY) by dividing it by 4 is completely broken for a 
number of reasons:
1/4th of a 64bit ~0 is several times larger than the size of addressable 
userland address space on most 64bit architectures (amd64 is 47 bits)
1/4th of a 64bit ~0 is way bigger than any 32bit value
No other place in the kernel treats a limit in this way
With a high rlimit and high usage, the code doesn't do what it intends 
to do -- be able to return a meaningful error message to execve, instead 
of having the process doing the execve terminated with a SIGKILL in 
later stages of ELF loading.

Combined with the fact that the max arg size * max number of args 
(~256TB) is also again larger than the 32bit address space and the 
amd64 userland address space (as well as the address space of several other 
64bit architectures), lots of problems appear.  This was exacerbated by 
the behavior that, until recently fixed, allowed the stack to grow over 
any other existing mappings.  Combine this with ASLR and shifting that 
whole stack range down a random amount via shift_arg_pages/adjust_vma 
which skips many of the sanity checks that exist elsewhere, and even 
more problems appear (I think this latter problem may make it possible 
to still trigger the BUG_ON() on patched kernels if a static binary is 
used and ASLR shifts the stack over the binary).

From my research, it's not possible to successfully execute a binary 
such that mmap_min_addr with normal values can be bypassed by the stack 
shifting trick.  I was able to determine the exact number+sizes of 
arguments to use for ASLR to have a chance to shift the stack down to 0 
(any more and we would trigger that BUG_ON()).  Though this was 
successful, after this point in the ELF loader, additional data is set 
up on the stack, proportional to the number of arguments passed.  It's 
impossible for this additional setup to consume less than a page, so it 
triggers a stack expansion which then gets checked against the normal 
mmap_min_addr checks.  What actually ends up happening on this 
second-stage setup is the binary gets killed with SIGKILL by the ELF 
loader.

The fix to the OOM killer looks correct, but the other problem (causing 
extreme interactivity hits with almost no effort in userland) needs some 
more thought, especially since it appears no distro is shipping with 
hard limits on the stack.

If the  "/ 4" check is to be preserved, it needs to take into account 
the personality of the target binary: this way, the exec'ing task should 
always get a proper error back instead of being terminated by the 
kernel.

There shouldn't be any additional risk from adding the extra rescheds, 
as copy_*_user can already sleep and be raced against via a number of 
methods.

-Brad

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 197 bytes --]

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH] exec argument expansion can inappropriately trigger OOM-killer
@ 2010-08-30 22:08         ` Brad Spengler
  0 siblings, 0 replies; 81+ messages in thread
From: Brad Spengler @ 2010-08-30 22:08 UTC (permalink / raw)
  To: Solar Designer
  Cc: Roland McGrath, Kees Cook, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	oss-security-ZwoEplunGu1jrUoiu81ncdBPR1lH4CV8, Al Viro,
	Andrew Morton, Oleg Nesterov, KOSAKI Motohiro, Neil Horman,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	pageexec-Y8qEzhMunLyT9ig0jae3mg

[-- Attachment #1: Type: text/plain, Size: 3769 bytes --]

Hi guys,

I see you're having fun with my code ;)  Just wanted to remind you that 
I do exist; I reported this in December 2009 to Ted Tso and again 
recently (forwarded the same email from 2009) to Kees Cook and James 
Morris.  So even though nobody's actually emailed me about the issue(s), 
I am available to answer any questions.  Just CC me on the email as I'm 
not subscribed to the list.

Anyway, I did actually research the bug(s) involved quite a bit around 
the time I reported it, so hopefully some of the below will help.

The bug seems to have been introduced in 2.6.23, see:
http://thread.gmane.org/gmane.linux.ports.hppa/752
http://www.spinics.net/lists/linux-arch/msg01584.html
http://www.mail-archive.com/linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org/msg170491.html
though I'm guessing the functionality was also backported to major 
distros

The check using the current stack limit as a byte value (including when 
it's RLIMIT_INFINITY) by dividing it by 4 is completely broken for a 
number of reasons:
1/4th of a 64bit ~0 is several times larger than the size of addressable 
userland address space on most 64bit architectures (amd64 is 47 bits)
1/4th of a 64bit ~0 is way bigger than any 32bit value
No other place in the kernel treats a limit in this way
With a high rlimit and high usage, the code doesn't do what it intends 
to do -- be able to return a meaningful error message to execve, instead 
of having the process doing the execve terminated with a SIGKILL in 
later stages of ELF loading.

Combined with the fact that the max arg size * max number of args 
(~256TB) is also again larger than the 32bit address space and the 
amd64 userland address space (as well as the address space of several other 
64bit architectures), lots of problems appear.  This was exacerbated by 
the behavior that, until recently fixed, allowed the stack to grow over 
any other existing mappings.  Combine this with ASLR and shifting that 
whole stack range down a random amount via shift_arg_pages/adjust_vma 
which skips many of the sanity checks that exist elsewhere, and even 
more problems appear (I think this latter problem may make it possible 
to still trigger the BUG_ON() on patched kernels if a static binary is 
used and ASLR shifts the stack over the binary).

From my research, it's not possible to successfully execute a binary 
such that mmap_min_addr with normal values can be bypassed by the stack 
shifting trick.  I was able to determine the exact number+sizes of 
arguments to use for ASLR to have a chance to shift the stack down to 0 
(any more and we would trigger that BUG_ON()).  Though this was 
successful, after this point in the ELF loader, additional data is set 
up on the stack, proportional to the number of arguments passed.  It's 
impossible for this additional setup to consume less than a page, so it 
triggers a stack expansion which then gets checked against the normal 
mmap_min_addr checks.  What actually ends up happening on this 
second-stage setup is the binary gets killed with SIGKILL by the ELF 
loader.

The fix to the OOM killer looks correct, but the other problem (causing 
extreme interactivity hits with almost no effort in userland) needs some 
more thought, especially since it appears no distro is shipping with 
hard limits on the stack.

If the  "/ 4" check is to be preserved, it needs to take into account 
the personality of the target binary: this way, the exec'ing task should 
always get a proper error back instead of being terminated by the 
kernel.

There shouldn't be any additional risk from adding the extra rescheds, 
as copy_*_user can already sleep and be raced against via a number of 
methods.

-Brad

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 197 bytes --]

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH] exec argument expansion can inappropriately trigger OOM-killer
  2010-08-30 19:48       ` Solar Designer
@ 2010-08-31  0:40         ` Roland McGrath
  0 siblings, 0 replies; 81+ messages in thread
From: Roland McGrath @ 2010-08-31  0:40 UTC (permalink / raw)
  To: Solar Designer
  Cc: Kees Cook, linux-kernel, oss-security, Al Viro, Andrew Morton,
	Oleg Nesterov, KOSAKI Motohiro, Neil Horman, linux-fsdevel,
	Brad Spengler, pageexec

> This makes sense to me.  However, introducing a new preemption point
> may violate assumptions under which the code was written and reviewed
> in the past.  In the worst case, we'd introduce/expose race conditions
> allowing for privilege escalation.

This can't be so.  There are already many possibilities for preemption
in the get_user_pages code paths (called from get_arg_page).

> > So, perhaps we want this (count already has a cond_resched in its loop):
> 
> Good point re: count() already having this (I think it did not in 2.2).

Indeed, this too is a clear indication that preemption here is already safe.

> > +		if (signal_pending(current))
> > +			return -ERESTARTNOINTR;
> > +		cond_resched();
> 
> So, in current kernels, you're making it possible for more kinds of
> things to change after prepare_binprm() but before
> search_binary_handler().  We'd need to check for possible implications
> of this.

What "change"?  Preemption is already possible, that's nothing new.
The only difference is that we might notice TIF_SIGPENDING having been
set, and bail out either before or after prepare_binprm, and so never
call search_binary_handler at all.

The user-visible effect of this could be that a process taking many signals
way too fast could get stuck permanently running its signal handlers and
never actually really attempting the execve (but perhaps doing some mm
setup and arg-copying work and then throwing it away, repeatedly).  Before,
the speed of the signals might have been such that the process could get
into the execve syscall, so it might happen to complete when now it would
be more likely to handle another signal before doing the exec instead.

That, and debuggers might ever see -ERESTARTNOHAND status in syscall
tracing for execve (as they already might for many other syscalls, so they
should be fine).

Note that get_user_pages() already fails for SIGKILL.  So a more
conservative change would be just fatal_signal_pending(current) checks.
That lets only SIGKILL short-circuit the execve to kill the process,
without just letting normal signals interrupt execve like they would
interrupt any other system call.  That should have exactly no
user-visible effects (aside from fairer scheduling in CONFIG_PREEMPT=n
kernels, and better responsiveness to SIGKILL in all kernels).  But I
think it's fine and proper to let all signals interrupt execve normally.

> additional risks we're exposed to, if any, when we allow preemption in
> current kernels?

That's not so much the point, since we already have plenty of places in
this code path that have voluntary preemption points.  We're just adding
more, which improves interactive scheduling performance for the rest of the
system when execve takes a long time copying the strings.

> - 64-bit kernel with 32-bit userland support (e.g., CONFIG_IA32_EMULATION);
> - 64-bit build of 64bit_dos.c;
> - 32-bit build of the target program;
> - no dynamic linking in the target program;
> - "ulimit -s unlimited" before running the reproducer program;
> - over 3 GB of RAM in the system.

I reproduced it too.  The lack of dynamic linking is not really a
requirement to get it to hit, though whether it fails in the kernel can
vary with the stack randomization.  (Sometimes it will instead be the
userland dynamic linker dying with a message that mmap failed.)

I think the right fix for that case is this:

diff --git a/fs/exec.c b/fs/exec.c
index 2d94552..0000000 100644  
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -594,6 +594,11 @@ int setup_arg_pages(struct linux_binprm 
 #else
 	stack_top = arch_align_stack(stack_top);
 	stack_top = PAGE_ALIGN(stack_top);
+
+	if (unlikely(stack_top < mmap_min_addr) ||
+	    unlikely(vma->vm_end - vma->vm_start >= stack_top - mmap_min_addr))
+		return -ENOMEM;
+
 	stack_shift = vma->vm_end - stack_top;

 	bprm->p -= stack_shift;

That is consistent with the existing CONFIG_STACK_GROWSUP code's check
(though without the arbitrary 1GB limit it has).  We won't ever get to
shift_arg_pages if there isn't space to do the shift.

Brad made several of the same points that I made earlier and in this
message, so I concur with those. :-) 

Brad mentioned a possible BUG_ON if the executable's mappings clobber
the stack mapping.  With the fix above, I'm not aware how that could
happen.  I think it would just clobber part of the mapping with the
MAP_FIXED mapping, meaning it shortens the stack vma from the bottom, or
maybe splits it in the middle.  Then at user-mode entry point time, the
SP will point into some mapped text or data of the executable, and read
whatever is there instead of argc and argv[], etc.--it won't be as
graceful as an abortive execve, but it will just "safely" crash in some
random but normal userland sort of way.

Thanks,
Roland

^ permalink raw reply related	[flat|nested] 81+ messages in thread

* Re: [PATCH] exec argument expansion can inappropriately trigger OOM-killer
@ 2010-08-31 11:53           ` Solar Designer
  0 siblings, 0 replies; 81+ messages in thread
From: Solar Designer @ 2010-08-31 11:53 UTC (permalink / raw)
  To: Brad Spengler
  Cc: Roland McGrath, Kees Cook, linux-kernel, oss-security, Al Viro,
	Andrew Morton, Oleg Nesterov, KOSAKI Motohiro, Neil Horman,
	linux-fsdevel, pageexec

Brad, Roland -

Thank you for your comments and your work on this.

On Mon, Aug 30, 2010 at 06:08:47PM -0400, Brad Spengler wrote:
> There shouldn't be any additional risk from adding the extra rescheds, 
> as copy_*_user can already sleep and be raced against via a number of 
> methods.

Agreed.

Alexander

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH] exec argument expansion can inappropriately trigger OOM-killer
@ 2010-08-31 11:53           ` Solar Designer
  0 siblings, 0 replies; 81+ messages in thread
From: Solar Designer @ 2010-08-31 11:53 UTC (permalink / raw)
  To: Brad Spengler
  Cc: Roland McGrath, Kees Cook, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	oss-security-ZwoEplunGu1jrUoiu81ncdBPR1lH4CV8, Al Viro,
	Andrew Morton, Oleg Nesterov, KOSAKI Motohiro, Neil Horman,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	pageexec-Y8qEzhMunLyT9ig0jae3mg

Brad, Roland -

Thank you for your comments and your work on this.

On Mon, Aug 30, 2010 at 06:08:47PM -0400, Brad Spengler wrote:
> There shouldn't be any additional risk from adding the extra rescheds, 
> as copy_*_user can already sleep and be raced against via a number of 
> methods.

Agreed.

Alexander

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH] exec argument expansion can inappropriately triggerOOM-killer
  2010-08-30 22:08         ` Brad Spengler
  (?)
  (?)
@ 2010-08-31 11:56         ` Tetsuo Handa
  -1 siblings, 0 replies; 81+ messages in thread
From: Tetsuo Handa @ 2010-08-31 11:56 UTC (permalink / raw)
  To: spender, solar
  Cc: roland, kees.cook, linux-kernel, oss-security, viro, akpm, oleg,
	kosaki.motohiro, nhorman, linux-fsdevel, pageexec

Brad Spengler wrote:
> The bug seems to have been introduced in 2.6.23, see:
> http://thread.gmane.org/gmane.linux.ports.hppa/752
> http://www.spinics.net/lists/linux-arch/msg01584.html
> http://www.mail-archive.com/linux-kernel@vger.kernel.org/msg170491.html
> though I'm guessing the functionality was also backported to major 
> distros
As far as I know, RHEL >= 5.3 and Asianux >= 3.2 backported this functionality.

^ permalink raw reply	[flat|nested] 81+ messages in thread

* [PATCH 0/3] execve argument-copying fixes
  2010-08-30 10:06     ` Roland McGrath
@ 2010-09-08  2:34         ` Roland McGrath
  2010-09-08  2:34         ` Roland McGrath
  1 sibling, 0 replies; 81+ messages in thread
From: Roland McGrath @ 2010-09-08  2:34 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton
  Cc: linux-kernel, oss-security, Solar Designer, Kees Cook, Al Viro,
	Andrew Morton, Oleg Nesterov, KOSAKI Motohiro, Neil Horman,
	linux-fsdevel, pageexec,
	Brad Spengler <spender@grsecurity.net> Eugene Teo

This is my take on parts of the execve large arguments copying issues
that Kees posted about, and Brad and others have been discussing.
I've only looked at the narrow area of the argument copying code
itself.  I think these are good and necessary fixes.  But I'm not
addressing the whole OOM killer/mm accounting issue, which also needs
to be fixed (and I have the impression others are already looking into that).

The following changes since commit d56557af19867edb8c0e96f8e26399698a08857f:

  Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jbarnes/pci-2.6 (2010-09-07 16:00:17 -0700)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/frob/linux-2.6-roland.git topic/exec-fixes

Roland McGrath (3):
      setup_arg_pages: diagnose excessive argument size
      execve: improve interactivity with large arguments
      execve: make responsive to SIGKILL with large arguments

 fs/exec.c |   14 ++++++++++++++
 1 files changed, 14 insertions(+), 0 deletions(-)

Thanks,
Roland

^ permalink raw reply	[flat|nested] 81+ messages in thread

* [PATCH 0/3] execve argument-copying fixes
@ 2010-09-08  2:34         ` Roland McGrath
  0 siblings, 0 replies; 81+ messages in thread
From: Roland McGrath @ 2010-09-08  2:34 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton
  Cc: linux-kernel, oss-security, Solar Designer, Kees Cook, Al Viro,
	Andrew Morton, Oleg Nesterov, KOSAKI Motohiro, Neil Horman,
	linux-fsdevel, pageexec,
	Brad Spengler <spender@grsecurity.net> Eugene Teo

This is my take on parts of the execve large arguments copying issues
that Kees posted about, and Brad and others have been discussing.
I've only looked at the narrow area of the argument copying code
itself.  I think these are good and necessary fixes.  But I'm not
addressing the whole OOM killer/mm accounting issue, which also needs
to be fixed (and I have the impression others are already looking into that).

The following changes since commit d56557af19867edb8c0e96f8e26399698a08857f:

  Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jbarnes/pci-2.6 (2010-09-07 16:00:17 -0700)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/frob/linux-2.6-roland.git topic/exec-fixes

Roland McGrath (3):
      setup_arg_pages: diagnose excessive argument size
      execve: improve interactivity with large arguments
      execve: make responsive to SIGKILL with large arguments

 fs/exec.c |   14 ++++++++++++++
 1 files changed, 14 insertions(+), 0 deletions(-)

Thanks,
Roland

^ permalink raw reply	[flat|nested] 81+ messages in thread

* [PATCH 1/3] setup_arg_pages: diagnose excessive argument size
  2010-09-08  2:34         ` Roland McGrath
@ 2010-09-08  2:35           ` Roland McGrath
  -1 siblings, 0 replies; 81+ messages in thread
From: Roland McGrath @ 2010-09-08  2:35 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton
  Cc: linux-kernel, oss-security, Solar Designer, Kees Cook, Al Viro,
	Andrew Morton, Oleg Nesterov, KOSAKI Motohiro, Neil Horman,
	linux-fsdevel, pageexec,
	Brad Spengler <spender@grsecurity.net> Eugene Teo

The CONFIG_STACK_GROWSDOWN variant of setup_arg_pages() does not
check the size of the argument/environment area on the stack.
When it is unworkably large, shift_arg_pages() hits its BUG_ON.
This is exploitable with a very large RLIMIT_STACK limit, to
create a crash pretty easily.

Check that the initial stack is not too large to make it possible
to map in any executable.  We're not checking that the actual
executable (or intepreter, for binfmt_elf) will fit.  So those
mappings might clobber part of the initial stack mapping.  But
that is just userland lossage that userland made happen, not a
kernel problem.

Signed-off-by: Roland McGrath <roland@redhat.com>
---
 fs/exec.c |    5 +++++
 1 files changed, 5 insertions(+), 0 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index 2d94552..1b63237 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -594,6 +594,11 @@ int setup_arg_pages(struct linux_binprm *bprm,
 #else
 	stack_top = arch_align_stack(stack_top);
 	stack_top = PAGE_ALIGN(stack_top);
+
+	if (unlikely(stack_top < mmap_min_addr) ||
+	    unlikely(vma->vm_end - vma->vm_start >= stack_top - mmap_min_addr))
+		return -ENOMEM;
+
 	stack_shift = vma->vm_end - stack_top;

 	bprm->p -= stack_shift;

^ permalink raw reply related	[flat|nested] 81+ messages in thread

* [PATCH 1/3] setup_arg_pages: diagnose excessive argument size
@ 2010-09-08  2:35           ` Roland McGrath
  0 siblings, 0 replies; 81+ messages in thread
From: Roland McGrath @ 2010-09-08  2:35 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton
  Cc: linux-kernel, oss-security, Solar Designer, Kees Cook, Al Viro,
	Andrew Morton, Oleg Nesterov, KOSAKI Motohiro, Neil Horman,
	linux-fsdevel, pageexec,
	Brad Spengler <spender@grsecurity.net> Eugene Teo

The CONFIG_STACK_GROWSDOWN variant of setup_arg_pages() does not
check the size of the argument/environment area on the stack.
When it is unworkably large, shift_arg_pages() hits its BUG_ON.
This is exploitable with a very large RLIMIT_STACK limit, to
create a crash pretty easily.

Check that the initial stack is not too large to make it possible
to map in any executable.  We're not checking that the actual
executable (or intepreter, for binfmt_elf) will fit.  So those
mappings might clobber part of the initial stack mapping.  But
that is just userland lossage that userland made happen, not a
kernel problem.

Signed-off-by: Roland McGrath <roland@redhat.com>
---
 fs/exec.c |    5 +++++
 1 files changed, 5 insertions(+), 0 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index 2d94552..1b63237 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -594,6 +594,11 @@ int setup_arg_pages(struct linux_binprm *bprm,
 #else
 	stack_top = arch_align_stack(stack_top);
 	stack_top = PAGE_ALIGN(stack_top);
+
+	if (unlikely(stack_top < mmap_min_addr) ||
+	    unlikely(vma->vm_end - vma->vm_start >= stack_top - mmap_min_addr))
+		return -ENOMEM;
+
 	stack_shift = vma->vm_end - stack_top;

 	bprm->p -= stack_shift;

^ permalink raw reply related	[flat|nested] 81+ messages in thread

* [PATCH 2/3] execve: improve interactivity with large arguments
  2010-09-08  2:34         ` Roland McGrath
@ 2010-09-08  2:36           ` Roland McGrath
  -1 siblings, 0 replies; 81+ messages in thread
From: Roland McGrath @ 2010-09-08  2:36 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton
  Cc: linux-kernel, oss-security, Solar Designer, Kees Cook, Al Viro,
	Andrew Morton, Oleg Nesterov, KOSAKI Motohiro, Neil Horman,
	linux-fsdevel, pageexec,
	Brad Spengler <spender@grsecurity.net> Eugene Teo

This adds a preemption point during the copying of the argument and
environment strings for execve, in copy_strings().  There is already
a preemption point in the count() loop, so this doesn't add any new
points in the abstract sense.

When the total argument+environment strings are very large, the time
spent copying them can be much more than a normal user time slice.
So this change improves the interactivity of the rest of the system
when one process is doing an execve with very large arguments.

Signed-off-by: Roland McGrath <roland@redhat.com>
---
 fs/exec.c |    2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index 1b63237..6f2d777 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -419,6 +419,8 @@ static int copy_strings(int argc, const char __user *const __user *argv,
 		while (len > 0) {
 			int offset, bytes_to_copy;

+			cond_resched();
+
 			offset = pos % PAGE_SIZE;
 			if (offset == 0)
 				offset = PAGE_SIZE;

^ permalink raw reply related	[flat|nested] 81+ messages in thread

* [PATCH 2/3] execve: improve interactivity with large arguments
@ 2010-09-08  2:36           ` Roland McGrath
  0 siblings, 0 replies; 81+ messages in thread
From: Roland McGrath @ 2010-09-08  2:36 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton
  Cc: linux-kernel, oss-security, Solar Designer, Kees Cook, Al Viro,
	Andrew Morton, Oleg Nesterov, KOSAKI Motohiro, Neil Horman,
	linux-fsdevel, pageexec,
	Brad Spengler <spender@grsecurity.net> Eugene Teo

This adds a preemption point during the copying of the argument and
environment strings for execve, in copy_strings().  There is already
a preemption point in the count() loop, so this doesn't add any new
points in the abstract sense.

When the total argument+environment strings are very large, the time
spent copying them can be much more than a normal user time slice.
So this change improves the interactivity of the rest of the system
when one process is doing an execve with very large arguments.

Signed-off-by: Roland McGrath <roland@redhat.com>
---
 fs/exec.c |    2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index 1b63237..6f2d777 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -419,6 +419,8 @@ static int copy_strings(int argc, const char __user *const __user *argv,
 		while (len > 0) {
 			int offset, bytes_to_copy;

+			cond_resched();
+
 			offset = pos % PAGE_SIZE;
 			if (offset == 0)
 				offset = PAGE_SIZE;

^ permalink raw reply related	[flat|nested] 81+ messages in thread

* [PATCH 3/3] execve: make responsive to SIGKILL with large arguments
  2010-09-08  2:34         ` Roland McGrath
@ 2010-09-08  2:37           ` Roland McGrath
  -1 siblings, 0 replies; 81+ messages in thread
From: Roland McGrath @ 2010-09-08  2:37 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton
  Cc: linux-kernel, oss-security, Solar Designer, Kees Cook, Al Viro,
	Andrew Morton, Oleg Nesterov, KOSAKI Motohiro, Neil Horman,
	linux-fsdevel, pageexec,
	Brad Spengler <spender@grsecurity.net> Eugene Teo

An execve with a very large total of argument/environment strings
can take a really long time in the execve system call.  It runs
uninterruptibly to count and copy all the strings.  This change
makes it abort the exec quickly if sent a SIGKILL.

Note that this is the conservative change, to interrupt only for
SIGKILL, by using fatal_signal_pending().  It would be perfectly
correct semantics to let any signal interrupt the string-copying in
execve, i.e. use signal_pending() instead of fatal_signal_pending().
We'll save that change for later, since it could have user-visible
consequences, such as having a timer set too quickly make it so that
an execve can never complete, though it always happened to work before.

Signed-off-by: Roland McGrath <roland@redhat.com>
---
 fs/exec.c |    7 +++++++
 1 files changed, 7 insertions(+), 0 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index 6f2d777..828dd24 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -376,6 +376,9 @@ static int count(const char __user * const __user * argv, int max)
 			argv++;
 			if (i++ >= max)
 				return -E2BIG;
+
+			if (fatal_signal_pending(current))
+				return -ERESTARTNOHAND;
 			cond_resched();
 		}
 	}
@@ -419,6 +422,10 @@ static int copy_strings(int argc, const char __user *const __user *argv,
 		while (len > 0) {
 			int offset, bytes_to_copy;

+			if (fatal_signal_pending(current)) {
+				ret = -ERESTARTNOHAND;
+				goto out;
+			}
 			cond_resched();

 			offset = pos % PAGE_SIZE;

^ permalink raw reply related	[flat|nested] 81+ messages in thread

* [PATCH 3/3] execve: make responsive to SIGKILL with large arguments
@ 2010-09-08  2:37           ` Roland McGrath
  0 siblings, 0 replies; 81+ messages in thread
From: Roland McGrath @ 2010-09-08  2:37 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton
  Cc: linux-kernel, oss-security, Solar Designer, Kees Cook, Al Viro,
	Andrew Morton, Oleg Nesterov, KOSAKI Motohiro, Neil Horman,
	linux-fsdevel, pageexec,
	Brad Spengler <spender@grsecurity.net> Eugene Teo

An execve with a very large total of argument/environment strings
can take a really long time in the execve system call.  It runs
uninterruptibly to count and copy all the strings.  This change
makes it abort the exec quickly if sent a SIGKILL.

Note that this is the conservative change, to interrupt only for
SIGKILL, by using fatal_signal_pending().  It would be perfectly
correct semantics to let any signal interrupt the string-copying in
execve, i.e. use signal_pending() instead of fatal_signal_pending().
We'll save that change for later, since it could have user-visible
consequences, such as having a timer set too quickly make it so that
an execve can never complete, though it always happened to work before.

Signed-off-by: Roland McGrath <roland@redhat.com>
---
 fs/exec.c |    7 +++++++
 1 files changed, 7 insertions(+), 0 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index 6f2d777..828dd24 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -376,6 +376,9 @@ static int count(const char __user * const __user * argv, int max)
 			argv++;
 			if (i++ >= max)
 				return -E2BIG;
+
+			if (fatal_signal_pending(current))
+				return -ERESTARTNOHAND;
 			cond_resched();
 		}
 	}
@@ -419,6 +422,10 @@ static int copy_strings(int argc, const char __user *const __user *argv,
 		while (len > 0) {
 			int offset, bytes_to_copy;

+			if (fatal_signal_pending(current)) {
+				ret = -ERESTARTNOHAND;
+				goto out;
+			}
 			cond_resched();

 			offset = pos % PAGE_SIZE;

^ permalink raw reply related	[flat|nested] 81+ messages in thread

* Re: [PATCH 0/3] execve argument-copying fixes
  2010-09-08  2:34         ` Roland McGrath
@ 2010-09-08  3:00           ` KOSAKI Motohiro
  -1 siblings, 0 replies; 81+ messages in thread
From: KOSAKI Motohiro @ 2010-09-08  3:00 UTC (permalink / raw)
  To: Roland McGrath
  Cc: kosaki.motohiro, Linus Torvalds, Andrew Morton, linux-kernel,
	oss-security, Solar Designer, Kees Cook, Al Viro, Oleg Nesterov,
	Neil Horman, linux-fsdevel, pageexec,
	Brad Spengler <spender@grsecurity.net> Eugene Teo

> This is my take on parts of the execve large arguments copying issues
> that Kees posted about, and Brad and others have been discussing.
> I've only looked at the narrow area of the argument copying code
> itself.  I think these are good and necessary fixes.  But I'm not
> addressing the whole OOM killer/mm accounting issue, which also needs
> to be fixed (and I have the impression others are already looking into that).
> 
> The following changes since commit d56557af19867edb8c0e96f8e26399698a08857f:
> 
>   Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jbarnes/pci-2.6 (2010-09-07 16:00:17 -0700)
> 
> are available in the git repository at:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/frob/linux-2.6-roland.git topic/exec-fixes
> 
> Roland McGrath (3):
>       setup_arg_pages: diagnose excessive argument size
>       execve: improve interactivity with large arguments
>       execve: make responsive to SIGKILL with large arguments
> 
>  fs/exec.c |   14 ++++++++++++++
>  1 files changed, 14 insertions(+), 0 deletions(-)


All of changes looks nice to me :)
Thanks.
	Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>





^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH 0/3] execve argument-copying fixes
@ 2010-09-08  3:00           ` KOSAKI Motohiro
  0 siblings, 0 replies; 81+ messages in thread
From: KOSAKI Motohiro @ 2010-09-08  3:00 UTC (permalink / raw)
  To: Roland McGrath
  Cc: kosaki.motohiro, Linus Torvalds, Andrew Morton, linux-kernel,
	oss-security, Solar Designer, Kees Cook, Al Viro, Oleg Nesterov,
	Neil Horman, linux-fsdevel, pageexec,
	Brad Spengler <spender@grsecurity.net> Eugene Teo

> This is my take on parts of the execve large arguments copying issues
> that Kees posted about, and Brad and others have been discussing.
> I've only looked at the narrow area of the argument copying code
> itself.  I think these are good and necessary fixes.  But I'm not
> addressing the whole OOM killer/mm accounting issue, which also needs
> to be fixed (and I have the impression others are already looking into that).
> 
> The following changes since commit d56557af19867edb8c0e96f8e26399698a08857f:
> 
>   Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jbarnes/pci-2.6 (2010-09-07 16:00:17 -0700)
> 
> are available in the git repository at:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/frob/linux-2.6-roland.git topic/exec-fixes
> 
> Roland McGrath (3):
>       setup_arg_pages: diagnose excessive argument size
>       execve: improve interactivity with large arguments
>       execve: make responsive to SIGKILL with large arguments
> 
>  fs/exec.c |   14 ++++++++++++++
>  1 files changed, 14 insertions(+), 0 deletions(-)


All of changes looks nice to me :)
Thanks.
	Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>





^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH 1/3] setup_arg_pages: diagnose excessive argument size
  2010-09-08  2:35           ` Roland McGrath
@ 2010-09-08  8:29             ` pageexec
  -1 siblings, 0 replies; 81+ messages in thread
From: pageexec @ 2010-09-08  8:29 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, Roland McGrath
  Cc: linux-kernel, oss-security, Solar Designer, Kees Cook, Al Viro,
	Andrew Morton, Oleg Nesterov, KOSAKI Motohiro, Neil Horman,
	linux-fsdevel,
	Brad Spengler <spender@grsecurity.net> Eugene Teo

On 7 Sep 2010 at 19:35, Roland McGrath wrote:

> Check that the initial stack is not too large to make it possible
> to map in any executable.  We're not checking that the actual
> executable (or intepreter, for binfmt_elf) will fit.  So those
> mappings might clobber part of the initial stack mapping.  But
> that is just userland lossage that userland made happen, not a
> kernel problem.
> 
> Signed-off-by: Roland McGrath <roland@redhat.com>
> ---
>  fs/exec.c |    5 +++++
>  1 files changed, 5 insertions(+), 0 deletions(-)
> 
> diff --git a/fs/exec.c b/fs/exec.c
> index 2d94552..1b63237 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -594,6 +594,11 @@ int setup_arg_pages(struct linux_binprm *bprm,
>  #else
>  	stack_top = arch_align_stack(stack_top);
>  	stack_top = PAGE_ALIGN(stack_top);
> +
> +	if (unlikely(stack_top < mmap_min_addr) ||

this could arguably elicit some warning, or even better, prevent the sysctl from
setting mmap_min_addr too high in the first place. or alternatively, just check
for

	mmap_min_addr > stack_top - (vma->vm_end - vma->vm_start)

which i think is more clear as to the intent of the check. and since the next
line already computes stack_shift as vma->vm_end - stack_top, you could do the
whole check afterwards as:

	mmap_min_addr > vma->vm_start - stack_shift

which is even simpler and more to the point (you want what is called new_start in
shift_arg_pages to not violate mmap_min_addr). now you just need the int overflow
check, say: vma->vm_start < stack_shift, so the whole thing would become:

	if (vma->vm_start < stack_shift || mmap_min_addr > vma->vm_start - stack_shift)
		return -EFAULT;

which looks even better if done in shift_arg_pages as i've done it in PaX:

-       BUG_ON(new_start > new_end);
+       if (new_start >= new_end || new_start < mmap_min_addr)
+               return -EFAULT;

 +	    unlikely(vma->vm_end - vma->vm_start >= stack_top - mmap_min_addr))

i think the = case is as valid as any other value close to it ;). also this code
is not a fast path, no need for unlikely i think.

> +		return -ENOMEM;

i'd say EFAULT is more consistent, ENOMEM is used for cases when the kernel
couldn't allocate physical memory for something, EFAULT is when there's some
issue with the address space itself (based on the reaction to find_vma or
expand_stack failures).

also what about the BUG_ON in shift_arg_pages? it could go now, right?

personally, i prefer to do this the way i did it in PaX: move up the
shift_arg_pages call a bit and turn the BUG_ON into a proper check.

>  	stack_shift = vma->vm_end - stack_top;
>  
>  	bprm->p -= stack_shift;
> 

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH 1/3] setup_arg_pages: diagnose excessive argument size
@ 2010-09-08  8:29             ` pageexec
  0 siblings, 0 replies; 81+ messages in thread
From: pageexec @ 2010-09-08  8:29 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, Roland McGrath
  Cc: linux-kernel, oss-security, Solar Designer, Kees Cook, Al Viro,
	Andrew Morton, Oleg Nesterov, KOSAKI Motohiro, Neil Horman,
	linux-fsdevel,
	Brad Spengler <spender@grsecurity.net> Eugene Teo

On 7 Sep 2010 at 19:35, Roland McGrath wrote:

> Check that the initial stack is not too large to make it possible
> to map in any executable.  We're not checking that the actual
> executable (or intepreter, for binfmt_elf) will fit.  So those
> mappings might clobber part of the initial stack mapping.  But
> that is just userland lossage that userland made happen, not a
> kernel problem.
> 
> Signed-off-by: Roland McGrath <roland@redhat.com>
> ---
>  fs/exec.c |    5 +++++
>  1 files changed, 5 insertions(+), 0 deletions(-)
> 
> diff --git a/fs/exec.c b/fs/exec.c
> index 2d94552..1b63237 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -594,6 +594,11 @@ int setup_arg_pages(struct linux_binprm *bprm,
>  #else
>  	stack_top = arch_align_stack(stack_top);
>  	stack_top = PAGE_ALIGN(stack_top);
> +
> +	if (unlikely(stack_top < mmap_min_addr) ||

this could arguably elicit some warning, or even better, prevent the sysctl from
setting mmap_min_addr too high in the first place. or alternatively, just check
for

	mmap_min_addr > stack_top - (vma->vm_end - vma->vm_start)

which i think is more clear as to the intent of the check. and since the next
line already computes stack_shift as vma->vm_end - stack_top, you could do the
whole check afterwards as:

	mmap_min_addr > vma->vm_start - stack_shift

which is even simpler and more to the point (you want what is called new_start in
shift_arg_pages to not violate mmap_min_addr). now you just need the int overflow
check, say: vma->vm_start < stack_shift, so the whole thing would become:

	if (vma->vm_start < stack_shift || mmap_min_addr > vma->vm_start - stack_shift)
		return -EFAULT;

which looks even better if done in shift_arg_pages as i've done it in PaX:

-       BUG_ON(new_start > new_end);
+       if (new_start >= new_end || new_start < mmap_min_addr)
+               return -EFAULT;

 +	    unlikely(vma->vm_end - vma->vm_start >= stack_top - mmap_min_addr))

i think the = case is as valid as any other value close to it ;). also this code
is not a fast path, no need for unlikely i think.

> +		return -ENOMEM;

i'd say EFAULT is more consistent, ENOMEM is used for cases when the kernel
couldn't allocate physical memory for something, EFAULT is when there's some
issue with the address space itself (based on the reaction to find_vma or
expand_stack failures).

also what about the BUG_ON in shift_arg_pages? it could go now, right?

personally, i prefer to do this the way i did it in PaX: move up the
shift_arg_pages call a bit and turn the BUG_ON into a proper check.

>  	stack_shift = vma->vm_end - stack_top;
>  
>  	bprm->p -= stack_shift;
> 

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH 1/3] setup_arg_pages: diagnose excessive argument size
  2010-09-08  2:35           ` Roland McGrath
@ 2010-09-08 11:57             ` Brad Spengler
  -1 siblings, 0 replies; 81+ messages in thread
From: Brad Spengler @ 2010-09-08 11:57 UTC (permalink / raw)
  To: Roland McGrath
  Cc: Linus Torvalds, Andrew Morton, linux-kernel, oss-security,
	Solar Designer, Kees Cook, Al Viro, Oleg Nesterov,
	KOSAKI Motohiro, Neil Horman, linux-fsdevel, pageexec,
	Brad Spengler <spender@grsecurity.net> Eugene Teo

[-- Attachment #1: Type: text/plain, Size: 2168 bytes --]

I still don't think this addresses the whole problem.  Without question,
the rlimit / 4 check is bogus.  If nobody agrees with the intent of that 
check, then it should be removed, but I think the better solution is to 
fix the check so that it matches its original intent: let the initial 
stack setup be up to 1/Xth of the min(rlimit, TASK_SIZE dependent upon 
personality), which allows space for additional stack setup in the ELF 
loader and then further growth once the process is live.  If that 
amount is overstepped, then the exec will return an error to the calling 
process instead of being terminated.

It might be useful to consult with the people who introduced/approved 
the check in the first place, as they seemed to have reasons for 
implementing it.

-Brad

On Tue, Sep 07, 2010 at 07:35:49PM -0700, Roland McGrath wrote:
> The CONFIG_STACK_GROWSDOWN variant of setup_arg_pages() does not
> check the size of the argument/environment area on the stack.
> When it is unworkably large, shift_arg_pages() hits its BUG_ON.
> This is exploitable with a very large RLIMIT_STACK limit, to
> create a crash pretty easily.
> 
> Check that the initial stack is not too large to make it possible
> to map in any executable.  We're not checking that the actual
> executable (or intepreter, for binfmt_elf) will fit.  So those
> mappings might clobber part of the initial stack mapping.  But
> that is just userland lossage that userland made happen, not a
> kernel problem.
> 
> Signed-off-by: Roland McGrath <roland@redhat.com>
> ---
>  fs/exec.c |    5 +++++
>  1 files changed, 5 insertions(+), 0 deletions(-)
> 
> diff --git a/fs/exec.c b/fs/exec.c
> index 2d94552..1b63237 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -594,6 +594,11 @@ int setup_arg_pages(struct linux_binprm *bprm,
>  #else
>  	stack_top = arch_align_stack(stack_top);
>  	stack_top = PAGE_ALIGN(stack_top);
> +
> +	if (unlikely(stack_top < mmap_min_addr) ||
> +	    unlikely(vma->vm_end - vma->vm_start >= stack_top - mmap_min_addr))
> +		return -ENOMEM;
> +
>  	stack_shift = vma->vm_end - stack_top;
>  
>  	bprm->p -= stack_shift;

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 197 bytes --]

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH 1/3] setup_arg_pages: diagnose excessive argument size
@ 2010-09-08 11:57             ` Brad Spengler
  0 siblings, 0 replies; 81+ messages in thread
From: Brad Spengler @ 2010-09-08 11:57 UTC (permalink / raw)
  To: Roland McGrath
  Cc: Linus Torvalds, Andrew Morton, linux-kernel, oss-security,
	Solar Designer, Kees Cook, Al Viro, Oleg Nesterov,
	KOSAKI Motohiro, Neil Horman, linux-fsdevel, pageexec,
	Brad Spengler <spender@grsecurity.net> Eugene Teo

[-- Attachment #1: Type: text/plain, Size: 2168 bytes --]

I still don't think this addresses the whole problem.  Without question,
the rlimit / 4 check is bogus.  If nobody agrees with the intent of that 
check, then it should be removed, but I think the better solution is to 
fix the check so that it matches its original intent: let the initial 
stack setup be up to 1/Xth of the min(rlimit, TASK_SIZE dependent upon 
personality), which allows space for additional stack setup in the ELF 
loader and then further growth once the process is live.  If that 
amount is overstepped, then the exec will return an error to the calling 
process instead of being terminated.

It might be useful to consult with the people who introduced/approved 
the check in the first place, as they seemed to have reasons for 
implementing it.

-Brad

On Tue, Sep 07, 2010 at 07:35:49PM -0700, Roland McGrath wrote:
> The CONFIG_STACK_GROWSDOWN variant of setup_arg_pages() does not
> check the size of the argument/environment area on the stack.
> When it is unworkably large, shift_arg_pages() hits its BUG_ON.
> This is exploitable with a very large RLIMIT_STACK limit, to
> create a crash pretty easily.
> 
> Check that the initial stack is not too large to make it possible
> to map in any executable.  We're not checking that the actual
> executable (or intepreter, for binfmt_elf) will fit.  So those
> mappings might clobber part of the initial stack mapping.  But
> that is just userland lossage that userland made happen, not a
> kernel problem.
> 
> Signed-off-by: Roland McGrath <roland@redhat.com>
> ---
>  fs/exec.c |    5 +++++
>  1 files changed, 5 insertions(+), 0 deletions(-)
> 
> diff --git a/fs/exec.c b/fs/exec.c
> index 2d94552..1b63237 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -594,6 +594,11 @@ int setup_arg_pages(struct linux_binprm *bprm,
>  #else
>  	stack_top = arch_align_stack(stack_top);
>  	stack_top = PAGE_ALIGN(stack_top);
> +
> +	if (unlikely(stack_top < mmap_min_addr) ||
> +	    unlikely(vma->vm_end - vma->vm_start >= stack_top - mmap_min_addr))
> +		return -ENOMEM;
> +
>  	stack_shift = vma->vm_end - stack_top;
>  
>  	bprm->p -= stack_shift;

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 197 bytes --]

^ permalink raw reply	[flat|nested] 81+ messages in thread

* [PATCH 0/2] execve memory exhaust of argument-copying fixes
  2010-09-08  2:34         ` Roland McGrath
@ 2010-09-09  5:01           ` KOSAKI Motohiro
  -1 siblings, 0 replies; 81+ messages in thread
From: KOSAKI Motohiro @ 2010-09-09  5:01 UTC (permalink / raw)
  To: Roland McGrath
  Cc: kosaki.motohiro, Linus Torvalds, Andrew Morton, linux-kernel,
	oss-security, Solar Designer, Kees Cook, Al Viro, Oleg Nesterov,
	Neil Horman, linux-fsdevel, pageexec,
	Brad Spengler <spender@grsecurity.net>,
	Eugene Teo, KAMEZAWA Hiroyuki

> This is my take on parts of the execve large arguments copying issues
> that Kees posted about, and Brad and others have been discussing.
> I've only looked at the narrow area of the argument copying code
> itself.  I think these are good and necessary fixes.  But I'm not
> addressing the whole OOM killer/mm accounting issue, which also needs
> to be fixed (and I have the impression others are already looking into that).

Now, we have two OOM-Killer/mm acounting problem.
 1) OOM-killer doesn't track nascent mm and It may kill innocent task
 2) When execve argument-copying, our __vm_enough_memory() doesn't
    protect any wrong plenty argument. then, execve() invoke OOM instead
    return failure value when larger argument than system memory.

The patch series addressed this two issue.




^ permalink raw reply	[flat|nested] 81+ messages in thread

* [PATCH 0/2] execve memory exhaust of argument-copying fixes
@ 2010-09-09  5:01           ` KOSAKI Motohiro
  0 siblings, 0 replies; 81+ messages in thread
From: KOSAKI Motohiro @ 2010-09-09  5:01 UTC (permalink / raw)
  To: Roland McGrath
  Cc: kosaki.motohiro, Linus Torvalds, Andrew Morton, linux-kernel,
	oss-security, Solar Designer, Kees Cook, Al Viro, Oleg Nesterov,
	Neil Horman, linux-fsdevel, pageexec,
	Brad Spengler <spender@grsecurity.net>,
	Eugene Teo, KAMEZAWA Hiroyuki

> This is my take on parts of the execve large arguments copying issues
> that Kees posted about, and Brad and others have been discussing.
> I've only looked at the narrow area of the argument copying code
> itself.  I think these are good and necessary fixes.  But I'm not
> addressing the whole OOM killer/mm accounting issue, which also needs
> to be fixed (and I have the impression others are already looking into that).

Now, we have two OOM-Killer/mm acounting problem.
 1) OOM-killer doesn't track nascent mm and It may kill innocent task
 2) When execve argument-copying, our __vm_enough_memory() doesn't
    protect any wrong plenty argument. then, execve() invoke OOM instead
    return failure value when larger argument than system memory.

The patch series addressed this two issue.




^ permalink raw reply	[flat|nested] 81+ messages in thread

* [PATCH 1/2] oom: don't ignore rss in nascent mm
  2010-09-09  5:01           ` KOSAKI Motohiro
@ 2010-09-09  5:03             ` KOSAKI Motohiro
  -1 siblings, 0 replies; 81+ messages in thread
From: KOSAKI Motohiro @ 2010-09-09  5:03 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: kosaki.motohiro, Roland McGrath, Linus Torvalds, Andrew Morton,
	linux-kernel, oss-security, Solar Designer, Kees Cook, Al Viro,
	Oleg Nesterov, Neil Horman, linux-fsdevel, pageexec,
	Brad Spengler <spender@grsecurity.net>,
	Eugene Teo, KAMEZAWA Hiroyuki

This patch was made on top "oom: remove totalpage normalization from oom_badness()" patch.

===============================
Execve() makes new mm struct and setup stack and push argv vector,
Unfortunately this nascent mm is not pointed any tasks, then
OOM-killer can't detect this memory usage. therefore OOM-killer
may kill incorrect task.

Thus, this patch added task->in_exec_mm member and track
nascent mm usage.

Cc: pageexec@freemail.hu
Cc: Roland McGrath <roland@redhat.com>
Cc: Solar Designer <solar@openwall.com>
Cc: Brad Spengler <spender@grsecurity.net>
Cc: Eugene Teo <eteo@redhat.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
 fs/compat.c             |    4 +++-
 fs/exec.c               |   14 +++++++++++++-
 include/linux/binfmts.h |    1 +
 include/linux/sched.h   |    1 +
 mm/oom_kill.c           |   37 +++++++++++++++++++++++++++++--------
 5 files changed, 47 insertions(+), 10 deletions(-)

diff --git a/fs/compat.c b/fs/compat.c
index 718c706..b631120 100644
--- a/fs/compat.c
+++ b/fs/compat.c
@@ -1567,8 +1567,10 @@ int compat_do_execve(char * filename,
 	return retval;
 
 out:
-	if (bprm->mm)
+	if (bprm->mm) {
+		set_exec_mm(NULL);
 		mmput(bprm->mm);
+	}
 
 out_file:
 	if (bprm->file) {
diff --git a/fs/exec.c b/fs/exec.c
index 2d94552..b41834c 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -347,6 +347,8 @@ int bprm_mm_init(struct linux_binprm *bprm)
 	if (err)
 		goto err;
 
+	set_exec_mm(bprm->mm);
+
 	return 0;
 
 err:
@@ -983,6 +985,7 @@ int flush_old_exec(struct linux_binprm * bprm)
 		goto out;
 
 	bprm->mm = NULL;		/* We're using it now */
+	set_exec_mm(NULL);
 
 	current->flags &= ~PF_RANDOMIZE;
 	flush_thread();
@@ -1314,6 +1317,13 @@ int search_binary_handler(struct linux_binprm *bprm,struct pt_regs *regs)
 
 EXPORT_SYMBOL(search_binary_handler);
 
+void set_exec_mm(struct mm_struct *mm)
+{
+	task_lock(current);
+	current->in_exec_mm = mm;
+	task_unlock(current);
+}
+
 /*
  * sys_execve() executes a new program.
  */
@@ -1402,8 +1412,10 @@ int do_execve(const char * filename,
 	return retval;
 
 out:
-	if (bprm->mm)
+	if (bprm->mm) {
+		set_exec_mm(NULL);
 		mmput (bprm->mm);
+	}
 
 out_file:
 	if (bprm->file) {
diff --git a/include/linux/binfmts.h b/include/linux/binfmts.h
index a065612..2fde1ba 100644
--- a/include/linux/binfmts.h
+++ b/include/linux/binfmts.h
@@ -133,6 +133,7 @@ extern void install_exec_creds(struct linux_binprm *bprm);
 extern void do_coredump(long signr, int exit_code, struct pt_regs *regs);
 extern void set_binfmt(struct linux_binfmt *new);
 extern void free_bprm(struct linux_binprm *);
+extern void set_exec_mm(struct mm_struct *mm);
 
 #endif /* __KERNEL__ */
 #endif /* _LINUX_BINFMTS_H */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 5e61d60..bb5bf3d 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1226,6 +1226,7 @@ struct task_struct {
 	int pdeath_signal;  /*  The signal sent when the parent dies  */
 	/* ??? */
 	unsigned int personality;
+	struct mm_struct *in_exec_mm;
 	unsigned did_exec:1;
 	unsigned in_execve:1;	/* Tell the LSMs that the process is doing an
 				 * execve */
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index c1beda0..7d38435 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -120,6 +120,33 @@ struct task_struct *find_lock_task_mm(struct task_struct *p)
 	return NULL;
 }
 
+/*
+ * The baseline for the badness score is the proportion of RAM that each
+ * task's rss and swap space use.
+ */
+static unsigned long oom_rss_swap_usage(struct task_struct *p)
+{
+	struct task_struct *t = p;
+	int mm_accounted = 0;
+	unsigned long points = 0;
+
+	do {
+		task_lock(t);
+		if (!mm_accounted && t->mm) {
+			points += get_mm_rss(t->mm);
+			points += get_mm_counter(t->mm, MM_SWAPENTS);
+			mm_accounted = 1;
+		}
+		if (t->in_exec_mm) {
+			points += get_mm_rss(t->in_exec_mm);
+			points += get_mm_counter(t->in_exec_mm, MM_SWAPENTS);
+		}
+		task_unlock(t);
+	} while_each_thread(p, t);
+
+	return points;
+}
+
 /* return true if the task is not adequate as candidate victim task. */
 static bool oom_unkillable_task(struct task_struct *p, struct mem_cgroup *mem,
 			   const nodemask_t *nodemask)
@@ -169,16 +196,10 @@ unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *mem,
 	if (p->flags & PF_OOM_ORIGIN)
 		return ULONG_MAX;
 
-	p = find_lock_task_mm(p);
-	if (!p)
+	points = oom_rss_swap_usage(p);
+	if (!points)
 		return 0;
 
-	/*
-	 * The baseline for the badness score is the proportion of RAM that each
-	 * task's rss and swap space use.
-	 */
-	points = (get_mm_rss(p->mm) + get_mm_counter(p->mm, MM_SWAPENTS));
-	task_unlock(p);
 
 	/*
 	 * Root processes get 3% bonus, just like the __vm_enough_memory()
-- 
1.6.5.2




^ permalink raw reply related	[flat|nested] 81+ messages in thread

* [PATCH 1/2] oom: don't ignore rss in nascent mm
@ 2010-09-09  5:03             ` KOSAKI Motohiro
  0 siblings, 0 replies; 81+ messages in thread
From: KOSAKI Motohiro @ 2010-09-09  5:03 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: kosaki.motohiro, Roland McGrath, Linus Torvalds, Andrew Morton,
	linux-kernel, oss-security, Solar Designer, Kees Cook, Al Viro,
	Oleg Nesterov, Neil Horman, linux-fsdevel, pageexec,
	Brad Spengler <spender@grsecurity.net>,
	Eugene Teo, KAMEZAWA Hiroyuki

This patch was made on top "oom: remove totalpage normalization from oom_badness()" patch.

===============================
Execve() makes new mm struct and setup stack and push argv vector,
Unfortunately this nascent mm is not pointed any tasks, then
OOM-killer can't detect this memory usage. therefore OOM-killer
may kill incorrect task.

Thus, this patch added task->in_exec_mm member and track
nascent mm usage.

Cc: pageexec@freemail.hu
Cc: Roland McGrath <roland@redhat.com>
Cc: Solar Designer <solar@openwall.com>
Cc: Brad Spengler <spender@grsecurity.net>
Cc: Eugene Teo <eteo@redhat.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
 fs/compat.c             |    4 +++-
 fs/exec.c               |   14 +++++++++++++-
 include/linux/binfmts.h |    1 +
 include/linux/sched.h   |    1 +
 mm/oom_kill.c           |   37 +++++++++++++++++++++++++++++--------
 5 files changed, 47 insertions(+), 10 deletions(-)

diff --git a/fs/compat.c b/fs/compat.c
index 718c706..b631120 100644
--- a/fs/compat.c
+++ b/fs/compat.c
@@ -1567,8 +1567,10 @@ int compat_do_execve(char * filename,
 	return retval;
 
 out:
-	if (bprm->mm)
+	if (bprm->mm) {
+		set_exec_mm(NULL);
 		mmput(bprm->mm);
+	}
 
 out_file:
 	if (bprm->file) {
diff --git a/fs/exec.c b/fs/exec.c
index 2d94552..b41834c 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -347,6 +347,8 @@ int bprm_mm_init(struct linux_binprm *bprm)
 	if (err)
 		goto err;
 
+	set_exec_mm(bprm->mm);
+
 	return 0;
 
 err:
@@ -983,6 +985,7 @@ int flush_old_exec(struct linux_binprm * bprm)
 		goto out;
 
 	bprm->mm = NULL;		/* We're using it now */
+	set_exec_mm(NULL);
 
 	current->flags &= ~PF_RANDOMIZE;
 	flush_thread();
@@ -1314,6 +1317,13 @@ int search_binary_handler(struct linux_binprm *bprm,struct pt_regs *regs)
 
 EXPORT_SYMBOL(search_binary_handler);
 
+void set_exec_mm(struct mm_struct *mm)
+{
+	task_lock(current);
+	current->in_exec_mm = mm;
+	task_unlock(current);
+}
+
 /*
  * sys_execve() executes a new program.
  */
@@ -1402,8 +1412,10 @@ int do_execve(const char * filename,
 	return retval;
 
 out:
-	if (bprm->mm)
+	if (bprm->mm) {
+		set_exec_mm(NULL);
 		mmput (bprm->mm);
+	}
 
 out_file:
 	if (bprm->file) {
diff --git a/include/linux/binfmts.h b/include/linux/binfmts.h
index a065612..2fde1ba 100644
--- a/include/linux/binfmts.h
+++ b/include/linux/binfmts.h
@@ -133,6 +133,7 @@ extern void install_exec_creds(struct linux_binprm *bprm);
 extern void do_coredump(long signr, int exit_code, struct pt_regs *regs);
 extern void set_binfmt(struct linux_binfmt *new);
 extern void free_bprm(struct linux_binprm *);
+extern void set_exec_mm(struct mm_struct *mm);
 
 #endif /* __KERNEL__ */
 #endif /* _LINUX_BINFMTS_H */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 5e61d60..bb5bf3d 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1226,6 +1226,7 @@ struct task_struct {
 	int pdeath_signal;  /*  The signal sent when the parent dies  */
 	/* ??? */
 	unsigned int personality;
+	struct mm_struct *in_exec_mm;
 	unsigned did_exec:1;
 	unsigned in_execve:1;	/* Tell the LSMs that the process is doing an
 				 * execve */
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index c1beda0..7d38435 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -120,6 +120,33 @@ struct task_struct *find_lock_task_mm(struct task_struct *p)
 	return NULL;
 }
 
+/*
+ * The baseline for the badness score is the proportion of RAM that each
+ * task's rss and swap space use.
+ */
+static unsigned long oom_rss_swap_usage(struct task_struct *p)
+{
+	struct task_struct *t = p;
+	int mm_accounted = 0;
+	unsigned long points = 0;
+
+	do {
+		task_lock(t);
+		if (!mm_accounted && t->mm) {
+			points += get_mm_rss(t->mm);
+			points += get_mm_counter(t->mm, MM_SWAPENTS);
+			mm_accounted = 1;
+		}
+		if (t->in_exec_mm) {
+			points += get_mm_rss(t->in_exec_mm);
+			points += get_mm_counter(t->in_exec_mm, MM_SWAPENTS);
+		}
+		task_unlock(t);
+	} while_each_thread(p, t);
+
+	return points;
+}
+
 /* return true if the task is not adequate as candidate victim task. */
 static bool oom_unkillable_task(struct task_struct *p, struct mem_cgroup *mem,
 			   const nodemask_t *nodemask)
@@ -169,16 +196,10 @@ unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *mem,
 	if (p->flags & PF_OOM_ORIGIN)
 		return ULONG_MAX;
 
-	p = find_lock_task_mm(p);
-	if (!p)
+	points = oom_rss_swap_usage(p);
+	if (!points)
 		return 0;
 
-	/*
-	 * The baseline for the badness score is the proportion of RAM that each
-	 * task's rss and swap space use.
-	 */
-	points = (get_mm_rss(p->mm) + get_mm_counter(p->mm, MM_SWAPENTS));
-	task_unlock(p);
 
 	/*
 	 * Root processes get 3% bonus, just like the __vm_enough_memory()
-- 
1.6.5.2




^ permalink raw reply related	[flat|nested] 81+ messages in thread

* [PATCH 2/2] execve: check the VM has enough memory at first
  2010-09-09  5:01           ` KOSAKI Motohiro
@ 2010-09-09  5:04             ` KOSAKI Motohiro
  -1 siblings, 0 replies; 81+ messages in thread
From: KOSAKI Motohiro @ 2010-09-09  5:04 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: kosaki.motohiro, Roland McGrath, Linus Torvalds, Andrew Morton,
	linux-kernel, oss-security, Solar Designer, Kees Cook, Al Viro,
	Oleg Nesterov, Neil Horman, linux-fsdevel, pageexec,
	Brad Spengler <spender@grsecurity.net>,
	Eugene Teo, KAMEZAWA Hiroyuki

Now, Argv size of execve are limited by STACK_LIMIT/4. In other
words, If we are setting 'ulimit -s unlimited', we've lost any
guard of argv size.

More unfortunately, current argv setup logic is bypassing the VM
overcommitment check unintentionally. because current overcommitment
check don't care gradually increased stack.
therefore, wrong argument of execve() easily makes OOM instead
execve() failure. that's bad.

After this patch, execve() expand stack at first and receive to
check vm_enough_memory() properly. then, too long argument of
execve() than the machine memory return EFAULT properly.

Note, almost all user are using OVERCOMMIT_GUESS overcommit mode
(because it's default). It provide only guess. It doesn't works
perfectly on some case. However usually execve() failure is better
than OOM-killer and swap-storm compbination.

Cc: pageexec@freemail.hu
Cc: Roland McGrath <roland@redhat.com>
Cc: Solar Designer <solar@openwall.com>
Cc: Brad Spengler <spender@grsecurity.net>
Cc: Eugene Teo <eteo@redhat.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
 fs/compat.c |   38 +++++++++++++++++++++++++++++++-------
 fs/exec.c   |   38 +++++++++++++++++++++++++++++++-------
 2 files changed, 62 insertions(+), 14 deletions(-)

diff --git a/fs/compat.c b/fs/compat.c
index b631120..ff32573 100644
--- a/fs/compat.c
+++ b/fs/compat.c
@@ -1394,10 +1394,39 @@ static int compat_copy_strings(int argc, compat_uptr_t __user *argv,
 	char *kaddr = NULL;
 	unsigned long kpos = 0;
 	int ret;
+	compat_uptr_t str;
+	int len;
+	int i;
+	unsigned long total_len = 0;
+
+	for (i = 0; i < argc; i++) {
+		ret = -EFAULT;
+		if (get_user(str, argv+i))
+			goto out;
+		len = strnlen_user(compat_ptr(str), MAX_ARG_STRLEN);
+		if (!len)
+			goto out;
+
+		ret = -E2BIG;
+		if (len > MAX_ARG_STRLEN)
+			goto out;
+
+		total_len += len;
+		if (total_len > bprm->p)
+			goto out;
+	}
+
+	/*
+	 * Firstly, we try to expand stack. It also invoke
+	 * security_vm_enough_memory() and get failure when we don't
+	 * have enough space. It help to avoid stack smashing by plenty argv.
+	 */
+	ret = get_user_pages(current, bprm->mm, bprm->p - total_len,
+			     1, 1, 1, NULL, NULL);
+	if (ret < 0)
+		goto out;
 
 	while (argc-- > 0) {
-		compat_uptr_t str;
-		int len;
 		unsigned long pos;
 
 		if (get_user(str, argv+argc) ||
@@ -1406,11 +1435,6 @@ static int compat_copy_strings(int argc, compat_uptr_t __user *argv,
 			goto out;
 		}
 
-		if (len > MAX_ARG_STRLEN) {
-			ret = -E2BIG;
-			goto out;
-		}
-
 		/* We're going to work our way backwords. */
 		pos = bprm->p;
 		str += len;
diff --git a/fs/exec.c b/fs/exec.c
index b41834c..ef8b9dc 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -396,10 +396,39 @@ static int copy_strings(int argc, const char __user *const __user *argv,
 	char *kaddr = NULL;
 	unsigned long kpos = 0;
 	int ret;
+	int i;
+	unsigned long total_len = 0;
+	const char __user *str;
+	int len;
+
+	for (i = 0; i < argc; i++) {
+		ret = -EFAULT;
+		if (get_user(str, argv+i))
+			goto out;
+		len = strnlen_user(str, MAX_ARG_STRLEN);
+		if (!len)
+			goto out;
+
+		ret = -E2BIG;
+		if (!valid_arg_len(bprm, len))
+			goto out;
+
+		total_len += len;
+		if (total_len > bprm->p)
+			goto out;
+	}
+
+	/*
+	 * Firstly, we try to expand stack. It also invoke
+	 * security_vm_enough_memory() and get failure when we don't
+	 * have enough space. It help to avoid stack smashing by plenty argv.
+	 */
+	ret = get_user_pages(current, bprm->mm, bprm->p - total_len,
+			     1, 1, 1, NULL, NULL);
+	if (ret < 0)
+		goto out;
 
 	while (argc-- > 0) {
-		const char __user *str;
-		int len;
 		unsigned long pos;
 
 		if (get_user(str, argv+argc) ||
@@ -408,11 +437,6 @@ static int copy_strings(int argc, const char __user *const __user *argv,
 			goto out;
 		}
 
-		if (!valid_arg_len(bprm, len)) {
-			ret = -E2BIG;
-			goto out;
-		}
-
 		/* We're going to work our way backwords. */
 		pos = bprm->p;
 		str += len;
-- 
1.6.5.2




^ permalink raw reply related	[flat|nested] 81+ messages in thread

* [PATCH 2/2] execve: check the VM has enough memory at first
@ 2010-09-09  5:04             ` KOSAKI Motohiro
  0 siblings, 0 replies; 81+ messages in thread
From: KOSAKI Motohiro @ 2010-09-09  5:04 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: kosaki.motohiro, Roland McGrath, Linus Torvalds, Andrew Morton,
	linux-kernel, oss-security, Solar Designer, Kees Cook, Al Viro,
	Oleg Nesterov, Neil Horman, linux-fsdevel, pageexec,
	Brad Spengler <spender@grsecurity.net>,
	Eugene Teo, KAMEZAWA Hiroyuki

Now, Argv size of execve are limited by STACK_LIMIT/4. In other
words, If we are setting 'ulimit -s unlimited', we've lost any
guard of argv size.

More unfortunately, current argv setup logic is bypassing the VM
overcommitment check unintentionally. because current overcommitment
check don't care gradually increased stack.
therefore, wrong argument of execve() easily makes OOM instead
execve() failure. that's bad.

After this patch, execve() expand stack at first and receive to
check vm_enough_memory() properly. then, too long argument of
execve() than the machine memory return EFAULT properly.

Note, almost all user are using OVERCOMMIT_GUESS overcommit mode
(because it's default). It provide only guess. It doesn't works
perfectly on some case. However usually execve() failure is better
than OOM-killer and swap-storm compbination.

Cc: pageexec@freemail.hu
Cc: Roland McGrath <roland@redhat.com>
Cc: Solar Designer <solar@openwall.com>
Cc: Brad Spengler <spender@grsecurity.net>
Cc: Eugene Teo <eteo@redhat.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
 fs/compat.c |   38 +++++++++++++++++++++++++++++++-------
 fs/exec.c   |   38 +++++++++++++++++++++++++++++++-------
 2 files changed, 62 insertions(+), 14 deletions(-)

diff --git a/fs/compat.c b/fs/compat.c
index b631120..ff32573 100644
--- a/fs/compat.c
+++ b/fs/compat.c
@@ -1394,10 +1394,39 @@ static int compat_copy_strings(int argc, compat_uptr_t __user *argv,
 	char *kaddr = NULL;
 	unsigned long kpos = 0;
 	int ret;
+	compat_uptr_t str;
+	int len;
+	int i;
+	unsigned long total_len = 0;
+
+	for (i = 0; i < argc; i++) {
+		ret = -EFAULT;
+		if (get_user(str, argv+i))
+			goto out;
+		len = strnlen_user(compat_ptr(str), MAX_ARG_STRLEN);
+		if (!len)
+			goto out;
+
+		ret = -E2BIG;
+		if (len > MAX_ARG_STRLEN)
+			goto out;
+
+		total_len += len;
+		if (total_len > bprm->p)
+			goto out;
+	}
+
+	/*
+	 * Firstly, we try to expand stack. It also invoke
+	 * security_vm_enough_memory() and get failure when we don't
+	 * have enough space. It help to avoid stack smashing by plenty argv.
+	 */
+	ret = get_user_pages(current, bprm->mm, bprm->p - total_len,
+			     1, 1, 1, NULL, NULL);
+	if (ret < 0)
+		goto out;
 
 	while (argc-- > 0) {
-		compat_uptr_t str;
-		int len;
 		unsigned long pos;
 
 		if (get_user(str, argv+argc) ||
@@ -1406,11 +1435,6 @@ static int compat_copy_strings(int argc, compat_uptr_t __user *argv,
 			goto out;
 		}
 
-		if (len > MAX_ARG_STRLEN) {
-			ret = -E2BIG;
-			goto out;
-		}
-
 		/* We're going to work our way backwords. */
 		pos = bprm->p;
 		str += len;
diff --git a/fs/exec.c b/fs/exec.c
index b41834c..ef8b9dc 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -396,10 +396,39 @@ static int copy_strings(int argc, const char __user *const __user *argv,
 	char *kaddr = NULL;
 	unsigned long kpos = 0;
 	int ret;
+	int i;
+	unsigned long total_len = 0;
+	const char __user *str;
+	int len;
+
+	for (i = 0; i < argc; i++) {
+		ret = -EFAULT;
+		if (get_user(str, argv+i))
+			goto out;
+		len = strnlen_user(str, MAX_ARG_STRLEN);
+		if (!len)
+			goto out;
+
+		ret = -E2BIG;
+		if (!valid_arg_len(bprm, len))
+			goto out;
+
+		total_len += len;
+		if (total_len > bprm->p)
+			goto out;
+	}
+
+	/*
+	 * Firstly, we try to expand stack. It also invoke
+	 * security_vm_enough_memory() and get failure when we don't
+	 * have enough space. It help to avoid stack smashing by plenty argv.
+	 */
+	ret = get_user_pages(current, bprm->mm, bprm->p - total_len,
+			     1, 1, 1, NULL, NULL);
+	if (ret < 0)
+		goto out;
 
 	while (argc-- > 0) {
-		const char __user *str;
-		int len;
 		unsigned long pos;
 
 		if (get_user(str, argv+argc) ||
@@ -408,11 +437,6 @@ static int copy_strings(int argc, const char __user *const __user *argv,
 			goto out;
 		}
 
-		if (!valid_arg_len(bprm, len)) {
-			ret = -E2BIG;
-			goto out;
-		}
-
 		/* We're going to work our way backwords. */
 		pos = bprm->p;
 		str += len;
-- 
1.6.5.2




^ permalink raw reply related	[flat|nested] 81+ messages in thread

* Re: [PATCH 1/3] setup_arg_pages: diagnose excessive argument size
  2010-09-08 11:57             ` Brad Spengler
@ 2010-09-09  5:31               ` KOSAKI Motohiro
  -1 siblings, 0 replies; 81+ messages in thread
From: KOSAKI Motohiro @ 2010-09-09  5:31 UTC (permalink / raw)
  To: Brad Spengler
  Cc: kosaki.motohiro, Roland McGrath, Linus Torvalds, Andrew Morton,
	linux-kernel, oss-security, Solar Designer, Kees Cook, Al Viro,
	Oleg Nesterov, Neil Horman, linux-fsdevel, pageexec,
	Brad Spengler <spender@grsecurity.net> Eugene Teo

> I still don't think this addresses the whole problem.  Without question,
> the rlimit / 4 check is bogus.  If nobody agrees with the intent of that 
> check, then it should be removed, but I think the better solution is to 
> fix the check so that it matches its original intent: let the initial 
> stack setup be up to 1/Xth of the min(rlimit, TASK_SIZE dependent upon 
> personality), which allows space for additional stack setup in the ELF 
> loader and then further growth once the process is live.  If that 
> amount is overstepped, then the exec will return an error to the calling 
> process instead of being terminated.
> 
> It might be useful to consult with the people who introduced/approved 
> the check in the first place, as they seemed to have reasons for 
> implementing it.

Brad, sorry, I have bad news. glibc sysconf(_SC_ARG_MAX) is implemented
by hard coded RLIMIT_STACK/4 heuristics. That said, at least _now_, we
can't change this even though you disliked. That said, we can't break
userland even though userland library is very crazy.

I don't dislike your "1/Xth of the min(rlimit, TASK_SIZE dependent upon 
> personality)" idea. however I think You and Roland haven't agreed this
point yet. he seems to want "unlimited" works as "unlimited". then, now
I don't make such patch. Instead, I would propose to insert 
__vm_enough_memory() check in execve() pass. It prevent almost argv attack.




^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH 1/3] setup_arg_pages: diagnose excessive argument size
@ 2010-09-09  5:31               ` KOSAKI Motohiro
  0 siblings, 0 replies; 81+ messages in thread
From: KOSAKI Motohiro @ 2010-09-09  5:31 UTC (permalink / raw)
  To: Brad Spengler
  Cc: kosaki.motohiro, Roland McGrath, Linus Torvalds, Andrew Morton,
	linux-kernel, oss-security, Solar Designer, Kees Cook, Al Viro,
	Oleg Nesterov, Neil Horman, linux-fsdevel, pageexec,
	Brad Spengler <spender@grsecurity.net> Eugene Teo

> I still don't think this addresses the whole problem.  Without question,
> the rlimit / 4 check is bogus.  If nobody agrees with the intent of that 
> check, then it should be removed, but I think the better solution is to 
> fix the check so that it matches its original intent: let the initial 
> stack setup be up to 1/Xth of the min(rlimit, TASK_SIZE dependent upon 
> personality), which allows space for additional stack setup in the ELF 
> loader and then further growth once the process is live.  If that 
> amount is overstepped, then the exec will return an error to the calling 
> process instead of being terminated.
> 
> It might be useful to consult with the people who introduced/approved 
> the check in the first place, as they seemed to have reasons for 
> implementing it.

Brad, sorry, I have bad news. glibc sysconf(_SC_ARG_MAX) is implemented
by hard coded RLIMIT_STACK/4 heuristics. That said, at least _now_, we
can't change this even though you disliked. That said, we can't break
userland even though userland library is very crazy.

I don't dislike your "1/Xth of the min(rlimit, TASK_SIZE dependent upon 
> personality)" idea. however I think You and Roland haven't agreed this
point yet. he seems to want "unlimited" works as "unlimited". then, now
I don't make such patch. Instead, I would propose to insert 
__vm_enough_memory() check in execve() pass. It prevent almost argv attack.




^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH 1/2] oom: don't ignore rss in nascent mm
  2010-09-09  5:03             ` KOSAKI Motohiro
@ 2010-09-09 22:05               ` Oleg Nesterov
  -1 siblings, 0 replies; 81+ messages in thread
From: Oleg Nesterov @ 2010-09-09 22:05 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Roland McGrath, Linus Torvalds, Andrew Morton, linux-kernel,
	oss-security, Solar Designer, Kees Cook, Al Viro, Neil Horman,
	linux-fsdevel, pageexec,
	Brad Spengler <spender@grsecurity.net>,
	Eugene Teo, KAMEZAWA Hiroyuki

On 09/09, KOSAKI Motohiro wrote:
>
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1226,6 +1226,7 @@ struct task_struct {
>  	int pdeath_signal;  /*  The signal sent when the parent dies  */
>  	/* ??? */
>  	unsigned int personality;
> +	struct mm_struct *in_exec_mm;

Oh, yes, this has to be per-thread (unlike ->mm, btw).

I wonder if it makes sense to move ->cred_guard_mutex from task_struct
to signal_struct and thus make multiple-threads-inside-exec impossible.
Only one thread can win anyway.

Oleg.


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH 1/2] oom: don't ignore rss in nascent mm
@ 2010-09-09 22:05               ` Oleg Nesterov
  0 siblings, 0 replies; 81+ messages in thread
From: Oleg Nesterov @ 2010-09-09 22:05 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Roland McGrath, Linus Torvalds, Andrew Morton, linux-kernel,
	oss-security, Solar Designer, Kees Cook, Al Viro, Neil Horman,
	linux-fsdevel, pageexec,
	Brad Spengler <spender@grsecurity.net>,
	Eugene Teo, KAMEZAWA Hiroyuki

On 09/09, KOSAKI Motohiro wrote:
>
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1226,6 +1226,7 @@ struct task_struct {
>  	int pdeath_signal;  /*  The signal sent when the parent dies  */
>  	/* ??? */
>  	unsigned int personality;
> +	struct mm_struct *in_exec_mm;

Oh, yes, this has to be per-thread (unlike ->mm, btw).

I wonder if it makes sense to move ->cred_guard_mutex from task_struct
to signal_struct and thus make multiple-threads-inside-exec impossible.
Only one thread can win anyway.

Oleg.

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH 1/3] setup_arg_pages: diagnose excessive argument size
@ 2010-09-10  8:59               ` Roland McGrath
  0 siblings, 0 replies; 81+ messages in thread
From: Roland McGrath @ 2010-09-10  8:59 UTC (permalink / raw)
  To: pageexec
  Cc: Linus Torvalds, Andrew Morton, linux-kernel, oss-security,
	Solar Designer, Kees Cook, Al Viro, Oleg Nesterov,
	KOSAKI Motohiro, Neil Horman, linux-fsdevel,
	Brad Spengler <spender@grsecurity.net> Eugene Teo

> > +	if (unlikely(stack_top < mmap_min_addr) ||
> 
> this could arguably elicit some warning, or even better, prevent the
> sysctl from setting mmap_min_addr too high in the first place.

This code has local checks to ensure that things don't fail worse later.
Those are good to have regardless.  If you'd like to add some constraints
on setting mmap_min_addr, that's certainly fine too.  But it's no reason
not to have this simple and clear check here.

> or alternatively, just check for
> 
> 	mmap_min_addr > stack_top - (vma->vm_end - vma->vm_start)

IMHO that is far less clear as to what the intent of the check is than what
I wrote.  It's especially subtle that this check allows overflow and then
you check for that later, rather than the formulation I gave where no
overflow is possible and it's immediately obvious why.

> which looks even better if done in shift_arg_pages as i've done it in PaX:

My change to setup_arg_pages is consistent with the existing
CONFIG_STACK_GROWSUP code.  IMHO simple fixes should go in first
and other restructuring of the code can be done later.

> > +		return -ENOMEM;
> 
> i'd say EFAULT is more consistent, ENOMEM is used for cases when the kernel
> couldn't allocate physical memory for something, EFAULT is when there's some
> issue with the address space itself (based on the reaction to find_vma or
> expand_stack failures).

I disagree.  IMHO, the only proper use of EFAULT is for the passing of a
bad pointer (including passing data that includes a bad pointer, etc.).
So execve really should only use EFAULT when the pointers passed in the
system call are bad.  I used ENOMEM purely for the reason that the
existing CONFIG_STACK_GROWSUP code I paralleled uses it.  But it
certainly does make more sense than EFAULT does.  ENOMEM is used for
cases where you couldn't get the address space you wanted (mmap), which
is a fairly close analogue to this case.

But the one error code that is actually correct for the case is E2BIG.
Obviously that's really the right thing for Argument list too long cases.

The choice of error code here is fairly academic.  Any failure of
setup_arg_pages always leads to sending ourselves a SIGKILL before
returning the error code to percolate back to execve.  So it's only the
error code seen in syscall tracing before the process dies.  It's not
"user-visible" in the usual sense (i.e. POSIX doesn't care what code we
use here).

> also what about the BUG_ON in shift_arg_pages? it could go now, right?

It is now impossible by the logic of the arithmetic, yes.  But it is
another local sanity check asserting the assumptions of the code in that
function.  So there is no reason to 

> personally, i prefer to do this the way i did it in PaX: move up the
> shift_arg_pages call a bit and turn the BUG_ON into a proper check.

I'm more comfortable with a fix that doesn't change any other aspect of the
behavior.  Probably moving the call to shift_arg_pages around is fine.  But
IMHO that belongs in a separate set of cleanup patches, and not in the
isolated fix for the BUG_ON hit.

Thanks,
Roland

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH 1/3] setup_arg_pages: diagnose excessive argument size
@ 2010-09-10  8:59               ` Roland McGrath
  0 siblings, 0 replies; 81+ messages in thread
From: Roland McGrath @ 2010-09-10  8:59 UTC (permalink / raw)
  To: pageexec-Y8qEzhMunLyT9ig0jae3mg
  Cc: Linus Torvalds, Andrew Morton,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	oss-security-ZwoEplunGu1jrUoiu81ncdBPR1lH4CV8, Solar Designer,
	Kees Cook, Al Viro, Oleg Nesterov, KOSAKI Motohiro, Neil Horman,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	Brad Spengler
	<spender-JNS0hek0TMl4qEwOxq4T+Q@public.gmane.org> Eugene
	Teo

> > +	if (unlikely(stack_top < mmap_min_addr) ||
> 
> this could arguably elicit some warning, or even better, prevent the
> sysctl from setting mmap_min_addr too high in the first place.

This code has local checks to ensure that things don't fail worse later.
Those are good to have regardless.  If you'd like to add some constraints
on setting mmap_min_addr, that's certainly fine too.  But it's no reason
not to have this simple and clear check here.

> or alternatively, just check for
> 
> 	mmap_min_addr > stack_top - (vma->vm_end - vma->vm_start)

IMHO that is far less clear as to what the intent of the check is than what
I wrote.  It's especially subtle that this check allows overflow and then
you check for that later, rather than the formulation I gave where no
overflow is possible and it's immediately obvious why.

> which looks even better if done in shift_arg_pages as i've done it in PaX:

My change to setup_arg_pages is consistent with the existing
CONFIG_STACK_GROWSUP code.  IMHO simple fixes should go in first
and other restructuring of the code can be done later.

> > +		return -ENOMEM;
> 
> i'd say EFAULT is more consistent, ENOMEM is used for cases when the kernel
> couldn't allocate physical memory for something, EFAULT is when there's some
> issue with the address space itself (based on the reaction to find_vma or
> expand_stack failures).

I disagree.  IMHO, the only proper use of EFAULT is for the passing of a
bad pointer (including passing data that includes a bad pointer, etc.).
So execve really should only use EFAULT when the pointers passed in the
system call are bad.  I used ENOMEM purely for the reason that the
existing CONFIG_STACK_GROWSUP code I paralleled uses it.  But it
certainly does make more sense than EFAULT does.  ENOMEM is used for
cases where you couldn't get the address space you wanted (mmap), which
is a fairly close analogue to this case.

But the one error code that is actually correct for the case is E2BIG.
Obviously that's really the right thing for Argument list too long cases.

The choice of error code here is fairly academic.  Any failure of
setup_arg_pages always leads to sending ourselves a SIGKILL before
returning the error code to percolate back to execve.  So it's only the
error code seen in syscall tracing before the process dies.  It's not
"user-visible" in the usual sense (i.e. POSIX doesn't care what code we
use here).

> also what about the BUG_ON in shift_arg_pages? it could go now, right?

It is now impossible by the logic of the arithmetic, yes.  But it is
another local sanity check asserting the assumptions of the code in that
function.  So there is no reason to 

> personally, i prefer to do this the way i did it in PaX: move up the
> shift_arg_pages call a bit and turn the BUG_ON into a proper check.

I'm more comfortable with a fix that doesn't change any other aspect of the
behavior.  Probably moving the call to shift_arg_pages around is fine.  But
IMHO that belongs in a separate set of cleanup patches, and not in the
isolated fix for the BUG_ON hit.

Thanks,
Roland

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH 1/3] setup_arg_pages: diagnose excessive argument size
@ 2010-09-10  9:18               ` Roland McGrath
  0 siblings, 0 replies; 81+ messages in thread
From: Roland McGrath @ 2010-09-10  9:18 UTC (permalink / raw)
  To: Brad Spengler
  Cc: Linus Torvalds, Andrew Morton, linux-kernel, oss-security,
	Solar Designer, Kees Cook, Al Viro, Oleg Nesterov,
	KOSAKI Motohiro, Neil Horman, linux-fsdevel, pageexec,
	Brad Spengler <spender@grsecurity.net> Eugene Teo

> I still don't think this addresses the whole problem.  

You're responding to one of three patches, and I said to begin with that
all together they only address part of the problem (not the OOM part).
So the import of this remark is somewhat mysterious.

> Without question, the rlimit / 4 check is bogus.  

I question that assertion.  For a non-RLIM_INFINITY limit, there is nothing
in particular wrong with it.  The kernel is free to pick its upper bound
for ARG_MAX by whatever method.  I don't see anything much to object to
about the rlimit/4 method.  It has no useful effect for RLIM_INFINITY and
IMHO should not try to impose any limit in that case.  But that's the only
thing I see a reason to change.

> If nobody agrees with the intent of that check, then it should be
> removed, but I think the better solution is to fix the check so that it
> matches its original intent: let the initial stack setup be up to 1/Xth
> of the min(rlimit, TASK_SIZE dependent upon personality), which allows
> space for additional stack setup in the ELF loader and then further
> growth once the process is live.  

I see no reason to suspect this was the "original intent".  It seems most
likely to me that the original intent was 1/4th the RLIMIT_STACK size, and
just nobody thought about what that meant when RLIMIT_STACK was RLIM_INFINITY.

> If that amount is overstepped, then the exec will return an error to the
> calling process instead of being terminated.

That's what happens now when RLIMIT_STACK is smaller, and that's what
people really care about.  What you suggest would require some more
significant changes to the exec code path, touching all the binfmt modules
(though probably only binfmt_elf matters).  

In the current structure of the code, the arch-dependent SET_PERSONALITY
macro used by {compat_,}binfmt_elf is the only place that knows what arch
bits to set for the new address space size.  This is itself destructive,
but also runs after flush_old_exec (the point of no return).  So you'd have
to reorganize things significantly, or add an entirely new arch macro tied
into struct binfmt somehow, or something like that.

> It might be useful to consult with the people who introduced/approved 
> the check in the first place, as they seemed to have reasons for 
> implementing it.

This was done in commit b6a2fea by Ollie Wild <aaw@google.com>:
	mm: variable length argument support
It was part of going from a fixed maximum to no fixed maximum.
The log includes:
    [a.p.zijlstra@chello.nl: limit stack size]
So perhaps it was Peter who devised the rlimit/4 idea.

Thanks,
Roland

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH 1/3] setup_arg_pages: diagnose excessive argument size
@ 2010-09-10  9:18               ` Roland McGrath
  0 siblings, 0 replies; 81+ messages in thread
From: Roland McGrath @ 2010-09-10  9:18 UTC (permalink / raw)
  To: Brad Spengler
  Cc: Linus Torvalds, Andrew Morton,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	oss-security-ZwoEplunGu1jrUoiu81ncdBPR1lH4CV8, Solar Designer,
	Kees Cook, Al Viro, Oleg Nesterov, KOSAKI Motohiro, Neil Horman,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	pageexec-Y8qEzhMunLyT9ig0jae3mg,
	Brad Spengler
	<spender-JNS0hek0TMl4qEwOxq4T+Q@public.gmane.org> Eugene
	Teo

> I still don't think this addresses the whole problem.  

You're responding to one of three patches, and I said to begin with that
all together they only address part of the problem (not the OOM part).
So the import of this remark is somewhat mysterious.

> Without question, the rlimit / 4 check is bogus.  

I question that assertion.  For a non-RLIM_INFINITY limit, there is nothing
in particular wrong with it.  The kernel is free to pick its upper bound
for ARG_MAX by whatever method.  I don't see anything much to object to
about the rlimit/4 method.  It has no useful effect for RLIM_INFINITY and
IMHO should not try to impose any limit in that case.  But that's the only
thing I see a reason to change.

> If nobody agrees with the intent of that check, then it should be
> removed, but I think the better solution is to fix the check so that it
> matches its original intent: let the initial stack setup be up to 1/Xth
> of the min(rlimit, TASK_SIZE dependent upon personality), which allows
> space for additional stack setup in the ELF loader and then further
> growth once the process is live.  

I see no reason to suspect this was the "original intent".  It seems most
likely to me that the original intent was 1/4th the RLIMIT_STACK size, and
just nobody thought about what that meant when RLIMIT_STACK was RLIM_INFINITY.

> If that amount is overstepped, then the exec will return an error to the
> calling process instead of being terminated.

That's what happens now when RLIMIT_STACK is smaller, and that's what
people really care about.  What you suggest would require some more
significant changes to the exec code path, touching all the binfmt modules
(though probably only binfmt_elf matters).  

In the current structure of the code, the arch-dependent SET_PERSONALITY
macro used by {compat_,}binfmt_elf is the only place that knows what arch
bits to set for the new address space size.  This is itself destructive,
but also runs after flush_old_exec (the point of no return).  So you'd have
to reorganize things significantly, or add an entirely new arch macro tied
into struct binfmt somehow, or something like that.

> It might be useful to consult with the people who introduced/approved 
> the check in the first place, as they seemed to have reasons for 
> implementing it.

This was done in commit b6a2fea by Ollie Wild <aaw-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>:
	mm: variable length argument support
It was part of going from a fixed maximum to no fixed maximum.
The log includes:
    [a.p.zijlstra-/NLkJaSkS4VmR6Xm/wNWPw@public.gmane.org: limit stack size]
So perhaps it was Peter who devised the rlimit/4 idea.

Thanks,
Roland

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH 1/3] setup_arg_pages: diagnose excessive argument size
@ 2010-09-10  9:25                 ` Roland McGrath
  0 siblings, 0 replies; 81+ messages in thread
From: Roland McGrath @ 2010-09-10  9:25 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Brad Spengler, Linus Torvalds, Andrew Morton, linux-kernel,
	oss-security, Solar Designer, Kees Cook, Al Viro, Oleg Nesterov,
	Neil Horman, linux-fsdevel, pageexec, Brad Spengler, Eugene Teo

> Brad, sorry, I have bad news. glibc sysconf(_SC_ARG_MAX) is implemented
> by hard coded RLIMIT_STACK/4 heuristics. That said, at least _now_, we
> can't change this even though you disliked. That said, we can't break
> userland even though userland library is very crazy.

I'm sorry you think it's "very crazy" to implement the required
functionality in the only way available.  POSIX requires that execve
fail with E2BIG when the ARG_MAX limit is exceeded.  sysconf has to
return the correct actual limit that execve will enforce so that a
conforming application knows how much it can safely attempt to use.
Since the kernel uses the hard-coded RLIMIT_STACK/4 heuristic and does
not expose the true manifest limit any other way, sysconf has to
parallel the kernel's calculation.

Thanks,
Roland

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH 1/3] setup_arg_pages: diagnose excessive argument size
@ 2010-09-10  9:25                 ` Roland McGrath
  0 siblings, 0 replies; 81+ messages in thread
From: Roland McGrath @ 2010-09-10  9:25 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Brad Spengler, Linus Torvalds, Andrew Morton,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	oss-security-ZwoEplunGu1jrUoiu81ncdBPR1lH4CV8, Solar Designer,
	Kees Cook, Al Viro, Oleg Nesterov, Neil Horman,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	pageexec-Y8qEzhMunLyT9ig0jae3mg, Brad Spengler, Eugene Teo

> Brad, sorry, I have bad news. glibc sysconf(_SC_ARG_MAX) is implemented
> by hard coded RLIMIT_STACK/4 heuristics. That said, at least _now_, we
> can't change this even though you disliked. That said, we can't break
> userland even though userland library is very crazy.

I'm sorry you think it's "very crazy" to implement the required
functionality in the only way available.  POSIX requires that execve
fail with E2BIG when the ARG_MAX limit is exceeded.  sysconf has to
return the correct actual limit that execve will enforce so that a
conforming application knows how much it can safely attempt to use.
Since the kernel uses the hard-coded RLIMIT_STACK/4 heuristic and does
not expose the true manifest limit any other way, sysconf has to
parallel the kernel's calculation.

Thanks,
Roland

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH 1/2] oom: don't ignore rss in nascent mm
@ 2010-09-10  9:39                 ` Roland McGrath
  0 siblings, 0 replies; 81+ messages in thread
From: Roland McGrath @ 2010-09-10  9:39 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: KOSAKI Motohiro, Linus Torvalds, Andrew Morton, linux-kernel,
	oss-security, Solar Designer, Kees Cook, Al Viro, Neil Horman,
	linux-fsdevel, pageexec, Brad Spengler, Eugene Teo,
	KAMEZAWA Hiroyuki

> I wonder if it makes sense to move ->cred_guard_mutex from task_struct
> to signal_struct and thus make multiple-threads-inside-exec impossible.
> Only one thread can win anyway.

That probably makes sense.  Note that cred_guard_mutex is also overloaded
for ptrace_attach, so this would add some more serialization of attaches to
threads in the same group.  But as long as actual attachment serializes on
tasklist_lock anyway, it doesn't make a material difference.  (Even without
that, it would presumably be the same debugger attaching serially to
threads in the same group, so it wouldn't degrade anything in practice.)

Thanks,
Roland

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH 1/2] oom: don't ignore rss in nascent mm
@ 2010-09-10  9:39                 ` Roland McGrath
  0 siblings, 0 replies; 81+ messages in thread
From: Roland McGrath @ 2010-09-10  9:39 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: KOSAKI Motohiro, Linus Torvalds, Andrew Morton,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	oss-security-ZwoEplunGu1jrUoiu81ncdBPR1lH4CV8, Solar Designer,
	Kees Cook, Al Viro, Neil Horman,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	pageexec-Y8qEzhMunLyT9ig0jae3mg, Brad Spengler, Eugene Teo,
	KAMEZAWA Hiroyuki

> I wonder if it makes sense to move ->cred_guard_mutex from task_struct
> to signal_struct and thus make multiple-threads-inside-exec impossible.
> Only one thread can win anyway.

That probably makes sense.  Note that cred_guard_mutex is also overloaded
for ptrace_attach, so this would add some more serialization of attaches to
threads in the same group.  But as long as actual attachment serializes on
tasklist_lock anyway, it doesn't make a material difference.  (Even without
that, it would presumably be the same debugger attaching serially to
threads in the same group, so it wouldn't degrade anything in practice.)

Thanks,
Roland

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH 1/3] setup_arg_pages: diagnose excessive argument size
  2010-09-10  9:25                 ` Roland McGrath
  (?)
@ 2010-09-10  9:43                 ` KOSAKI Motohiro
  -1 siblings, 0 replies; 81+ messages in thread
From: KOSAKI Motohiro @ 2010-09-10  9:43 UTC (permalink / raw)
  To: Roland McGrath
  Cc: kosaki.motohiro, Brad Spengler, Linus Torvalds, Andrew Morton,
	linux-kernel, oss-security, Solar Designer, Kees Cook, Al Viro,
	Oleg Nesterov, Neil Horman, linux-fsdevel, pageexec, Eugene Teo

> > Brad, sorry, I have bad news. glibc sysconf(_SC_ARG_MAX) is implemented
> > by hard coded RLIMIT_STACK/4 heuristics. That said, at least _now_, we
> > can't change this even though you disliked. That said, we can't break
> > userland even though userland library is very crazy.
> 
> I'm sorry you think it's "very crazy" to implement the required
> functionality in the only way available.  POSIX requires that execve
> fail with E2BIG when the ARG_MAX limit is exceeded.  sysconf has to
> return the correct actual limit that execve will enforce so that a
> conforming application knows how much it can safely attempt to use.
> Since the kernel uses the hard-coded RLIMIT_STACK/4 heuristic and does
> not expose the true manifest limit any other way, sysconf has to
> parallel the kernel's calculation.

Hmm...
Probably my poor english leaded to misunderstood. I didn't intent glibc
is very crazy. I only intended to "even if userland is crazy, I disagree
to break userland".

And yes, we obviously need to expose ARG_MAX limit to libc. a duplicated
heuristic code easily makes confusion and mistake. nobody want such 
fragile state. however, it's a bit offtopic. anyway.


Thanks.


^ permalink raw reply	[flat|nested] 81+ messages in thread

* [PATCH] move cred_guard_mutex from task_struct to signal_struct
  2010-09-09 22:05               ` Oleg Nesterov
@ 2010-09-10  9:57                 ` KOSAKI Motohiro
  -1 siblings, 0 replies; 81+ messages in thread
From: KOSAKI Motohiro @ 2010-09-10  9:57 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: kosaki.motohiro, Roland McGrath, Linus Torvalds, Andrew Morton,
	linux-kernel, oss-security, Solar Designer, Kees Cook, Al Viro,
	Neil Horman, linux-fsdevel, pageexec,
	Brad Spengler <spender@grsecurity.net>,
	Eugene Teo, KAMEZAWA Hiroyuki

> On 09/09, KOSAKI Motohiro wrote:
> >
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -1226,6 +1226,7 @@ struct task_struct {
> >  	int pdeath_signal;  /*  The signal sent when the parent dies  */
> >  	/* ??? */
> >  	unsigned int personality;
> > +	struct mm_struct *in_exec_mm;
> 
> Oh, yes, this has to be per-thread (unlike ->mm, btw).
> 
> I wonder if it makes sense to move ->cred_guard_mutex from task_struct
> to signal_struct and thus make multiple-threads-inside-exec impossible.
> Only one thread can win anyway.

Interesting idea, really.
I've thought

1) moving cread_guard_mutex itself
   - no increase execve overhead
	-> very good
   - it also prevent parallel ptrace
	-> probably no problem. nobody want parallel debugging
2) move in_exec_mm to signal_struct too
   -> very hard. oom-killer can use very few lock because it's called
      from various place. now both ->mm and ->in_exec_mm are protected
      task_lock() and it help to avoid messy.


So, I've done only (1). I think this restriction effectivly prevent 
some theorical execve() resouce smashing attack.


======================================================================
Subject: [PATCH] move cred_guard_mutex from task_struct to signal_struct

Oleg Nesterov pointed out we have to prevent multiple-threads-inside-exec
itself and we can reuse ->cred_guard_mutex for it. Yes, concurrent execve()
has no worth.

Let's move ->cred_guard_mutex from task_struct to signal_struct. It
naturally prevent multiple-threads-inside-exec.

Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
 fs/exec.c                 |    8 ++++----
 fs/proc/base.c            |    8 ++++----
 include/linux/init_task.h |    4 ++--
 include/linux/sched.h     |    7 ++++---
 kernel/cred.c             |    2 --
 kernel/fork.c             |    2 ++
 kernel/ptrace.c           |    4 ++--
 7 files changed, 18 insertions(+), 17 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index 81d0d06..052ed54 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1091,14 +1091,14 @@ EXPORT_SYMBOL(setup_new_exec);
  */
 int prepare_bprm_creds(struct linux_binprm *bprm)
 {
-	if (mutex_lock_interruptible(&current->cred_guard_mutex))
+	if (mutex_lock_interruptible(&current->signal->cred_guard_mutex))
 		return -ERESTARTNOINTR;
 
 	bprm->cred = prepare_exec_creds();
 	if (likely(bprm->cred))
 		return 0;
 
-	mutex_unlock(&current->cred_guard_mutex);
+	mutex_unlock(&current->signal->cred_guard_mutex);
 	return -ENOMEM;
 }
 
@@ -1106,7 +1106,7 @@ void free_bprm(struct linux_binprm *bprm)
 {
 	free_arg_pages(bprm);
 	if (bprm->cred) {
-		mutex_unlock(&current->cred_guard_mutex);
+		mutex_unlock(&current->signal->cred_guard_mutex);
 		abort_creds(bprm->cred);
 	}
 	kfree(bprm);
@@ -1127,7 +1127,7 @@ void install_exec_creds(struct linux_binprm *bprm)
 	 * credentials; any time after this it may be unlocked.
 	 */
 	security_bprm_committed_creds(bprm);
-	mutex_unlock(&current->cred_guard_mutex);
+	mutex_unlock(&current->signal->cred_guard_mutex);
 }
 EXPORT_SYMBOL(install_exec_creds);
 
diff --git a/fs/proc/base.c b/fs/proc/base.c
index 55a16f2..fd97c8f 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -226,7 +226,7 @@ struct mm_struct *mm_for_maps(struct task_struct *task)
 {
 	struct mm_struct *mm;
 
-	if (mutex_lock_killable(&task->cred_guard_mutex))
+	if (mutex_lock_killable(&task->signal->cred_guard_mutex))
 		return NULL;
 
 	mm = get_task_mm(task);
@@ -235,7 +235,7 @@ struct mm_struct *mm_for_maps(struct task_struct *task)
 		mmput(mm);
 		mm = NULL;
 	}
-	mutex_unlock(&task->cred_guard_mutex);
+	mutex_unlock(&task->signal->cred_guard_mutex);
 
 	return mm;
 }
@@ -2273,14 +2273,14 @@ static ssize_t proc_pid_attr_write(struct file * file, const char __user * buf,
 		goto out_free;
 
 	/* Guard against adverse ptrace interaction */
-	length = mutex_lock_interruptible(&task->cred_guard_mutex);
+	length = mutex_lock_interruptible(&task->signal->cred_guard_mutex);
 	if (length < 0)
 		goto out_free;
 
 	length = security_setprocattr(task,
 				      (char*)file->f_path.dentry->d_name.name,
 				      (void*)page, count);
-	mutex_unlock(&task->cred_guard_mutex);
+	mutex_unlock(&task->signal->cred_guard_mutex);
 out_free:
 	free_page((unsigned long) page);
 out:
diff --git a/include/linux/init_task.h b/include/linux/init_task.h
index 1f43fa5..ff3cc33 100644
--- a/include/linux/init_task.h
+++ b/include/linux/init_task.h
@@ -29,6 +29,8 @@ extern struct fs_struct init_fs;
 		.running = 0,						\
 		.lock = __SPIN_LOCK_UNLOCKED(sig.cputimer.lock),	\
 	},								\
+	.cred_guard_mutex =						\
+		 __MUTEX_INITIALIZER(sig.cred_guard_mutex),		\
 }
 
 extern struct nsproxy init_nsproxy;
@@ -139,8 +141,6 @@ extern struct cred init_cred;
 	.group_leader	= &tsk,						\
 	.real_cred	= &init_cred,					\
 	.cred		= &init_cred,					\
-	.cred_guard_mutex =						\
-		 __MUTEX_INITIALIZER(tsk.cred_guard_mutex),		\
 	.comm		= "swapper",					\
 	.thread		= INIT_THREAD,					\
 	.fs		= &init_fs,					\
diff --git a/include/linux/sched.h b/include/linux/sched.h
index bb5bf3d..a7e7c2a 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -623,6 +623,10 @@ struct signal_struct {
 
 	int oom_adj;		/* OOM kill score adjustment (bit shift) */
 	long oom_score_adj;	/* OOM kill score adjustment */
+
+	struct mutex cred_guard_mutex;	/* guard against foreign influences on
+					 * credential calculations
+					 * (notably. ptrace) */
 };
 
 /* Context switch must be unlocked if interrupts are to be enabled */
@@ -1293,9 +1297,6 @@ struct task_struct {
 					 * credentials (COW) */
 	const struct cred *cred;	/* effective (overridable) subjective task
 					 * credentials (COW) */
-	struct mutex cred_guard_mutex;	/* guard against foreign influences on
-					 * credential calculations
-					 * (notably. ptrace) */
 	struct cred *replacement_session_keyring; /* for KEYCTL_SESSION_TO_PARENT */
 
 	char comm[TASK_COMM_LEN]; /* executable name excluding path
diff --git a/kernel/cred.c b/kernel/cred.c
index 9a3e226..a81833d 100644
--- a/kernel/cred.c
+++ b/kernel/cred.c
@@ -384,8 +384,6 @@ int copy_creds(struct task_struct *p, unsigned long clone_flags)
 	struct cred *new;
 	int ret;
 
-	mutex_init(&p->cred_guard_mutex);
-
 	if (
 #ifdef CONFIG_KEYS
 		!p->cred->thread_keyring &&
diff --git a/kernel/fork.c b/kernel/fork.c
index b7e9d60..4c0d3ea 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -904,6 +904,8 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
 	sig->oom_adj = current->signal->oom_adj;
 	sig->oom_score_adj = current->signal->oom_score_adj;
 
+	mutex_init(&sig->cred_guard_mutex);
+
 	return 0;
 }
 
diff --git a/kernel/ptrace.c b/kernel/ptrace.c
index f34d798..ac5013a 100644
--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -181,7 +181,7 @@ int ptrace_attach(struct task_struct *task)
 	 * under ptrace.
 	 */
 	retval = -ERESTARTNOINTR;
-	if (mutex_lock_interruptible(&task->cred_guard_mutex))
+	if (mutex_lock_interruptible(&task->signal->cred_guard_mutex))
 		goto out;
 
 	task_lock(task);
@@ -208,7 +208,7 @@ int ptrace_attach(struct task_struct *task)
 unlock_tasklist:
 	write_unlock_irq(&tasklist_lock);
 unlock_creds:
-	mutex_unlock(&task->cred_guard_mutex);
+	mutex_unlock(&task->signal->cred_guard_mutex);
 out:
 	return retval;
 }
-- 
1.6.5.2




^ permalink raw reply related	[flat|nested] 81+ messages in thread

* [PATCH] move cred_guard_mutex from task_struct to signal_struct
@ 2010-09-10  9:57                 ` KOSAKI Motohiro
  0 siblings, 0 replies; 81+ messages in thread
From: KOSAKI Motohiro @ 2010-09-10  9:57 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: kosaki.motohiro, Roland McGrath, Linus Torvalds, Andrew Morton,
	linux-kernel, oss-security, Solar Designer, Kees Cook, Al Viro,
	Neil Horman, linux-fsdevel, pageexec,
	Brad Spengler <spender@grsecurity.net>,
	Eugene Teo, KAMEZAWA Hiroyuki

> On 09/09, KOSAKI Motohiro wrote:
> >
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -1226,6 +1226,7 @@ struct task_struct {
> >  	int pdeath_signal;  /*  The signal sent when the parent dies  */
> >  	/* ??? */
> >  	unsigned int personality;
> > +	struct mm_struct *in_exec_mm;
> 
> Oh, yes, this has to be per-thread (unlike ->mm, btw).
> 
> I wonder if it makes sense to move ->cred_guard_mutex from task_struct
> to signal_struct and thus make multiple-threads-inside-exec impossible.
> Only one thread can win anyway.

Interesting idea, really.
I've thought

1) moving cread_guard_mutex itself
   - no increase execve overhead
	-> very good
   - it also prevent parallel ptrace
	-> probably no problem. nobody want parallel debugging
2) move in_exec_mm to signal_struct too
   -> very hard. oom-killer can use very few lock because it's called
      from various place. now both ->mm and ->in_exec_mm are protected
      task_lock() and it help to avoid messy.


So, I've done only (1). I think this restriction effectivly prevent 
some theorical execve() resouce smashing attack.


======================================================================
Subject: [PATCH] move cred_guard_mutex from task_struct to signal_struct

Oleg Nesterov pointed out we have to prevent multiple-threads-inside-exec
itself and we can reuse ->cred_guard_mutex for it. Yes, concurrent execve()
has no worth.

Let's move ->cred_guard_mutex from task_struct to signal_struct. It
naturally prevent multiple-threads-inside-exec.

Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
 fs/exec.c                 |    8 ++++----
 fs/proc/base.c            |    8 ++++----
 include/linux/init_task.h |    4 ++--
 include/linux/sched.h     |    7 ++++---
 kernel/cred.c             |    2 --
 kernel/fork.c             |    2 ++
 kernel/ptrace.c           |    4 ++--
 7 files changed, 18 insertions(+), 17 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index 81d0d06..052ed54 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1091,14 +1091,14 @@ EXPORT_SYMBOL(setup_new_exec);
  */
 int prepare_bprm_creds(struct linux_binprm *bprm)
 {
-	if (mutex_lock_interruptible(&current->cred_guard_mutex))
+	if (mutex_lock_interruptible(&current->signal->cred_guard_mutex))
 		return -ERESTARTNOINTR;
 
 	bprm->cred = prepare_exec_creds();
 	if (likely(bprm->cred))
 		return 0;
 
-	mutex_unlock(&current->cred_guard_mutex);
+	mutex_unlock(&current->signal->cred_guard_mutex);
 	return -ENOMEM;
 }
 
@@ -1106,7 +1106,7 @@ void free_bprm(struct linux_binprm *bprm)
 {
 	free_arg_pages(bprm);
 	if (bprm->cred) {
-		mutex_unlock(&current->cred_guard_mutex);
+		mutex_unlock(&current->signal->cred_guard_mutex);
 		abort_creds(bprm->cred);
 	}
 	kfree(bprm);
@@ -1127,7 +1127,7 @@ void install_exec_creds(struct linux_binprm *bprm)
 	 * credentials; any time after this it may be unlocked.
 	 */
 	security_bprm_committed_creds(bprm);
-	mutex_unlock(&current->cred_guard_mutex);
+	mutex_unlock(&current->signal->cred_guard_mutex);
 }
 EXPORT_SYMBOL(install_exec_creds);
 
diff --git a/fs/proc/base.c b/fs/proc/base.c
index 55a16f2..fd97c8f 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -226,7 +226,7 @@ struct mm_struct *mm_for_maps(struct task_struct *task)
 {
 	struct mm_struct *mm;
 
-	if (mutex_lock_killable(&task->cred_guard_mutex))
+	if (mutex_lock_killable(&task->signal->cred_guard_mutex))
 		return NULL;
 
 	mm = get_task_mm(task);
@@ -235,7 +235,7 @@ struct mm_struct *mm_for_maps(struct task_struct *task)
 		mmput(mm);
 		mm = NULL;
 	}
-	mutex_unlock(&task->cred_guard_mutex);
+	mutex_unlock(&task->signal->cred_guard_mutex);
 
 	return mm;
 }
@@ -2273,14 +2273,14 @@ static ssize_t proc_pid_attr_write(struct file * file, const char __user * buf,
 		goto out_free;
 
 	/* Guard against adverse ptrace interaction */
-	length = mutex_lock_interruptible(&task->cred_guard_mutex);
+	length = mutex_lock_interruptible(&task->signal->cred_guard_mutex);
 	if (length < 0)
 		goto out_free;
 
 	length = security_setprocattr(task,
 				      (char*)file->f_path.dentry->d_name.name,
 				      (void*)page, count);
-	mutex_unlock(&task->cred_guard_mutex);
+	mutex_unlock(&task->signal->cred_guard_mutex);
 out_free:
 	free_page((unsigned long) page);
 out:
diff --git a/include/linux/init_task.h b/include/linux/init_task.h
index 1f43fa5..ff3cc33 100644
--- a/include/linux/init_task.h
+++ b/include/linux/init_task.h
@@ -29,6 +29,8 @@ extern struct fs_struct init_fs;
 		.running = 0,						\
 		.lock = __SPIN_LOCK_UNLOCKED(sig.cputimer.lock),	\
 	},								\
+	.cred_guard_mutex =						\
+		 __MUTEX_INITIALIZER(sig.cred_guard_mutex),		\
 }
 
 extern struct nsproxy init_nsproxy;
@@ -139,8 +141,6 @@ extern struct cred init_cred;
 	.group_leader	= &tsk,						\
 	.real_cred	= &init_cred,					\
 	.cred		= &init_cred,					\
-	.cred_guard_mutex =						\
-		 __MUTEX_INITIALIZER(tsk.cred_guard_mutex),		\
 	.comm		= "swapper",					\
 	.thread		= INIT_THREAD,					\
 	.fs		= &init_fs,					\
diff --git a/include/linux/sched.h b/include/linux/sched.h
index bb5bf3d..a7e7c2a 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -623,6 +623,10 @@ struct signal_struct {
 
 	int oom_adj;		/* OOM kill score adjustment (bit shift) */
 	long oom_score_adj;	/* OOM kill score adjustment */
+
+	struct mutex cred_guard_mutex;	/* guard against foreign influences on
+					 * credential calculations
+					 * (notably. ptrace) */
 };
 
 /* Context switch must be unlocked if interrupts are to be enabled */
@@ -1293,9 +1297,6 @@ struct task_struct {
 					 * credentials (COW) */
 	const struct cred *cred;	/* effective (overridable) subjective task
 					 * credentials (COW) */
-	struct mutex cred_guard_mutex;	/* guard against foreign influences on
-					 * credential calculations
-					 * (notably. ptrace) */
 	struct cred *replacement_session_keyring; /* for KEYCTL_SESSION_TO_PARENT */
 
 	char comm[TASK_COMM_LEN]; /* executable name excluding path
diff --git a/kernel/cred.c b/kernel/cred.c
index 9a3e226..a81833d 100644
--- a/kernel/cred.c
+++ b/kernel/cred.c
@@ -384,8 +384,6 @@ int copy_creds(struct task_struct *p, unsigned long clone_flags)
 	struct cred *new;
 	int ret;
 
-	mutex_init(&p->cred_guard_mutex);
-
 	if (
 #ifdef CONFIG_KEYS
 		!p->cred->thread_keyring &&
diff --git a/kernel/fork.c b/kernel/fork.c
index b7e9d60..4c0d3ea 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -904,6 +904,8 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
 	sig->oom_adj = current->signal->oom_adj;
 	sig->oom_score_adj = current->signal->oom_score_adj;
 
+	mutex_init(&sig->cred_guard_mutex);
+
 	return 0;
 }
 
diff --git a/kernel/ptrace.c b/kernel/ptrace.c
index f34d798..ac5013a 100644
--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -181,7 +181,7 @@ int ptrace_attach(struct task_struct *task)
 	 * under ptrace.
 	 */
 	retval = -ERESTARTNOINTR;
-	if (mutex_lock_interruptible(&task->cred_guard_mutex))
+	if (mutex_lock_interruptible(&task->signal->cred_guard_mutex))
 		goto out;
 
 	task_lock(task);
@@ -208,7 +208,7 @@ int ptrace_attach(struct task_struct *task)
 unlock_tasklist:
 	write_unlock_irq(&tasklist_lock);
 unlock_creds:
-	mutex_unlock(&task->cred_guard_mutex);
+	mutex_unlock(&task->signal->cred_guard_mutex);
 out:
 	return retval;
 }
-- 
1.6.5.2




^ permalink raw reply related	[flat|nested] 81+ messages in thread

* Re: [PATCH 2/2] execve: check the VM has enough memory at first
  2010-09-09  5:04             ` KOSAKI Motohiro
@ 2010-09-10 15:06               ` Linus Torvalds
  -1 siblings, 0 replies; 81+ messages in thread
From: Linus Torvalds @ 2010-09-10 15:06 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Roland McGrath, Andrew Morton, linux-kernel, oss-security,
	Solar Designer, Kees Cook, Al Viro, Oleg Nesterov, Neil Horman,
	linux-fsdevel, pageexec,
	Brad Spengler <spender@grsecurity.net>,
	Eugene Teo, KAMEZAWA Hiroyuki

On Wed, Sep 8, 2010 at 10:04 PM, KOSAKI Motohiro
<kosaki.motohiro@jp.fujitsu.com> wrote:
>
> After this patch, execve() expand stack at first and receive to
> check vm_enough_memory() properly. then, too long argument of
> execve() than the machine memory return EFAULT properly.

This is horrible. We don't want to walk the arguments one more time
just for this. Let's just improve the checks that we do as we go
along.

                            Linus

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH 2/2] execve: check the VM has enough memory at first
@ 2010-09-10 15:06               ` Linus Torvalds
  0 siblings, 0 replies; 81+ messages in thread
From: Linus Torvalds @ 2010-09-10 15:06 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Roland McGrath, Andrew Morton, linux-kernel, oss-security,
	Solar Designer, Kees Cook, Al Viro, Oleg Nesterov, Neil Horman,
	linux-fsdevel, pageexec,
	Brad Spengler <spender@grsecurity.net>,
	Eugene Teo, KAMEZAWA Hiroyuki

On Wed, Sep 8, 2010 at 10:04 PM, KOSAKI Motohiro
<kosaki.motohiro@jp.fujitsu.com> wrote:
>
> After this patch, execve() expand stack at first and receive to
> check vm_enough_memory() properly. then, too long argument of
> execve() than the machine memory return EFAULT properly.

This is horrible. We don't want to walk the arguments one more time
just for this. Let's just improve the checks that we do as we go
along.

                            Linus

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH] move cred_guard_mutex from task_struct to signal_struct
  2010-09-10  9:57                 ` KOSAKI Motohiro
@ 2010-09-10 17:24                   ` Oleg Nesterov
  -1 siblings, 0 replies; 81+ messages in thread
From: Oleg Nesterov @ 2010-09-10 17:24 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Roland McGrath, Linus Torvalds, Andrew Morton, linux-kernel,
	oss-security, Solar Designer, Kees Cook, Al Viro, Neil Horman,
	linux-fsdevel, pageexec,
	Brad Spengler <spender@grsecurity.net>,
	Eugene Teo, KAMEZAWA Hiroyuki

On 09/10, KOSAKI Motohiro wrote:
>
> 1) moving cread_guard_mutex itself
>    - no increase execve overhead
> 	-> very good
>    - it also prevent parallel ptrace

No, it doesn't. Only PTRACE_ATTACH needs this mutex, and as Roland
pointed out it also needs write_lock(tasklist) which is worse. So
this change doesn't make any practical harm for ptrace.

> 2) move in_exec_mm to signal_struct too
>    -> very hard. oom-killer can use very few lock because it's called
>       from various place. now both ->mm and ->in_exec_mm are protected
>       task_lock() and it help to avoid messy.

Yes. But, if ->in_exec_mm is only used by oom_badness(), then I think
you can use task_lock(tsk->group_leader). oom_badness() needs tasklist
anyway, this means it can't race with de_thread() changing the leader.
But up to you.

Another very minor nit (but again, up to you). Perhaps exec_mmap()
could clear ->in_exec_mm (in task_struct or signal_struct, this doesnt
matter), it takes task_lock(current) anyway (and at this point current
is always the group leader).

> Let's move ->cred_guard_mutex from task_struct to signal_struct. It
> naturally prevent multiple-threads-inside-exec.

Reviewed-by: Oleg Nesterov <oleg@redhat.com>

This is very minor, but perhaps you can also fix a couple of comments
which mention task->cred_guard_mutex,

	fs/exec.c:1109		the caller must hold current->cred_guard_mutex
	kernel/cred.c:328	The caller must hold current->cred_guard_mutex
	include/linux/tracehook.h:153	@task->cred_guard_mutex

Oleg.

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH] move cred_guard_mutex from task_struct to signal_struct
@ 2010-09-10 17:24                   ` Oleg Nesterov
  0 siblings, 0 replies; 81+ messages in thread
From: Oleg Nesterov @ 2010-09-10 17:24 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Roland McGrath, Linus Torvalds, Andrew Morton, linux-kernel,
	oss-security, Solar Designer, Kees Cook, Al Viro, Neil Horman,
	linux-fsdevel, pageexec,
	Brad Spengler <spender@grsecurity.net>,
	Eugene Teo, KAMEZAWA Hiroyuki

On 09/10, KOSAKI Motohiro wrote:
>
> 1) moving cread_guard_mutex itself
>    - no increase execve overhead
> 	-> very good
>    - it also prevent parallel ptrace

No, it doesn't. Only PTRACE_ATTACH needs this mutex, and as Roland
pointed out it also needs write_lock(tasklist) which is worse. So
this change doesn't make any practical harm for ptrace.

> 2) move in_exec_mm to signal_struct too
>    -> very hard. oom-killer can use very few lock because it's called
>       from various place. now both ->mm and ->in_exec_mm are protected
>       task_lock() and it help to avoid messy.

Yes. But, if ->in_exec_mm is only used by oom_badness(), then I think
you can use task_lock(tsk->group_leader). oom_badness() needs tasklist
anyway, this means it can't race with de_thread() changing the leader.
But up to you.

Another very minor nit (but again, up to you). Perhaps exec_mmap()
could clear ->in_exec_mm (in task_struct or signal_struct, this doesnt
matter), it takes task_lock(current) anyway (and at this point current
is always the group leader).

> Let's move ->cred_guard_mutex from task_struct to signal_struct. It
> naturally prevent multiple-threads-inside-exec.

Reviewed-by: Oleg Nesterov <oleg@redhat.com>

This is very minor, but perhaps you can also fix a couple of comments
which mention task->cred_guard_mutex,

	fs/exec.c:1109		the caller must hold current->cred_guard_mutex
	kernel/cred.c:328	The caller must hold current->cred_guard_mutex
	include/linux/tracehook.h:153	@task->cred_guard_mutex

Oleg.

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH 1/3] setup_arg_pages: diagnose excessive argument size
@ 2010-09-11 13:30                 ` pageexec-Y8qEzhMunLyT9ig0jae3mg
  0 siblings, 0 replies; 81+ messages in thread
From: pageexec @ 2010-09-11 13:30 UTC (permalink / raw)
  To: Roland McGrath
  Cc: Linus Torvalds, Andrew Morton, linux-kernel, oss-security,
	Solar Designer, Kees Cook, Al Viro, Oleg Nesterov,
	KOSAKI Motohiro, Neil Horman, linux-fsdevel,
	Brad Spengler <spender@grsecurity.net> Eugene Teo

On 10 Sep 2010 at 1:59, Roland McGrath wrote:

> > > +	if (unlikely(stack_top < mmap_min_addr) ||
> > 
> > this could arguably elicit some warning, or even better, prevent the
> > sysctl from setting mmap_min_addr too high in the first place.
> 
> This code has local checks to ensure that things don't fail worse later.
> Those are good to have regardless.  If you'd like to add some constraints
> on setting mmap_min_addr, that's certainly fine too.  But it's no reason
> not to have this simple and clear check here.

it's somewhat confusing as it does not immediately relate to the problem
you're addressing. let's take a step back and see what issues we have here:

1. the would-be stack vma shifts down so much that it becomes completely
   invalid (due to the int wraparound on its start address).

2. the would be stack vma shifts down to violate mmap_min_addr.

it's obvious that before ensuring 2, 1 must pass. the check for 1 does not
at all involve mmap_min_addr, whereas both of your checks do. even worse, the
check for 2 doesn't depend on the relation between stack_top and mmap_min_addr
per se (userland has no influence over them), so as i said, it's all confusing.

> > or alternatively, just check for
> > 
> > 	mmap_min_addr > stack_top - (vma->vm_end - vma->vm_start)
> 
> IMHO that is far less clear as to what the intent of the check is than what
> I wrote.

why is that? it's obvious that vm_end-vm_size=vm_size and the check simply
ensures that problem 2 above doesn't occur whereas in neither of your checks
can one see the clear intent for ensuring 2. your 2nd check basically ensures
that vm_size fits in the usable address space bounded by [mmap_min_addr, stack_top]
but it doesn't immediately tell the user why that size check is useful (especially
since we're not concerned with sizes per se, but addresses, it's just a corollary
in this particular case that a size check can replace an address check, but that
doesn't mean it'll help any future reader see the original intent). so while
the math is ok in your checks (modulo the = case), it's a kind of premature
optimization that hides the problems you're addressing.

>  It's especially subtle that this check allows overflow
> and then you check for that later,

no, actually that particular check was for addressing one particular
problem (to show you how to formulate it that makes intent clear), as a
train of thought if you like. the final proposed version (which has been
in PaX) has the int overflow check first and then the mmap_min_addr check.

> rather than the formulation I gave
> where no overflow is possible and it's immediately obvious why. 

you're only focused on the math and let the human intent slip by, that's
the issue i have with your checks (and you didn't get the = corner case
right btw, it should be allowed, nor refused if you want to be pedantic).

really, in security clarity of intent and its manifestation in code takes
precedence over optimization, let the compiler do the latter job.

> > which looks even better if done in shift_arg_pages as i've done it in PaX:
> 
> My change to setup_arg_pages is consistent with the existing
> CONFIG_STACK_GROWSUP code.

the GROWSUP code doesn't use unlikely (very confusing to see it in this
context since you're not addressing any hot path/performance issues), nor
does it care about mmap_min_addr (whereas both of your checks do which
depending on the generated asm, may even be racy when mmap_min_addr changes)
nor does it check for any int overflows. about the only similarity i can
see is its computation of 'vm_size' and an associated size check (properly
commented unlike your code) but it's normal there, whereas the problems you're
addressing are fundamentally vma address related, not vma size related.

> IMHO simple fixes should go in first
> and other restructuring of the code can be done later.

yours is not a simple fix as it doesn't make it clear what two problems
you're trying to address with your checks.

> > > +		return -ENOMEM;
> > 
> > i'd say EFAULT is more consistent, ENOMEM is used for cases when the kernel
> > couldn't allocate physical memory for something, EFAULT is when there's some
> > issue with the address space itself (based on the reaction to find_vma or
> > expand_stack failures).
> 
> I disagree.

well, in that case the EFAULT return codes in this function need some
cleanup too, you know, in the name of consistency ;).

> > also what about the BUG_ON in shift_arg_pages? it could go now, right?
> 
> It is now impossible by the logic of the arithmetic, yes.  But it is
> another local sanity check asserting the assumptions of the code in that
> function.  So there is no reason to 

there is no reason for it to be a BUG_ON (never was in fact), the kernel
can easily recover from the detected condition.

cheers,

 PaX Team

PS: i thought it was common courtesy and customary to credit people in the
commit who found/reported/fixed problems.

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH 1/3] setup_arg_pages: diagnose excessive argument size
@ 2010-09-11 13:30                 ` pageexec-Y8qEzhMunLyT9ig0jae3mg
  0 siblings, 0 replies; 81+ messages in thread
From: pageexec-Y8qEzhMunLyT9ig0jae3mg @ 2010-09-11 13:30 UTC (permalink / raw)
  To: Roland McGrath
  Cc: Linus Torvalds, Andrew Morton,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	oss-security-ZwoEplunGu1jrUoiu81ncdBPR1lH4CV8, Solar Designer,
	Kees Cook, Al Viro, Oleg Nesterov, KOSAKI Motohiro, Neil Horman,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	Brad Spengler
	<spender-JNS0hek0TMl4qEwOxq4T+Q@public.gmane.org> Eugene
	Teo

On 10 Sep 2010 at 1:59, Roland McGrath wrote:

> > > +	if (unlikely(stack_top < mmap_min_addr) ||
> > 
> > this could arguably elicit some warning, or even better, prevent the
> > sysctl from setting mmap_min_addr too high in the first place.
> 
> This code has local checks to ensure that things don't fail worse later.
> Those are good to have regardless.  If you'd like to add some constraints
> on setting mmap_min_addr, that's certainly fine too.  But it's no reason
> not to have this simple and clear check here.

it's somewhat confusing as it does not immediately relate to the problem
you're addressing. let's take a step back and see what issues we have here:

1. the would-be stack vma shifts down so much that it becomes completely
   invalid (due to the int wraparound on its start address).

2. the would be stack vma shifts down to violate mmap_min_addr.

it's obvious that before ensuring 2, 1 must pass. the check for 1 does not
at all involve mmap_min_addr, whereas both of your checks do. even worse, the
check for 2 doesn't depend on the relation between stack_top and mmap_min_addr
per se (userland has no influence over them), so as i said, it's all confusing.

> > or alternatively, just check for
> > 
> > 	mmap_min_addr > stack_top - (vma->vm_end - vma->vm_start)
> 
> IMHO that is far less clear as to what the intent of the check is than what
> I wrote.

why is that? it's obvious that vm_end-vm_size=vm_size and the check simply
ensures that problem 2 above doesn't occur whereas in neither of your checks
can one see the clear intent for ensuring 2. your 2nd check basically ensures
that vm_size fits in the usable address space bounded by [mmap_min_addr, stack_top]
but it doesn't immediately tell the user why that size check is useful (especially
since we're not concerned with sizes per se, but addresses, it's just a corollary
in this particular case that a size check can replace an address check, but that
doesn't mean it'll help any future reader see the original intent). so while
the math is ok in your checks (modulo the = case), it's a kind of premature
optimization that hides the problems you're addressing.

>  It's especially subtle that this check allows overflow
> and then you check for that later,

no, actually that particular check was for addressing one particular
problem (to show you how to formulate it that makes intent clear), as a
train of thought if you like. the final proposed version (which has been
in PaX) has the int overflow check first and then the mmap_min_addr check.

> rather than the formulation I gave
> where no overflow is possible and it's immediately obvious why. 

you're only focused on the math and let the human intent slip by, that's
the issue i have with your checks (and you didn't get the = corner case
right btw, it should be allowed, nor refused if you want to be pedantic).

really, in security clarity of intent and its manifestation in code takes
precedence over optimization, let the compiler do the latter job.

> > which looks even better if done in shift_arg_pages as i've done it in PaX:
> 
> My change to setup_arg_pages is consistent with the existing
> CONFIG_STACK_GROWSUP code.

the GROWSUP code doesn't use unlikely (very confusing to see it in this
context since you're not addressing any hot path/performance issues), nor
does it care about mmap_min_addr (whereas both of your checks do which
depending on the generated asm, may even be racy when mmap_min_addr changes)
nor does it check for any int overflows. about the only similarity i can
see is its computation of 'vm_size' and an associated size check (properly
commented unlike your code) but it's normal there, whereas the problems you're
addressing are fundamentally vma address related, not vma size related.

> IMHO simple fixes should go in first
> and other restructuring of the code can be done later.

yours is not a simple fix as it doesn't make it clear what two problems
you're trying to address with your checks.

> > > +		return -ENOMEM;
> > 
> > i'd say EFAULT is more consistent, ENOMEM is used for cases when the kernel
> > couldn't allocate physical memory for something, EFAULT is when there's some
> > issue with the address space itself (based on the reaction to find_vma or
> > expand_stack failures).
> 
> I disagree.

well, in that case the EFAULT return codes in this function need some
cleanup too, you know, in the name of consistency ;).

> > also what about the BUG_ON in shift_arg_pages? it could go now, right?
> 
> It is now impossible by the logic of the arithmetic, yes.  But it is
> another local sanity check asserting the assumptions of the code in that
> function.  So there is no reason to 

there is no reason for it to be a BUG_ON (never was in fact), the kernel
can easily recover from the detected condition.

cheers,

 PaX Team

PS: i thought it was common courtesy and customary to credit people in the
commit who found/reported/fixed problems.

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH 1/3] setup_arg_pages: diagnose excessive argument size
@ 2010-09-11 13:39                   ` pageexec-Y8qEzhMunLyT9ig0jae3mg
  0 siblings, 0 replies; 81+ messages in thread
From: pageexec @ 2010-09-11 13:39 UTC (permalink / raw)
  To: KOSAKI Motohiro, Roland McGrath
  Cc: Brad Spengler, Linus Torvalds, Andrew Morton, linux-kernel,
	oss-security, Solar Designer, Kees Cook, Al Viro, Oleg Nesterov,
	Neil Horman, linux-fsdevel, Eugene Teo

On 10 Sep 2010 at 2:25, Roland McGrath wrote:

> > Brad, sorry, I have bad news. glibc sysconf(_SC_ARG_MAX) is implemented
> > by hard coded RLIMIT_STACK/4 heuristics. That said, at least _now_, we
> > can't change this even though you disliked. That said, we can't break
> > userland even though userland library is very crazy.
> 
> I'm sorry you think it's "very crazy" to implement the required
> functionality in the only way available.  POSIX requires that execve
> fail with E2BIG when the ARG_MAX limit is exceeded.  sysconf has to
> return the correct actual limit that execve will enforce so that a
> conforming application knows how much it can safely attempt to use.
> Since the kernel uses the hard-coded RLIMIT_STACK/4 heuristic and does
> not expose the true manifest limit any other way, sysconf has to
> parallel the kernel's calculation.

no it doesn't have to, similarly to how it doesn't have to hardcode
_SC_PAGESIZE either, AT_PAGESZ tells userland what it needs to know
and i think AT_ARGMAX could exist just as well.


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH 1/3] setup_arg_pages: diagnose excessive argument size
@ 2010-09-11 13:39                   ` pageexec-Y8qEzhMunLyT9ig0jae3mg
  0 siblings, 0 replies; 81+ messages in thread
From: pageexec-Y8qEzhMunLyT9ig0jae3mg @ 2010-09-11 13:39 UTC (permalink / raw)
  To: KOSAKI Motohiro, Roland McGrath
  Cc: Brad Spengler, Linus Torvalds, Andrew Morton,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	oss-security-ZwoEplunGu1jrUoiu81ncdBPR1lH4CV8, Solar Designer,
	Kees Cook, Al Viro, Oleg Nesterov, Neil Horman,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Eugene Teo

On 10 Sep 2010 at 2:25, Roland McGrath wrote:

> > Brad, sorry, I have bad news. glibc sysconf(_SC_ARG_MAX) is implemented
> > by hard coded RLIMIT_STACK/4 heuristics. That said, at least _now_, we
> > can't change this even though you disliked. That said, we can't break
> > userland even though userland library is very crazy.
> 
> I'm sorry you think it's "very crazy" to implement the required
> functionality in the only way available.  POSIX requires that execve
> fail with E2BIG when the ARG_MAX limit is exceeded.  sysconf has to
> return the correct actual limit that execve will enforce so that a
> conforming application knows how much it can safely attempt to use.
> Since the kernel uses the hard-coded RLIMIT_STACK/4 heuristic and does
> not expose the true manifest limit any other way, sysconf has to
> parallel the kernel's calculation.

no it doesn't have to, similarly to how it doesn't have to hardcode
_SC_PAGESIZE either, AT_PAGESZ tells userland what it needs to know
and i think AT_ARGMAX could exist just as well.

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH 2/2] execve: check the VM has enough memory at first
  2010-09-10 15:06               ` Linus Torvalds
@ 2010-09-14  1:52                 ` KOSAKI Motohiro
  -1 siblings, 0 replies; 81+ messages in thread
From: KOSAKI Motohiro @ 2010-09-14  1:52 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: kosaki.motohiro, Roland McGrath, Andrew Morton, linux-kernel,
	oss-security, Solar Designer, Kees Cook, Al Viro, Oleg Nesterov,
	Neil Horman, linux-fsdevel, pageexec,
	Brad Spengler <spender@grsecurity.net>,
	Eugene Teo, KAMEZAWA Hiroyuki

> On Wed, Sep 8, 2010 at 10:04 PM, KOSAKI Motohiro
> <kosaki.motohiro@jp.fujitsu.com> wrote:
> >
> > After this patch, execve() expand stack at first and receive to
> > check vm_enough_memory() properly. then, too long argument of
> > execve() than the machine memory return EFAULT properly.
> 
> This is horrible. We don't want to walk the arguments one more time
> just for this. Let's just improve the checks that we do as we go
> along.
> 
>                             Linus

Okey. I'll consider new way in this night.




^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH 2/2] execve: check the VM has enough memory at first
@ 2010-09-14  1:52                 ` KOSAKI Motohiro
  0 siblings, 0 replies; 81+ messages in thread
From: KOSAKI Motohiro @ 2010-09-14  1:52 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: kosaki.motohiro, Roland McGrath, Andrew Morton, linux-kernel,
	oss-security, Solar Designer, Kees Cook, Al Viro, Oleg Nesterov,
	Neil Horman, linux-fsdevel, pageexec,
	Brad Spengler <spender@grsecurity.net>,
	Eugene Teo, KAMEZAWA Hiroyuki

> On Wed, Sep 8, 2010 at 10:04 PM, KOSAKI Motohiro
> <kosaki.motohiro@jp.fujitsu.com> wrote:
> >
> > After this patch, execve() expand stack at first and receive to
> > check vm_enough_memory() properly. then, too long argument of
> > execve() than the machine memory return EFAULT properly.
> 
> This is horrible. We don't want to walk the arguments one more time
> just for this. Let's just improve the checks that we do as we go
> along.
> 
>                             Linus

Okey. I'll consider new way in this night.




^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH 1/3] setup_arg_pages: diagnose excessive argument size
@ 2010-09-14 18:51                     ` Roland McGrath
  0 siblings, 0 replies; 81+ messages in thread
From: Roland McGrath @ 2010-09-14 18:51 UTC (permalink / raw)
  To: pageexec
  Cc: KOSAKI Motohiro, Brad Spengler, Linus Torvalds, Andrew Morton,
	linux-kernel, oss-security, Solar Designer, Kees Cook, Al Viro,
	Oleg Nesterov, Neil Horman, linux-fsdevel, Eugene Teo

> no it doesn't have to, similarly to how it doesn't have to hardcode
> _SC_PAGESIZE either, AT_PAGESZ tells userland what it needs to know
> and i think AT_ARGMAX could exist just as well.

I was referring to the ways available to userland heretofore.  Certainly,
the kernel could add new ways and then userland could do different things
(with new kernels).  

auxv in particular is not a mechanism that could fit for this.  The actual
limit depends on rlimits of the calling process, and rlimits can change
during the life of the program.  auxv is only appropriate for things that
are known at the time of the exec and won't change thereafter.

Thanks,
Roland

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH 1/3] setup_arg_pages: diagnose excessive argument size
@ 2010-09-14 18:51                     ` Roland McGrath
  0 siblings, 0 replies; 81+ messages in thread
From: Roland McGrath @ 2010-09-14 18:51 UTC (permalink / raw)
  To: pageexec-Y8qEzhMunLyT9ig0jae3mg
  Cc: KOSAKI Motohiro, Brad Spengler, Linus Torvalds, Andrew Morton,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	oss-security-ZwoEplunGu1jrUoiu81ncdBPR1lH4CV8, Solar Designer,
	Kees Cook, Al Viro, Oleg Nesterov, Neil Horman,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Eugene Teo

> no it doesn't have to, similarly to how it doesn't have to hardcode
> _SC_PAGESIZE either, AT_PAGESZ tells userland what it needs to know
> and i think AT_ARGMAX could exist just as well.

I was referring to the ways available to userland heretofore.  Certainly,
the kernel could add new ways and then userland could do different things
(with new kernels).  

auxv in particular is not a mechanism that could fit for this.  The actual
limit depends on rlimits of the calling process, and rlimits can change
during the life of the program.  auxv is only appropriate for things that
are known at the time of the exec and won't change thereafter.

Thanks,
Roland

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH 1/3] setup_arg_pages: diagnose excessive argument size
  2010-09-11 13:30                 ` pageexec-Y8qEzhMunLyT9ig0jae3mg
  (?)
@ 2010-09-14 19:33                 ` Roland McGrath
  2010-09-14 22:35                     ` pageexec-Y8qEzhMunLyT9ig0jae3mg
  -1 siblings, 1 reply; 81+ messages in thread
From: Roland McGrath @ 2010-09-14 19:33 UTC (permalink / raw)
  To: pageexec
  Cc: Linus Torvalds, Andrew Morton, linux-kernel, oss-security,
	Solar Designer, Kees Cook, Al Viro, Oleg Nesterov,
	KOSAKI Motohiro, Neil Horman, linux-fsdevel, Brad Spengler,
	Eugene Teo

Readability and obviousness are in the eye of the beholder.   
Linus has merged my patches, so this crash is now fixed.  
As I said before, you should by all means send further clean-up
patches if you think they are improvements.

I have no special interest in this area.  I gave my opinions and
advice because I was CC'd, and then submitted some patches myself
because I was asked to.  I can't really keep track of all the
formalities.  The CC list on my submissions represented the set
of people I knew to be involved in the discussion.  I'm sorry if
you felt slighted.  I just pay attention to the code, not the
personalities.  The most reliable way to make sure your name is
associated with a change is to submit patches yourself in a form
that Linus wants to merge.

Thanks,
Roland

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH 1/3] setup_arg_pages: diagnose excessive argument size
@ 2010-09-14 20:28                       ` pageexec-Y8qEzhMunLyT9ig0jae3mg
  0 siblings, 0 replies; 81+ messages in thread
From: pageexec @ 2010-09-14 20:28 UTC (permalink / raw)
  To: Roland McGrath
  Cc: KOSAKI Motohiro, Brad Spengler, Linus Torvalds, Andrew Morton,
	linux-kernel, oss-security, Solar Designer, Kees Cook, Al Viro,
	Oleg Nesterov, Neil Horman, linux-fsdevel, Eugene Teo

On 14 Sep 2010 at 11:51, Roland McGrath wrote:

> > no it doesn't have to, similarly to how it doesn't have to hardcode
> > _SC_PAGESIZE either, AT_PAGESZ tells userland what it needs to know
> > and i think AT_ARGMAX could exist just as well.
> 
> I was referring to the ways available to userland heretofore.  Certainly,
> the kernel could add new ways and then userland could do different things
> (with new kernels).  
> 
> auxv in particular is not a mechanism that could fit for this.  The actual
> limit depends on rlimits of the calling process, and rlimits can change
> during the life of the program.

obviously an AT_ARGMAX computed at execve time would be based on the rlimits
as well and if later userland changed the rlimits, it'd be userland's problem,
not that of the kernel (or the kernel could refuse a change that would violate
its earlier promise).

>  auxv is only appropriate for things that
> are known at the time of the exec and won't change thereafter.

you mean stuff like AT_EUID et al.? ;)


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH 1/3] setup_arg_pages: diagnose excessive argument size
@ 2010-09-14 20:28                       ` pageexec-Y8qEzhMunLyT9ig0jae3mg
  0 siblings, 0 replies; 81+ messages in thread
From: pageexec-Y8qEzhMunLyT9ig0jae3mg @ 2010-09-14 20:28 UTC (permalink / raw)
  To: Roland McGrath
  Cc: KOSAKI Motohiro, Brad Spengler, Linus Torvalds, Andrew Morton,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	oss-security-ZwoEplunGu1jrUoiu81ncdBPR1lH4CV8, Solar Designer,
	Kees Cook, Al Viro, Oleg Nesterov, Neil Horman,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Eugene Teo

On 14 Sep 2010 at 11:51, Roland McGrath wrote:

> > no it doesn't have to, similarly to how it doesn't have to hardcode
> > _SC_PAGESIZE either, AT_PAGESZ tells userland what it needs to know
> > and i think AT_ARGMAX could exist just as well.
> 
> I was referring to the ways available to userland heretofore.  Certainly,
> the kernel could add new ways and then userland could do different things
> (with new kernels).  
> 
> auxv in particular is not a mechanism that could fit for this.  The actual
> limit depends on rlimits of the calling process, and rlimits can change
> during the life of the program.

obviously an AT_ARGMAX computed at execve time would be based on the rlimits
as well and if later userland changed the rlimits, it'd be userland's problem,
not that of the kernel (or the kernel could refuse a change that would violate
its earlier promise).

>  auxv is only appropriate for things that
> are known at the time of the exec and won't change thereafter.

you mean stuff like AT_EUID et al.? ;)

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH 1/3] setup_arg_pages: diagnose excessive argument size
@ 2010-09-14 21:16                         ` Roland McGrath
  0 siblings, 0 replies; 81+ messages in thread
From: Roland McGrath @ 2010-09-14 21:16 UTC (permalink / raw)
  To: pageexec
  Cc: KOSAKI Motohiro, Brad Spengler, Linus Torvalds, Andrew Morton,
	linux-kernel, oss-security, Solar Designer, Kees Cook, Al Viro,
	Oleg Nesterov, Neil Horman, linux-fsdevel, Eugene Teo

> obviously an AT_ARGMAX computed at execve time would be based on the rlimits
> as well and if later userland changed the rlimits, it'd be userland's problem,
> not that of the kernel (or the kernel could refuse a change that would violate
> its earlier promise).

This would thoroughly defeat the purpose of adding the thing.  The
only reason to have a new thing is so that userland does not have to
mirror the kernel's policy (as it attempts to do now, with the 1/4
calculation).  If the new thing is not something that userland can
use consistently so as not to have to know what the kernel's actual
policy is, then I don't see the point of it at all.

> >  auxv is only appropriate for things that
> > are known at the time of the exec and won't change thereafter.
> 
> you mean stuff like AT_EUID et al.? ;)

The information that these give is about the conditions at startup.
That's what they mean to userland, and userland only uses them to know
the situation before it has made any calls.  The definition of AT_EUID
is "effective user ID at program startup", and that fact does not
change.  You proposed AT_ARGMAX for a purpose that requires knowing
the current information in the process at the time it might attempt an
execve call, not at startup.  It is not an equivalent case.

Thanks,
Roland

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH 1/3] setup_arg_pages: diagnose excessive argument size
@ 2010-09-14 21:16                         ` Roland McGrath
  0 siblings, 0 replies; 81+ messages in thread
From: Roland McGrath @ 2010-09-14 21:16 UTC (permalink / raw)
  To: pageexec-Y8qEzhMunLyT9ig0jae3mg
  Cc: KOSAKI Motohiro, Brad Spengler, Linus Torvalds, Andrew Morton,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	oss-security-ZwoEplunGu1jrUoiu81ncdBPR1lH4CV8, Solar Designer,
	Kees Cook, Al Viro, Oleg Nesterov, Neil Horman,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Eugene Teo

> obviously an AT_ARGMAX computed at execve time would be based on the rlimits
> as well and if later userland changed the rlimits, it'd be userland's problem,
> not that of the kernel (or the kernel could refuse a change that would violate
> its earlier promise).

This would thoroughly defeat the purpose of adding the thing.  The
only reason to have a new thing is so that userland does not have to
mirror the kernel's policy (as it attempts to do now, with the 1/4
calculation).  If the new thing is not something that userland can
use consistently so as not to have to know what the kernel's actual
policy is, then I don't see the point of it at all.

> >  auxv is only appropriate for things that
> > are known at the time of the exec and won't change thereafter.
> 
> you mean stuff like AT_EUID et al.? ;)

The information that these give is about the conditions at startup.
That's what they mean to userland, and userland only uses them to know
the situation before it has made any calls.  The definition of AT_EUID
is "effective user ID at program startup", and that fact does not
change.  You proposed AT_ARGMAX for a purpose that requires knowing
the current information in the process at the time it might attempt an
execve call, not at startup.  It is not an equivalent case.

Thanks,
Roland

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH 1/3] setup_arg_pages: diagnose excessive argument size
@ 2010-09-14 22:27                           ` pageexec-Y8qEzhMunLyT9ig0jae3mg
  0 siblings, 0 replies; 81+ messages in thread
From: pageexec @ 2010-09-14 22:27 UTC (permalink / raw)
  To: Roland McGrath
  Cc: KOSAKI Motohiro, Brad Spengler, Linus Torvalds, Andrew Morton,
	linux-kernel, oss-security, Solar Designer, Kees Cook, Al Viro,
	Oleg Nesterov, Neil Horman, linux-fsdevel, Eugene Teo

On 14 Sep 2010 at 14:16, Roland McGrath wrote:

> > obviously an AT_ARGMAX computed at execve time would be based on the rlimits
> > as well and if later userland changed the rlimits, it'd be userland's problem,
> > not that of the kernel (or the kernel could refuse a change that would violate
> > its earlier promise).
> 
> This would thoroughly defeat the purpose of adding the thing.  The
> only reason to have a new thing is so that userland does not have to
> mirror the kernel's policy (as it attempts to do now, with the 1/4
> calculation).  If the new thing is not something that userland can
> use consistently so as not to have to know what the kernel's actual
> policy is, then I don't see the point of it at all.

userland could never rely on the kernel's policy at all since get_arg_page
could have failed for more reasons than overstepping the currently hardcoded
ARG_MAX check in there. so what AT_ARGMAX would buy us is to allow the kernel
policy to change over time, but it's never been about guarantees, whether
POSIX wants such a thing or not.

> > >  auxv is only appropriate for things that
> > > are known at the time of the exec and won't change thereafter.
> > 
> > you mean stuff like AT_EUID et al.? ;)
> 
> The information that these give is about the conditions at startup.
> That's what they mean to userland, and userland only uses them to know
> the situation before it has made any calls.  The definition of AT_EUID
> is "effective user ID at program startup", and that fact does not
> change.

just for my own curiosity, where does this definition come from?


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH 1/3] setup_arg_pages: diagnose excessive argument size
@ 2010-09-14 22:27                           ` pageexec-Y8qEzhMunLyT9ig0jae3mg
  0 siblings, 0 replies; 81+ messages in thread
From: pageexec-Y8qEzhMunLyT9ig0jae3mg @ 2010-09-14 22:27 UTC (permalink / raw)
  To: Roland McGrath
  Cc: KOSAKI Motohiro, Brad Spengler, Linus Torvalds, Andrew Morton,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	oss-security-ZwoEplunGu1jrUoiu81ncdBPR1lH4CV8, Solar Designer,
	Kees Cook, Al Viro, Oleg Nesterov, Neil Horman,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Eugene Teo

On 14 Sep 2010 at 14:16, Roland McGrath wrote:

> > obviously an AT_ARGMAX computed at execve time would be based on the rlimits
> > as well and if later userland changed the rlimits, it'd be userland's problem,
> > not that of the kernel (or the kernel could refuse a change that would violate
> > its earlier promise).
> 
> This would thoroughly defeat the purpose of adding the thing.  The
> only reason to have a new thing is so that userland does not have to
> mirror the kernel's policy (as it attempts to do now, with the 1/4
> calculation).  If the new thing is not something that userland can
> use consistently so as not to have to know what the kernel's actual
> policy is, then I don't see the point of it at all.

userland could never rely on the kernel's policy at all since get_arg_page
could have failed for more reasons than overstepping the currently hardcoded
ARG_MAX check in there. so what AT_ARGMAX would buy us is to allow the kernel
policy to change over time, but it's never been about guarantees, whether
POSIX wants such a thing or not.

> > >  auxv is only appropriate for things that
> > > are known at the time of the exec and won't change thereafter.
> > 
> > you mean stuff like AT_EUID et al.? ;)
> 
> The information that these give is about the conditions at startup.
> That's what they mean to userland, and userland only uses them to know
> the situation before it has made any calls.  The definition of AT_EUID
> is "effective user ID at program startup", and that fact does not
> change.

just for my own curiosity, where does this definition come from?

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH 1/3] setup_arg_pages: diagnose excessive argument size
@ 2010-09-14 22:35                     ` pageexec-Y8qEzhMunLyT9ig0jae3mg
  0 siblings, 0 replies; 81+ messages in thread
From: pageexec @ 2010-09-14 22:35 UTC (permalink / raw)
  To: Roland McGrath
  Cc: Linus Torvalds, Andrew Morton, linux-kernel, oss-security,
	Solar Designer, Kees Cook, Al Viro, Oleg Nesterov,
	KOSAKI Motohiro, Neil Horman, linux-fsdevel, Brad Spengler,
	Eugene Teo

On 14 Sep 2010 at 12:33, Roland McGrath wrote:

> I have no special interest in this area.  I gave my opinions and
> advice because I was CC'd, and then submitted some patches myself
> because I was asked to.  I can't really keep track of all the
> formalities.  The CC list on my submissions represented the set
> of people I knew to be involved in the discussion.  I'm sorry if
> you felt slighted.  I just pay attention to the code, not the
> personalities.

i was referring to Brad, since he reported this (and the wireless
heap infoleak which he also wasn't credited for). anyway, we won't be
submitting any further vulnerability reports, so no need to worry about
any 'formalities' in the future.

>  The most reliable way to make sure your name is
> associated with a change is to submit patches yourself in a form
> that Linus wants to merge.

his conditions are unacceptable and therefore it's not going to happen.


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH 1/3] setup_arg_pages: diagnose excessive argument size
@ 2010-09-14 22:35                     ` pageexec-Y8qEzhMunLyT9ig0jae3mg
  0 siblings, 0 replies; 81+ messages in thread
From: pageexec-Y8qEzhMunLyT9ig0jae3mg @ 2010-09-14 22:35 UTC (permalink / raw)
  To: Roland McGrath
  Cc: Linus Torvalds, Andrew Morton,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	oss-security-ZwoEplunGu1jrUoiu81ncdBPR1lH4CV8, Solar Designer,
	Kees Cook, Al Viro, Oleg Nesterov, KOSAKI Motohiro, Neil Horman,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Brad Spengler, Eugene Teo

On 14 Sep 2010 at 12:33, Roland McGrath wrote:

> I have no special interest in this area.  I gave my opinions and
> advice because I was CC'd, and then submitted some patches myself
> because I was asked to.  I can't really keep track of all the
> formalities.  The CC list on my submissions represented the set
> of people I knew to be involved in the discussion.  I'm sorry if
> you felt slighted.  I just pay attention to the code, not the
> personalities.

i was referring to Brad, since he reported this (and the wireless
heap infoleak which he also wasn't credited for). anyway, we won't be
submitting any further vulnerability reports, so no need to worry about
any 'formalities' in the future.

>  The most reliable way to make sure your name is
> associated with a change is to submit patches yourself in a form
> that Linus wants to merge.

his conditions are unacceptable and therefore it's not going to happen.

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH 1/3] setup_arg_pages: diagnose excessive argument size
@ 2010-09-14 23:04                             ` Roland McGrath
  0 siblings, 0 replies; 81+ messages in thread
From: Roland McGrath @ 2010-09-14 23:04 UTC (permalink / raw)
  To: pageexec
  Cc: KOSAKI Motohiro, Brad Spengler, Linus Torvalds, Andrew Morton,
	linux-kernel, oss-security, Solar Designer, Kees Cook, Al Viro,
	Oleg Nesterov, Neil Horman, linux-fsdevel, Eugene Teo

> userland could never rely on the kernel's policy at all since get_arg_page
> could have failed for more reasons than overstepping the currently hardcoded
> ARG_MAX check in there. 

I don't see how it could fail except for OOM cases where get_user_pages()
failed rather than blocking.  Is that what you mean?

> so what AT_ARGMAX would buy us is to allow the kernel
> policy to change over time, but it's never been about guarantees, whether
> POSIX wants such a thing or not.

I understand the motivation for an explicit mechanism for the kernel to
tell userland its limit.  Since the kernel policy today depends on
something that can change between execs, AT_ARGMAX is inadequate for
that purpose for today's policy, let alone any future different policy.

> > The information that these give is about the conditions at startup.
> > That's what they mean to userland, and userland only uses them to know
> > the situation before it has made any calls.  The definition of AT_EUID
> > is "effective user ID at program startup", and that fact does not
> > change.
> 
> just for my own curiosity, where does this definition come from?

You mean documentation?  I'm not really sure if there is any for that.
But it's the inherent definition of auxv that all its information can
only be about the conditions at program startup.

Thanks,
Roland

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH 1/3] setup_arg_pages: diagnose excessive argument size
@ 2010-09-14 23:04                             ` Roland McGrath
  0 siblings, 0 replies; 81+ messages in thread
From: Roland McGrath @ 2010-09-14 23:04 UTC (permalink / raw)
  To: pageexec-Y8qEzhMunLyT9ig0jae3mg
  Cc: KOSAKI Motohiro, Brad Spengler, Linus Torvalds, Andrew Morton,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	oss-security-ZwoEplunGu1jrUoiu81ncdBPR1lH4CV8, Solar Designer,
	Kees Cook, Al Viro, Oleg Nesterov, Neil Horman,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Eugene Teo

> userland could never rely on the kernel's policy at all since get_arg_page
> could have failed for more reasons than overstepping the currently hardcoded
> ARG_MAX check in there. 

I don't see how it could fail except for OOM cases where get_user_pages()
failed rather than blocking.  Is that what you mean?

> so what AT_ARGMAX would buy us is to allow the kernel
> policy to change over time, but it's never been about guarantees, whether
> POSIX wants such a thing or not.

I understand the motivation for an explicit mechanism for the kernel to
tell userland its limit.  Since the kernel policy today depends on
something that can change between execs, AT_ARGMAX is inadequate for
that purpose for today's policy, let alone any future different policy.

> > The information that these give is about the conditions at startup.
> > That's what they mean to userland, and userland only uses them to know
> > the situation before it has made any calls.  The definition of AT_EUID
> > is "effective user ID at program startup", and that fact does not
> > change.
> 
> just for my own curiosity, where does this definition come from?

You mean documentation?  I'm not really sure if there is any for that.
But it's the inherent definition of auxv that all its information can
only be about the conditions at program startup.

Thanks,
Roland

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH 1/3] setup_arg_pages: diagnose excessive argument size
@ 2010-09-15  9:27                               ` pageexec-Y8qEzhMunLyT9ig0jae3mg
  0 siblings, 0 replies; 81+ messages in thread
From: pageexec @ 2010-09-15  9:27 UTC (permalink / raw)
  To: Roland McGrath
  Cc: KOSAKI Motohiro, Brad Spengler, Linus Torvalds, Andrew Morton,
	linux-kernel, oss-security, Solar Designer, Kees Cook, Al Viro,
	Oleg Nesterov, Neil Horman, linux-fsdevel, Eugene Teo

On 14 Sep 2010 at 16:04, Roland McGrath wrote:

> > userland could never rely on the kernel's policy at all since get_arg_page
> > could have failed for more reasons than overstepping the currently hardcoded
> > ARG_MAX check in there. 
> 
> I don't see how it could fail except for OOM cases where get_user_pages()
> failed rather than blocking.  Is that what you mean?

yes but it's not only OOM (ENOMEM from some allocation), but it can be also
EPERM from LSM (if mmap_min_addr is set too high) or EFAULT from get_user_pages
(e.g., if VM_FAULT_HWPOISON was returned for a requested page).


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH 1/3] setup_arg_pages: diagnose excessive argument size
@ 2010-09-15  9:27                               ` pageexec-Y8qEzhMunLyT9ig0jae3mg
  0 siblings, 0 replies; 81+ messages in thread
From: pageexec-Y8qEzhMunLyT9ig0jae3mg @ 2010-09-15  9:27 UTC (permalink / raw)
  To: Roland McGrath
  Cc: KOSAKI Motohiro, Brad Spengler, Linus Torvalds, Andrew Morton,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	oss-security-ZwoEplunGu1jrUoiu81ncdBPR1lH4CV8, Solar Designer,
	Kees Cook, Al Viro, Oleg Nesterov, Neil Horman,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Eugene Teo

On 14 Sep 2010 at 16:04, Roland McGrath wrote:

> > userland could never rely on the kernel's policy at all since get_arg_page
> > could have failed for more reasons than overstepping the currently hardcoded
> > ARG_MAX check in there. 
> 
> I don't see how it could fail except for OOM cases where get_user_pages()
> failed rather than blocking.  Is that what you mean?

yes but it's not only OOM (ENOMEM from some allocation), but it can be also
EPERM from LSM (if mmap_min_addr is set too high) or EFAULT from get_user_pages
(e.g., if VM_FAULT_HWPOISON was returned for a requested page).

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH] move cred_guard_mutex from task_struct to signal_struct
  2010-09-10 17:24                   ` Oleg Nesterov
@ 2010-09-16  5:51                     ` KOSAKI Motohiro
  -1 siblings, 0 replies; 81+ messages in thread
From: KOSAKI Motohiro @ 2010-09-16  5:51 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: kosaki.motohiro, Roland McGrath, Linus Torvalds, Andrew Morton,
	linux-kernel, oss-security, Solar Designer, Kees Cook, Al Viro,
	Neil Horman, linux-fsdevel, pageexec,
	Brad Spengler <spender@grsecurity.net>,
	Eugene Teo, KAMEZAWA Hiroyuki

> On 09/10, KOSAKI Motohiro wrote:
> >
> > 1) moving cread_guard_mutex itself
> >    - no increase execve overhead
> > 	-> very good
> >    - it also prevent parallel ptrace
> 
> No, it doesn't. Only PTRACE_ATTACH needs this mutex, and as Roland
> pointed out it also needs write_lock(tasklist) which is worse. So
> this change doesn't make any practical harm for ptrace.

I see, thanks.

> 
> > 2) move in_exec_mm to signal_struct too
> >    -> very hard. oom-killer can use very few lock because it's called
> >       from various place. now both ->mm and ->in_exec_mm are protected
> >       task_lock() and it help to avoid messy.
> 
> Yes. But, if ->in_exec_mm is only used by oom_badness(), then I think
> you can use task_lock(tsk->group_leader). oom_badness() needs tasklist
> anyway, this means it can't race with de_thread() changing the leader.
> But up to you.

Good idea. will fix.

> 
> Another very minor nit (but again, up to you). Perhaps exec_mmap()
> could clear ->in_exec_mm (in task_struct or signal_struct, this doesnt
> matter), it takes task_lock(current) anyway (and at this point current
> is always the group leader).

Thanks. will fix.


> 
> > Let's move ->cred_guard_mutex from task_struct to signal_struct. It
> > naturally prevent multiple-threads-inside-exec.
> 
> Reviewed-by: Oleg Nesterov <oleg@redhat.com>
> 
> 
> This is very minor, but perhaps you can also fix a couple of comments
> which mention task->cred_guard_mutex,
> 
> 	fs/exec.c:1109		the caller must hold current->cred_guard_mutex
> 	kernel/cred.c:328	The caller must hold current->cred_guard_mutex
> 	include/linux/tracehook.h:153	@task->cred_guard_mutex

Will fix, of cource.





^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH] move cred_guard_mutex from task_struct to signal_struct
@ 2010-09-16  5:51                     ` KOSAKI Motohiro
  0 siblings, 0 replies; 81+ messages in thread
From: KOSAKI Motohiro @ 2010-09-16  5:51 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: kosaki.motohiro, Roland McGrath, Linus Torvalds, Andrew Morton,
	linux-kernel, oss-security, Solar Designer, Kees Cook, Al Viro,
	Neil Horman, linux-fsdevel, pageexec,
	Brad Spengler <spender@grsecurity.net>,
	Eugene Teo, KAMEZAWA Hiroyuki

> On 09/10, KOSAKI Motohiro wrote:
> >
> > 1) moving cread_guard_mutex itself
> >    - no increase execve overhead
> > 	-> very good
> >    - it also prevent parallel ptrace
> 
> No, it doesn't. Only PTRACE_ATTACH needs this mutex, and as Roland
> pointed out it also needs write_lock(tasklist) which is worse. So
> this change doesn't make any practical harm for ptrace.

I see, thanks.

> 
> > 2) move in_exec_mm to signal_struct too
> >    -> very hard. oom-killer can use very few lock because it's called
> >       from various place. now both ->mm and ->in_exec_mm are protected
> >       task_lock() and it help to avoid messy.
> 
> Yes. But, if ->in_exec_mm is only used by oom_badness(), then I think
> you can use task_lock(tsk->group_leader). oom_badness() needs tasklist
> anyway, this means it can't race with de_thread() changing the leader.
> But up to you.

Good idea. will fix.

> 
> Another very minor nit (but again, up to you). Perhaps exec_mmap()
> could clear ->in_exec_mm (in task_struct or signal_struct, this doesnt
> matter), it takes task_lock(current) anyway (and at this point current
> is always the group leader).

Thanks. will fix.


> 
> > Let's move ->cred_guard_mutex from task_struct to signal_struct. It
> > naturally prevent multiple-threads-inside-exec.
> 
> Reviewed-by: Oleg Nesterov <oleg@redhat.com>
> 
> 
> This is very minor, but perhaps you can also fix a couple of comments
> which mention task->cred_guard_mutex,
> 
> 	fs/exec.c:1109		the caller must hold current->cred_guard_mutex
> 	kernel/cred.c:328	The caller must hold current->cred_guard_mutex
> 	include/linux/tracehook.h:153	@task->cred_guard_mutex

Will fix, of cource.

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH 2/2] execve: check the VM has enough memory at first
  2010-09-14  1:52                 ` KOSAKI Motohiro
@ 2010-09-16  5:51                   ` KOSAKI Motohiro
  -1 siblings, 0 replies; 81+ messages in thread
From: KOSAKI Motohiro @ 2010-09-16  5:51 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: kosaki.motohiro, Roland McGrath, Andrew Morton, linux-kernel,
	oss-security, Solar Designer, Kees Cook, Al Viro, Oleg Nesterov,
	Neil Horman, linux-fsdevel, pageexec,
	Brad Spengler <spender@grsecurity.net>,
	Eugene Teo, KAMEZAWA Hiroyuki

> > On Wed, Sep 8, 2010 at 10:04 PM, KOSAKI Motohiro
> > <kosaki.motohiro@jp.fujitsu.com> wrote:
> > >
> > > After this patch, execve() expand stack at first and receive to
> > > check vm_enough_memory() properly. then, too long argument of
> > > execve() than the machine memory return EFAULT properly.
> > 
> > This is horrible. We don't want to walk the arguments one more time
> > just for this. Let's just improve the checks that we do as we go
> > along.
> > 
> >                             Linus
> 
> Okey. I'll consider new way in this night.

After while thinking, I decided to just drop this idea. because
 1) If one pass check is must, we can't reuse vm-overcommit check.
 2) Glibc has the duplicated hueristic, then we can't change it nor
    introduce new hard limit. (Sh*t)
 3) This is not must fix, it only mitigate a pain when accidental large
    argv case. Only OOM fixes enough care intended attack case.
 4) distro can change default of rlim_max of RLIMIT_STACK. It protect
    from RLIM_INFINITY smash.

Briefly says, to introduce new limit has bad benefit/risk balance. Sadly.




^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH 2/2] execve: check the VM has enough memory at first
@ 2010-09-16  5:51                   ` KOSAKI Motohiro
  0 siblings, 0 replies; 81+ messages in thread
From: KOSAKI Motohiro @ 2010-09-16  5:51 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: kosaki.motohiro, Roland McGrath, Andrew Morton, linux-kernel,
	oss-security, Solar Designer, Kees Cook, Al Viro, Oleg Nesterov,
	Neil Horman, linux-fsdevel, pageexec,
	Brad Spengler <spender@grsecurity.net>,
	Eugene Teo, KAMEZAWA Hiroyuki

> > On Wed, Sep 8, 2010 at 10:04 PM, KOSAKI Motohiro
> > <kosaki.motohiro@jp.fujitsu.com> wrote:
> > >
> > > After this patch, execve() expand stack at first and receive to
> > > check vm_enough_memory() properly. then, too long argument of
> > > execve() than the machine memory return EFAULT properly.
> > 
> > This is horrible. We don't want to walk the arguments one more time
> > just for this. Let's just improve the checks that we do as we go
> > along.
> > 
> >                             Linus
> 
> Okey. I'll consider new way in this night.

After while thinking, I decided to just drop this idea. because
 1) If one pass check is must, we can't reuse vm-overcommit check.
 2) Glibc has the duplicated hueristic, then we can't change it nor
    introduce new hard limit. (Sh*t)
 3) This is not must fix, it only mitigate a pain when accidental large
    argv case. Only OOM fixes enough care intended attack case.
 4) distro can change default of rlim_max of RLIMIT_STACK. It protect
    from RLIM_INFINITY smash.

Briefly says, to introduce new limit has bad benefit/risk balance. Sadly.




^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH 2/2] execve: check the VM has enough memory at first
  2010-09-16  5:51                   ` KOSAKI Motohiro
@ 2010-09-16 15:01                     ` Linus Torvalds
  -1 siblings, 0 replies; 81+ messages in thread
From: Linus Torvalds @ 2010-09-16 15:01 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Roland McGrath, Andrew Morton, linux-kernel, oss-security,
	Solar Designer, Kees Cook, Al Viro, Oleg Nesterov, Neil Horman,
	linux-fsdevel, pageexec,
	Brad Spengler <spender@grsecurity.net>,
	Eugene Teo, KAMEZAWA Hiroyuki

2010/9/15 KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>:
>
> Briefly says, to introduce new limit has bad benefit/risk balance. Sadly.

Well, I mostly agree. That said, I do think we could extend the
limiter some ways.

For example, I think the "stack limit / 4" is perfectly sane, but it
would make total sense to perhaps also take into account the AS and
RSS limits.

And I do think that your attempt to use __vm_enough_memory() was good.
It happens to be coded in a way that makes it useless for a one-pass
model, and some of what it does would be too expensive to do up-front
when you can't short-circuit it, but I do think that it would probably
be appropriate to at least try to take the _rough_ code there and use
it as a limit for maximum stack size too.

For example, we could have a function somewhat like

    unsigned long max_stack_size(void)
   {
        unsigned long allowed, used, limit;

        switch (sysctl_overcommit_memory) {
        case OVERCOMMIT_ALWAYS:
                allowed = ULONG_MAX;
                break;
        case OVERCOMMIT_GUESS:
                .. maybe we can come up with some upper bound here too ..
                break;
        default:
                allowed = (totalram_pages - hugetlb_total_pages())
                        * sysctl_overcommit_ratio / 100;
                if (!cap_sys_admin)
                        allowed -= allowed / 32;
                allowed += total_swap_pages;
                /* Don't let a single process grow too big:
                   leave 3% of the size of this process for other processes */
                if (mm)
                        allowed -= mm->total_vm / 32;
                /* What is already committed to? */
                used = percpu_counter_read_positive(&vm_committed_as);
                if (used > allowed)
                        return 0;
                allowed -= used;
                break;
        }
        limit = ACCESS_ONCE(rlim[RLIMIT_STACK].rlim_cur) / 4;
        if (allowed > limit)
                allowed = limit;
        return allowed;
    }

which we'd call once at the beginning of the execve(), and then
remember that result and use it instead of the current 'rlimit/4'
value.

Now, admittedly the OVERCOMMIT_GUESS case is the interesting one, and
the one that is hard to write efficiently. But maybe we could make
'nr_free_pages()' cheap enough that doin that whole OVERCOMMIT_GUESS
"approximate free pages" thing from __vm_enough_memory would work out
too?

I dunno. It doesn't look hopeless.

                      Linus

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH 2/2] execve: check the VM has enough memory at first
@ 2010-09-16 15:01                     ` Linus Torvalds
  0 siblings, 0 replies; 81+ messages in thread
From: Linus Torvalds @ 2010-09-16 15:01 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Roland McGrath, Andrew Morton, linux-kernel, oss-security,
	Solar Designer, Kees Cook, Al Viro, Oleg Nesterov, Neil Horman,
	linux-fsdevel, pageexec,
	Brad Spengler <spender@grsecurity.net>,
	Eugene Teo, KAMEZAWA Hiroyuki

2010/9/15 KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>:
>
> Briefly says, to introduce new limit has bad benefit/risk balance. Sadly.

Well, I mostly agree. That said, I do think we could extend the
limiter some ways.

For example, I think the "stack limit / 4" is perfectly sane, but it
would make total sense to perhaps also take into account the AS and
RSS limits.

And I do think that your attempt to use __vm_enough_memory() was good.
It happens to be coded in a way that makes it useless for a one-pass
model, and some of what it does would be too expensive to do up-front
when you can't short-circuit it, but I do think that it would probably
be appropriate to at least try to take the _rough_ code there and use
it as a limit for maximum stack size too.

For example, we could have a function somewhat like

    unsigned long max_stack_size(void)
   {
        unsigned long allowed, used, limit;

        switch (sysctl_overcommit_memory) {
        case OVERCOMMIT_ALWAYS:
                allowed = ULONG_MAX;
                break;
        case OVERCOMMIT_GUESS:
                .. maybe we can come up with some upper bound here too ..
                break;
        default:
                allowed = (totalram_pages - hugetlb_total_pages())
                        * sysctl_overcommit_ratio / 100;
                if (!cap_sys_admin)
                        allowed -= allowed / 32;
                allowed += total_swap_pages;
                /* Don't let a single process grow too big:
                   leave 3% of the size of this process for other processes */
                if (mm)
                        allowed -= mm->total_vm / 32;
                /* What is already committed to? */
                used = percpu_counter_read_positive(&vm_committed_as);
                if (used > allowed)
                        return 0;
                allowed -= used;
                break;
        }
        limit = ACCESS_ONCE(rlim[RLIMIT_STACK].rlim_cur) / 4;
        if (allowed > limit)
                allowed = limit;
        return allowed;
    }

which we'd call once at the beginning of the execve(), and then
remember that result and use it instead of the current 'rlimit/4'
value.

Now, admittedly the OVERCOMMIT_GUESS case is the interesting one, and
the one that is hard to write efficiently. But maybe we could make
'nr_free_pages()' cheap enough that doin that whole OVERCOMMIT_GUESS
"approximate free pages" thing from __vm_enough_memory would work out
too?

I dunno. It doesn't look hopeless.

                      Linus

^ permalink raw reply	[flat|nested] 81+ messages in thread

end of thread, other threads:[~2010-09-16 15:02 UTC | newest]

Thread overview: 81+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-08-27 22:02 [PATCH] exec argument expansion can inappropriately trigger OOM-killer Kees Cook
2010-08-30  0:19 ` KOSAKI Motohiro
2010-08-30  0:56 ` Roland McGrath
2010-08-30  3:23   ` Solar Designer
2010-08-30  3:23     ` Solar Designer
2010-08-30 10:06     ` Roland McGrath
2010-08-30 19:48       ` Solar Designer
2010-08-31  0:40         ` Roland McGrath
2010-09-08  2:34       ` [PATCH 0/3] execve argument-copying fixes Roland McGrath
2010-09-08  2:34         ` Roland McGrath
2010-09-08  2:35         ` [PATCH 1/3] setup_arg_pages: diagnose excessive argument size Roland McGrath
2010-09-08  2:35           ` Roland McGrath
2010-09-08  8:29           ` pageexec
2010-09-08  8:29             ` pageexec
2010-09-10  8:59             ` Roland McGrath
2010-09-10  8:59               ` Roland McGrath
2010-09-11 13:30               ` pageexec
2010-09-11 13:30                 ` pageexec-Y8qEzhMunLyT9ig0jae3mg
2010-09-14 19:33                 ` Roland McGrath
2010-09-14 22:35                   ` pageexec
2010-09-14 22:35                     ` pageexec-Y8qEzhMunLyT9ig0jae3mg
2010-09-08 11:57           ` Brad Spengler
2010-09-08 11:57             ` Brad Spengler
2010-09-09  5:31             ` KOSAKI Motohiro
2010-09-09  5:31               ` KOSAKI Motohiro
2010-09-10  9:25               ` Roland McGrath
2010-09-10  9:25                 ` Roland McGrath
2010-09-10  9:43                 ` KOSAKI Motohiro
2010-09-11 13:39                 ` pageexec
2010-09-11 13:39                   ` pageexec-Y8qEzhMunLyT9ig0jae3mg
2010-09-14 18:51                   ` Roland McGrath
2010-09-14 18:51                     ` Roland McGrath
2010-09-14 20:28                     ` pageexec
2010-09-14 20:28                       ` pageexec-Y8qEzhMunLyT9ig0jae3mg
2010-09-14 21:16                       ` Roland McGrath
2010-09-14 21:16                         ` Roland McGrath
2010-09-14 22:27                         ` pageexec
2010-09-14 22:27                           ` pageexec-Y8qEzhMunLyT9ig0jae3mg
2010-09-14 23:04                           ` Roland McGrath
2010-09-14 23:04                             ` Roland McGrath
2010-09-15  9:27                             ` pageexec
2010-09-15  9:27                               ` pageexec-Y8qEzhMunLyT9ig0jae3mg
2010-09-10  9:18             ` Roland McGrath
2010-09-10  9:18               ` Roland McGrath
2010-09-08  2:36         ` [PATCH 2/3] execve: improve interactivity with large arguments Roland McGrath
2010-09-08  2:36           ` Roland McGrath
2010-09-08  2:37         ` [PATCH 3/3] execve: make responsive to SIGKILL " Roland McGrath
2010-09-08  2:37           ` Roland McGrath
2010-09-08  3:00         ` [PATCH 0/3] execve argument-copying fixes KOSAKI Motohiro
2010-09-08  3:00           ` KOSAKI Motohiro
2010-09-09  5:01         ` [PATCH 0/2] execve memory exhaust of " KOSAKI Motohiro
2010-09-09  5:01           ` KOSAKI Motohiro
2010-09-09  5:03           ` [PATCH 1/2] oom: don't ignore rss in nascent mm KOSAKI Motohiro
2010-09-09  5:03             ` KOSAKI Motohiro
2010-09-09 22:05             ` Oleg Nesterov
2010-09-09 22:05               ` Oleg Nesterov
2010-09-10  9:39               ` Roland McGrath
2010-09-10  9:39                 ` Roland McGrath
2010-09-10  9:57               ` [PATCH] move cred_guard_mutex from task_struct to signal_struct KOSAKI Motohiro
2010-09-10  9:57                 ` KOSAKI Motohiro
2010-09-10 17:24                 ` Oleg Nesterov
2010-09-10 17:24                   ` Oleg Nesterov
2010-09-16  5:51                   ` KOSAKI Motohiro
2010-09-16  5:51                     ` KOSAKI Motohiro
2010-09-09  5:04           ` [PATCH 2/2] execve: check the VM has enough memory at first KOSAKI Motohiro
2010-09-09  5:04             ` KOSAKI Motohiro
2010-09-10 15:06             ` Linus Torvalds
2010-09-10 15:06               ` Linus Torvalds
2010-09-14  1:52               ` KOSAKI Motohiro
2010-09-14  1:52                 ` KOSAKI Motohiro
2010-09-16  5:51                 ` KOSAKI Motohiro
2010-09-16  5:51                   ` KOSAKI Motohiro
2010-09-16 15:01                   ` Linus Torvalds
2010-09-16 15:01                     ` Linus Torvalds
2010-08-30 17:49     ` [PATCH] exec argument expansion can inappropriately trigger OOM-killer Solar Designer
2010-08-30 17:49       ` Solar Designer
2010-08-30 22:08       ` Brad Spengler
2010-08-30 22:08         ` Brad Spengler
2010-08-31 11:53         ` Solar Designer
2010-08-31 11:53           ` Solar Designer
2010-08-31 11:56         ` [PATCH] exec argument expansion can inappropriately triggerOOM-killer Tetsuo Handa

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.