linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/1] expand_downwards: don't require the gap if !vm_prev
@ 2017-06-28 17:52 Oleg Nesterov
  2017-06-28 17:52 ` [PATCH 1/1] " Oleg Nesterov
  2017-06-28 23:26 ` [PATCH 0/1] " Linus Torvalds
  0 siblings, 2 replies; 15+ messages in thread
From: Oleg Nesterov @ 2017-06-28 17:52 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Larry Woodman, Linus Torvalds, Michal Hocko, linux-kernel

See the patch, but actually I have another question...

Now that the stack-guard-page has gone, why do we need to allow to grow
into the previous VM_GROWSDOWN vma? IOW, why we can not simply remove
the VM_GROWSDOWN check in expand_downwards() ?

Yes, this is what the kernel did before the recent changes. But afaics
only because the kernel could not know if the vma->vm_start page is
actually guard or not.

IOW, iiuc before the recent change it was not simple to _disallow_ this,
and that is why it worked. Just for example, suppose an application does

	addr = mmap(MAP_GROWSDOWN);
	mprotect(addr, PAGE_SIZE, PROT_NONE);
	*(addr + PAGE_SIZE) = 0;

and of course this should not fail.

But the the kernel could not know if vm_start == addr + PAGE_SIZE is the
"valid" address, or this vma was expanded before and vm_start is the stack
guard.

Yes, we can probably check anon_vma's as the comment suggests, but imo we
we can just remove the VM_GROWSDOWN case unconditionally.

Oleg.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH 1/1] expand_downwards: don't require the gap if !vm_prev
  2017-06-28 17:52 [PATCH 0/1] expand_downwards: don't require the gap if !vm_prev Oleg Nesterov
@ 2017-06-28 17:52 ` Oleg Nesterov
  2017-06-30 13:16   ` Michal Hocko
  2017-06-28 23:26 ` [PATCH 0/1] " Linus Torvalds
  1 sibling, 1 reply; 15+ messages in thread
From: Oleg Nesterov @ 2017-06-28 17:52 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Larry Woodman, Linus Torvalds, Michal Hocko, linux-kernel

expand_stack(vma) fails if address < stack_guard_gap even if there is no
vma->vm_prev. I don't think this makes sense, and we didn't do this before
the recent commit 1be7107fbe18 ("mm: larger stack guard gap, between vmas").
We do not need a gap in this case, any address is fine as long as
security_mmap_addr() doesn't object.

This also simplifies the code, we know that address >= prev->vm_end and
thus underflow is not possible.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 mm/mmap.c | 10 +++-------
 1 file changed, 3 insertions(+), 7 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index 8e07976..5a8bd97 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2310,7 +2310,6 @@ int expand_downwards(struct vm_area_struct *vma,
 {
 	struct mm_struct *mm = vma->vm_mm;
 	struct vm_area_struct *prev;
-	unsigned long gap_addr;
 	int error;
 
 	address &= PAGE_MASK;
@@ -2319,14 +2318,11 @@ int expand_downwards(struct vm_area_struct *vma,
 		return error;
 
 	/* Enforce stack_guard_gap */
-	gap_addr = address - stack_guard_gap;
-	if (gap_addr > address)
-		return -ENOMEM;
 	prev = vma->vm_prev;
-	if (prev && prev->vm_end > gap_addr) {
-		if (!(prev->vm_flags & VM_GROWSDOWN))
+	/* Check that both stack segments have the same anon_vma? */
+	if (prev && !(prev->vm_flags & VM_GROWSDOWN)) {
+		if (address - prev->vm_end < stack_guard_gap)
 			return -ENOMEM;
-		/* Check that both stack segments have the same anon_vma? */
 	}
 
 	/* We must make sure the anon_vma is allocated. */
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH 0/1] expand_downwards: don't require the gap if !vm_prev
  2017-06-28 17:52 [PATCH 0/1] expand_downwards: don't require the gap if !vm_prev Oleg Nesterov
  2017-06-28 17:52 ` [PATCH 1/1] " Oleg Nesterov
@ 2017-06-28 23:26 ` Linus Torvalds
  2017-06-29 15:19   ` Oleg Nesterov
  2017-06-30 13:24   ` Michal Hocko
  1 sibling, 2 replies; 15+ messages in thread
From: Linus Torvalds @ 2017-06-28 23:26 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Hugh Dickins, Andrew Morton, Larry Woodman, Michal Hocko,
	Linux Kernel Mailing List

On Wed, Jun 28, 2017 at 10:52 AM, Oleg Nesterov <oleg@redhat.com> wrote:
>
> Now that the stack-guard-page has gone, why do we need to allow to grow
> into the previous VM_GROWSDOWN vma? IOW, why we can not simply remove
> the VM_GROWSDOWN check in expand_downwards() ?

Because the "prev" vma may actually be the original vma.

I think I described it in an earlier thread, but what happened at
least once was:

 - program has some part that uses a lot of stack for part of the
execution for some temp buffer or deep recursion or whatever

 - somebody noticed this, and decided to free up the no-longer-used
pages by doing a "munmap()" after the program was done with that part
of the stack

 - but the "munmap()" wasn't complete (maybe it only accounted for the
explicitly used buffer, whatever), so the munmap actually didn't just
remove the no-longer used bottom of the stack, it actually split the
stack segment into two (with a small remaining stack turd that was the
*real* bottom of the deep stack that used to exist)

 - you do want to be able to grow the stack back to what it was, and
it needs to be able to meet that old turd without triggering the
"oops, guard area!" logic.

As to your patch: I would prefer to actually keep the new failure
behavior of unconditionally breaking a big stack expansion), unless
there's an actual thing it breaks.

In fact, I'd even be quite open to adding a kernel warning about badly
behaved binaries that grow their stack by a big amount in one go. Not
only is it bad taste (and we really should encourage compilers to do
probing every page when growing the stack), but it migth be a sign of
an attempted attack.

                Linus

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 0/1] expand_downwards: don't require the gap if !vm_prev
  2017-06-28 23:26 ` [PATCH 0/1] " Linus Torvalds
@ 2017-06-29 15:19   ` Oleg Nesterov
  2017-06-29 18:21     ` Linus Torvalds
  2017-06-30 13:24   ` Michal Hocko
  1 sibling, 1 reply; 15+ messages in thread
From: Oleg Nesterov @ 2017-06-29 15:19 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Hugh Dickins, Andrew Morton, Larry Woodman, Michal Hocko,
	Linux Kernel Mailing List

On 06/28, Linus Torvalds wrote:
>
> On Wed, Jun 28, 2017 at 10:52 AM, Oleg Nesterov <oleg@redhat.com> wrote:
> >
> > Now that the stack-guard-page has gone, why do we need to allow to grow
> > into the previous VM_GROWSDOWN vma? IOW, why we can not simply remove
> > the VM_GROWSDOWN check in expand_downwards() ?
>
> Because the "prev" vma may actually be the original vma.
>
> I think I described it in an earlier thread, but what happened at
> least once was:
>
>  - program has some part that uses a lot of stack for part of the
> execution for some temp buffer or deep recursion or whatever
>
>  - somebody noticed this, and decided to free up the no-longer-used
> pages by doing a "munmap()" after the program was done with that part
> of the stack
>
>  - but the "munmap()" wasn't complete (maybe it only accounted for the
> explicitly used buffer, whatever), so the munmap actually didn't just
> remove the no-longer used bottom of the stack, it actually split the
> stack segment into two (with a small remaining stack turd that was the
> *real* bottom of the deep stack that used to exist)

Ah, OK, thanks...



> As to your patch: I would prefer to actually keep the new failure
> behavior of unconditionally breaking a big stack expansion), unless
> there's an actual thing it breaks.

Hmm. May be you misread this patch? Or I misunderstood.

> In fact, I'd even be quite open to adding a kernel warning about badly
> behaved binaries that grow their stack by a big amount in one go.

Yes, but this is another story.

Currently expand_downwards(address) does

	if (address < stack_guard_gap)
		return -ENOMEM;

This has nothing to do with "by how much it needs to grow", this simply
forbids the bottom of stack below stack_guard_gap. Why?

I don't think this patch can make any difference in practice, it just
tries to make this logic more consistent/understandable.

For example. Suppose that stack_guard_gap = 1M (default). Now,

	addr = 512K;  // any addr <= stack_guard_gap;
	char *stack = mmap(addr, MAP_FIXED|MAP_GROWSDOWN, PAGE_SIZE);

	*stack = 0;
	stack -= PAGE_SIZE;
	*stack = 0;

The first store will always succeed, the 2nd one will always fail even
if (likely) there is no another vma below. This looks strange to me.

Oleg.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 0/1] expand_downwards: don't require the gap if !vm_prev
  2017-06-29 15:19   ` Oleg Nesterov
@ 2017-06-29 18:21     ` Linus Torvalds
  2017-06-29 18:55       ` Oleg Nesterov
  0 siblings, 1 reply; 15+ messages in thread
From: Linus Torvalds @ 2017-06-29 18:21 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Hugh Dickins, Andrew Morton, Larry Woodman, Michal Hocko,
	Linux Kernel Mailing List

On Thu, Jun 29, 2017 at 8:19 AM, Oleg Nesterov <oleg@redhat.com> wrote:
>
> Hmm. May be you misread this patch?

Ahh, yes. I'm ok with your patch.

That said, you did remove something extra: the comment about

    /* Check that both stack segments have the same anon_vma? */

is actually still relevant wrt that VM_GROWSDOWN test. The issue is
that we could actually limit the VM_GROWSDOWN thing to only be ok with
merging with a previous vma only if it *really* was the same segment.

And I think we could do that by checking the anon-vma (or maybe the vm_offset?)

                  Linus

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 0/1] expand_downwards: don't require the gap if !vm_prev
  2017-06-29 18:21     ` Linus Torvalds
@ 2017-06-29 18:55       ` Oleg Nesterov
  2017-06-29 19:00         ` Linus Torvalds
  0 siblings, 1 reply; 15+ messages in thread
From: Oleg Nesterov @ 2017-06-29 18:55 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Hugh Dickins, Andrew Morton, Larry Woodman, Michal Hocko,
	Linux Kernel Mailing List

On 06/29, Linus Torvalds wrote:
>
> On Thu, Jun 29, 2017 at 8:19 AM, Oleg Nesterov <oleg@redhat.com> wrote:
> >
> > Hmm. May be you misread this patch?
>
> Ahh, yes. I'm ok with your patch.
>
> That said, you did remove something extra: the comment about
>
>     /* Check that both stack segments have the same anon_vma? */

I didn't ;) I moved it up, right above VM_GROWSDOWN check.


> is actually still relevant wrt that VM_GROWSDOWN test. The issue is
> that we could actually limit the VM_GROWSDOWN thing to only be ok with
> merging with a previous vma only if it *really* was the same segment.

Yes, yes, this is clear. This comment motivated me to ask that question,
I thought that we probably do not need to reconcile the stacks even in
this case.

Oleg.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 0/1] expand_downwards: don't require the gap if !vm_prev
  2017-06-29 18:55       ` Oleg Nesterov
@ 2017-06-29 19:00         ` Linus Torvalds
  0 siblings, 0 replies; 15+ messages in thread
From: Linus Torvalds @ 2017-06-29 19:00 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Hugh Dickins, Andrew Morton, Larry Woodman, Michal Hocko,
	Linux Kernel Mailing List

On Thu, Jun 29, 2017 at 11:55 AM, Oleg Nesterov <oleg@redhat.com> wrote:
>
> I didn't ;) I moved it up, right above VM_GROWSDOWN check.

I think I will just need to take a long nap, I'm clearly not tracking
the patches very well.

                 Linus

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/1] expand_downwards: don't require the gap if !vm_prev
  2017-06-28 17:52 ` [PATCH 1/1] " Oleg Nesterov
@ 2017-06-30 13:16   ` Michal Hocko
  0 siblings, 0 replies; 15+ messages in thread
From: Michal Hocko @ 2017-06-30 13:16 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Hugh Dickins, Andrew Morton, Larry Woodman, Linus Torvalds, linux-kernel

On Wed 28-06-17 19:52:58, Oleg Nesterov wrote:
> expand_stack(vma) fails if address < stack_guard_gap even if there is no
> vma->vm_prev. I don't think this makes sense, and we didn't do this before
> the recent commit 1be7107fbe18 ("mm: larger stack guard gap, between vmas").
> We do not need a gap in this case, any address is fine as long as
> security_mmap_addr() doesn't object.
> 
> This also simplifies the code, we know that address >= prev->vm_end and
> thus underflow is not possible.
> 
> Signed-off-by: Oleg Nesterov <oleg@redhat.com>

Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  mm/mmap.c | 10 +++-------
>  1 file changed, 3 insertions(+), 7 deletions(-)
> 
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 8e07976..5a8bd97 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -2310,7 +2310,6 @@ int expand_downwards(struct vm_area_struct *vma,
>  {
>  	struct mm_struct *mm = vma->vm_mm;
>  	struct vm_area_struct *prev;
> -	unsigned long gap_addr;
>  	int error;
>  
>  	address &= PAGE_MASK;
> @@ -2319,14 +2318,11 @@ int expand_downwards(struct vm_area_struct *vma,
>  		return error;
>  
>  	/* Enforce stack_guard_gap */
> -	gap_addr = address - stack_guard_gap;
> -	if (gap_addr > address)
> -		return -ENOMEM;

I thought this was an underflow protection. address might be still above
min_mmap address while gap_addr can underflow, that would mean that we
might not detect a mapping in that range, but

>  	prev = vma->vm_prev;
> -	if (prev && prev->vm_end > gap_addr) {
> -		if (!(prev->vm_flags & VM_GROWSDOWN))
> +	/* Check that both stack segments have the same anon_vma? */
> +	if (prev && !(prev->vm_flags & VM_GROWSDOWN)) {
> +		if (address - prev->vm_end < stack_guard_gap)

this would handle that case properly so the problem wouldn't happen.

>  			return -ENOMEM;
> -		/* Check that both stack segments have the same anon_vma? */
>  	}
>  
>  	/* We must make sure the anon_vma is allocated. */
> -- 
> 2.5.0
> 
> 

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 0/1] expand_downwards: don't require the gap if !vm_prev
  2017-06-28 23:26 ` [PATCH 0/1] " Linus Torvalds
  2017-06-29 15:19   ` Oleg Nesterov
@ 2017-06-30 13:24   ` Michal Hocko
  2017-06-30 17:08     ` Linus Torvalds
  1 sibling, 1 reply; 15+ messages in thread
From: Michal Hocko @ 2017-06-30 13:24 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Oleg Nesterov, Hugh Dickins, Andrew Morton, Larry Woodman,
	Linux Kernel Mailing List

On Wed 28-06-17 16:26:33, Linus Torvalds wrote:
[...]
> In fact, I'd even be quite open to adding a kernel warning about badly
> behaved binaries that grow their stack by a big amount in one go. Not
> only is it bad taste (and we really should encourage compilers to do
> probing every page when growing the stack), but it migth be a sign of
> an attempted attack.

FWIW our gcc guys shown an interest in having something to tell the
kernel how much the stack can grow at once. They want it for testing of
the new stack probing alloca implementation. I have something
preliminary with /proc/<pid>/stack_expand_limit for the internal testing
purpose but maybe there will be more interest for this. I didn't plan to
post it public because it basically duplicates the stack_gap but it is
also true that it can help some applications which won't use the proposed
__save_alloca (or whatever it will be) without increasing stack_gap too
much.

What do you think?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 0/1] expand_downwards: don't require the gap if !vm_prev
  2017-06-30 13:24   ` Michal Hocko
@ 2017-06-30 17:08     ` Linus Torvalds
  2017-06-30 17:26       ` Michal Hocko
  0 siblings, 1 reply; 15+ messages in thread
From: Linus Torvalds @ 2017-06-30 17:08 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Oleg Nesterov, Hugh Dickins, Andrew Morton, Larry Woodman,
	Linux Kernel Mailing List

On Fri, Jun 30, 2017 at 6:24 AM, Michal Hocko <mhocko@kernel.org> wrote:
>
> FWIW our gcc guys shown an interest in having something to tell the
> kernel how much the stack can grow at once. They want it for testing of
> the new stack probing alloca implementation.

Here, I made this just for them:

   #define STACK_GROWTH_SIZE (4096)

isn't that beautiful? A new kernel interface without some stupid sysfs
file for it.

And the added advantage is that it compiles to nice dense code too.

> I have something
> preliminary with /proc/<pid>/stack_expand_limit for the internal testing
> purpose but maybe there will be more interest for this. I

NO NO NO.

I absolutely refuse to see the stack gap as some kind of "this is how
much you can grow the stack without probing".

That's complete and utter garbage.

Tell them that they can grow the stack by 4kB. That's it. If they want
to get all fancy, and they say that they really want an
architecture-specific value, tell them to use getpagesize().

The stack gap is there due to the ABI being broken. If we're fixing
the ABI, then the stack gap has *nothing* to add.

Don't encourage shit.

                    Linus

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 0/1] expand_downwards: don't require the gap if !vm_prev
  2017-06-30 17:08     ` Linus Torvalds
@ 2017-06-30 17:26       ` Michal Hocko
  2017-06-30 17:48         ` Linus Torvalds
  0 siblings, 1 reply; 15+ messages in thread
From: Michal Hocko @ 2017-06-30 17:26 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Oleg Nesterov, Hugh Dickins, Andrew Morton, Larry Woodman,
	Linux Kernel Mailing List

On Fri 30-06-17 10:08:03, Linus Torvalds wrote:
> On Fri, Jun 30, 2017 at 6:24 AM, Michal Hocko <mhocko@kernel.org> wrote:
> >
> > FWIW our gcc guys shown an interest in having something to tell the
> > kernel how much the stack can grow at once. They want it for testing of
> > the new stack probing alloca implementation.
> 
> Here, I made this just for them:
> 
>    #define STACK_GROWTH_SIZE (4096)
> 
> isn't that beautiful? A new kernel interface without some stupid sysfs
> file for it.
> 
> And the added advantage is that it compiles to nice dense code too.
> 
> > I have something
> > preliminary with /proc/<pid>/stack_expand_limit for the internal testing
> > purpose but maybe there will be more interest for this. I
> 
> NO NO NO.
> 
> I absolutely refuse to see the stack gap as some kind of "this is how
> much you can grow the stack without probing".
> 
> That's complete and utter garbage.
> 
> Tell them that they can grow the stack by 4kB. That's it. If they want
> to get all fancy, and they say that they really want an
> architecture-specific value, tell them to use getpagesize().

Ohh, you misunderstood I guess. They wanted that only for internal
testing (e.g. make sure that everything that matters blows up if it is
doing something wrong). Absolutely nothing to base any compilator
decistion on.

> The stack gap is there due to the ABI being broken. If we're fixing
> the ABI, then the stack gap has *nothing* to add.

Yeah I know. The only usecase why I thought this might be interesting is
when you run the code you haven't compiled with a compiler which does
the proper thing. Think of all the 3rd party stuff that you eventually
_need_ to run. Then you have also a hard guess to tune your gap for. If
you can blow on/warn about unexpectedly large stack expansions then you
can protect that particular piece of SW.

But as I've said, this is not something I planned to post upstream but
you mentioning a warning made me think that I could just mention it.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 0/1] expand_downwards: don't require the gap if !vm_prev
  2017-06-30 17:26       ` Michal Hocko
@ 2017-06-30 17:48         ` Linus Torvalds
  2017-07-03 15:49           ` Michal Hocko
  0 siblings, 1 reply; 15+ messages in thread
From: Linus Torvalds @ 2017-06-30 17:48 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Oleg Nesterov, Hugh Dickins, Andrew Morton, Larry Woodman,
	Linux Kernel Mailing List

On Fri, Jun 30, 2017 at 10:26 AM, Michal Hocko <mhocko@kernel.org> wrote:
>
> Ohh, you misunderstood I guess. They wanted that only for internal
> testing (e.g. make sure that everything that matters blows up if it is
> doing something wrong). Absolutely nothing to base any compilator
> decistion on.

Oh, good.

If that's the case, I really think we should try to add some code that
checks that the stack grows strictly one page at a time, and have a
way to enable SIGSEGV if that is ever not the case.

That should be trivial to add in expand_downwards/expand_upwards.

We could make a "warn once" thing unconditional for distro testing,
but since compiler people would presumably want to test this before
the rest of the distro is clean, they'd need some rlimit or something
like that to enable it for particular processes.

Would that be ok for them?

Some prctl to get/set that "max I'm allowed to extend the stack"?

                Linus

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 0/1] expand_downwards: don't require the gap if !vm_prev
  2017-06-30 17:48         ` Linus Torvalds
@ 2017-07-03 15:49           ` Michal Hocko
  2017-07-03 16:30             ` Linus Torvalds
  0 siblings, 1 reply; 15+ messages in thread
From: Michal Hocko @ 2017-07-03 15:49 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Oleg Nesterov, Hugh Dickins, Andrew Morton, Larry Woodman,
	Linux Kernel Mailing List

On Fri 30-06-17 10:48:15, Linus Torvalds wrote:
> On Fri, Jun 30, 2017 at 10:26 AM, Michal Hocko <mhocko@kernel.org> wrote:
> >
> > Ohh, you misunderstood I guess. They wanted that only for internal
> > testing (e.g. make sure that everything that matters blows up if it is
> > doing something wrong). Absolutely nothing to base any compilator
> > decistion on.
> 
> Oh, good.
> 
> If that's the case, I really think we should try to add some code that
> checks that the stack grows strictly one page at a time, and have a
> way to enable SIGSEGV if that is ever not the case.
> 
> That should be trivial to add in expand_downwards/expand_upwards.

yes, but I would be worried to have this hardcoded. People very often do
run 3rd party code compiled with a non-default distribution compiler. I
also expect there are sensible usecases where probing on the stack could
lead to a performance degradation so people might want to explicitely
disable it.

> We could make a "warn once" thing unconditional for distro testing,
> but since compiler people would presumably want to test this before
> the rest of the distro is clean, they'd need some rlimit or something
> like that to enable it for particular processes.

yes, WARN_ONCE is not very practical to identify offenders..
 
> Would that be ok for them?
> 
> Some prctl to get/set that "max I'm allowed to extend the stack"?

prctl would be less code than proc interface I've had and quite
straightforward as well. Below is what I have ended up with for them. I
doesn't warn by default because I thought it would be too noisy without
stack probing used more widely.

If you think this is worth pursuing in upstream, just let me know and I
can polish it, add a patch for the man page and other things.
---
diff --git a/fs/exec.c b/fs/exec.c
index 904199086490..7589fb701fca 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -320,6 +320,13 @@ static int __bprm_mm_init(struct linux_binprm *bprm)
 	arch_bprm_mm_init(mm, vma);
 	up_write(&mm->mmap_sem);
 	bprm->p = vma->vm_end - sizeof(void *);
+
+	/*
+	 * We cannot set the stack expand limit now because we will do manual
+	 * stack expansions which might be larger. See setup_arg_pages
+	 */
+	if (current->mm)
+		bprm->expand_stack_limit = current->mm->expand_stack_limit;
 	return 0;
 err:
 	up_write(&mm->mmap_sem);
@@ -786,6 +793,14 @@ int setup_arg_pages(struct linux_binprm *bprm,
 	if (ret)
 		ret = -EFAULT;
 
+	/*
+	 * Stack is finilized now and so all later expansions have
+	 * to comply with the inherited limit.
+	 *
+	 * TODO: Do we want to allow non-privileged task to set
+	 * the limit for privileged ones?
+	 */
+	vma->vm_mm->expand_stack_limit = bprm->expand_stack_limit;
 out_unlock:
 	up_write(&mm->mmap_sem);
 	return ret;
diff --git a/include/linux/binfmts.h b/include/linux/binfmts.h
index 05488da3aee9..2d3d7fff4811 100644
--- a/include/linux/binfmts.h
+++ b/include/linux/binfmts.h
@@ -17,6 +17,7 @@ struct linux_binprm {
 	char buf[BINPRM_BUF_SIZE];
 #ifdef CONFIG_MMU
 	struct vm_area_struct *vma;
+	unsigned long expand_stack_limit;
 	unsigned long vma_pages;
 #else
 # define MAX_ARG_PAGES	32
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 45cdb27791a3..24ee38a45a9e 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -425,6 +425,7 @@ struct mm_struct {
 	unsigned long start_brk, brk, start_stack;
 	unsigned long arg_start, arg_end, env_start, env_end;
 
+	unsigned long expand_stack_limit;	/* how much we can grow stack at once */
 	unsigned long saved_auxv[AT_VECTOR_SIZE]; /* for /proc/PID/auxv */
 
 	/*
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index a8d0759a9e40..6f3c65530c71 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -197,4 +197,7 @@ struct prctl_mm_map {
 # define PR_CAP_AMBIENT_LOWER		3
 # define PR_CAP_AMBIENT_CLEAR_ALL	4
 
+#define PR_SET_STACK_EXPAND_LIMIT 48
+#define PR_GET_STACK_EXPAND_LIMIT 49
+
 #endif /* _LINUX_PRCTL_H */
diff --git a/kernel/sys.c b/kernel/sys.c
index 4111d584ff4a..05bdcdeda98f 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2290,6 +2290,17 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 	case PR_GET_FP_MODE:
 		error = GET_FP_MODE(me);
 		break;
+	case PR_SET_STACK_EXPAND_LIMIT:
+		if (arg3 || arg4 || arg5)
+			return -EINVAL;
+		if (down_write_killable(&me->mm->mmap_sem))
+			return -EINTR;
+		me->mm->expand_stack_limit = arg2;
+		up_write(&me->mm->mmap_sem);
+		break;
+	case PR_GET_STACK_EXPAND_LIMIT:
+		error = me->mm->expand_stack_limit;
+		break;
 	default:
 		error = -EINVAL;
 		break;
diff --git a/mm/mmap.c b/mm/mmap.c
index 5a0ba9788cdd..ff2a5981ee92 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2266,7 +2266,16 @@ int expand_upwards(struct vm_area_struct *vma, unsigned long address)
 		unsigned long size, grow;
 
 		size = address - vma->vm_start;
-		grow = (address - vma->vm_end) >> PAGE_SHIFT;
+		grow = address - vma->vm_end;
+
+		if (mm->expand_stack_limit && grow > mm->expand_stack_limit) {
+			pr_warn("%s[%d]: disallowed stack expansion %lu with limit %lu\n",
+					current->comm, task_pid_nr(current),
+					grow, mm->expand_stack_limit);
+			anon_vma_unlock_write(vma->anon_vma);
+			return -ENOMEM;
+		}
+		grow >>= PAGE_SHIFT;
 
 		error = -ENOMEM;
 		if (vma->vm_pgoff + (size >> PAGE_SHIFT) >= vma->vm_pgoff) {
@@ -2350,7 +2359,20 @@ int expand_downwards(struct vm_area_struct *vma,
 		unsigned long size, grow;
 
 		size = vma->vm_end - address;
-		grow = (vma->vm_start - address) >> PAGE_SHIFT;
+		grow = vma->vm_start - address;
+
+		/*
+		 * Make sure that a single stack expansion doesn't exceed the
+		 * configured limit.
+		 */
+		if (mm->expand_stack_limit && grow > mm->expand_stack_limit) {
+			pr_warn("%s[%d]: disallowed stack expansion %lu with limit %lu\n",
+					current->comm, task_pid_nr(current),
+					grow, mm->expand_stack_limit);
+			anon_vma_unlock_write(vma->anon_vma);
+			return -ENOMEM;
+		}
+		grow >>= PAGE_SHIFT;
 
 		error = -ENOMEM;
 		if (grow <= vma->vm_pgoff) {
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH 0/1] expand_downwards: don't require the gap if !vm_prev
  2017-07-03 15:49           ` Michal Hocko
@ 2017-07-03 16:30             ` Linus Torvalds
  2017-07-03 16:54               ` Michal Hocko
  0 siblings, 1 reply; 15+ messages in thread
From: Linus Torvalds @ 2017-07-03 16:30 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Oleg Nesterov, Hugh Dickins, Andrew Morton, Larry Woodman,
	Linux Kernel Mailing List

On Mon, Jul 3, 2017 at 8:49 AM, Michal Hocko <mhocko@kernel.org> wrote:
>
> If you think this is worth pursuing in upstream, just let me know and I
> can polish it, add a patch for the man page and other things.

Hmm. This doesn't look bad, except the bprm games there really look annoying.

Also, I'm wondering whether this should be per-thread - conceptually
"expand_stack()" really is a thread thing. All callers are using
"current", although it's not always obvious.

So I'm wondering if a slightly larger patch that simply made the
"limit" be an _argument_ to expand_stack() would clean up both of
these issues. The execve() use would simply pass in the stack limit,
and the fault users would pass in "current->expand_stack_limit".

Again, I'm not sure how many people really use multiple GROW_DOWN
stacks for threading, but it's conceptually the right thing to do, so
I think conceptually this should be per-thread. And the fact that it
might clean up the execve() thing makes me think it's the right thing
to do.

What do you think?

                 Linus

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 0/1] expand_downwards: don't require the gap if !vm_prev
  2017-07-03 16:30             ` Linus Torvalds
@ 2017-07-03 16:54               ` Michal Hocko
  0 siblings, 0 replies; 15+ messages in thread
From: Michal Hocko @ 2017-07-03 16:54 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Oleg Nesterov, Hugh Dickins, Andrew Morton, Larry Woodman,
	Linux Kernel Mailing List

On Mon 03-07-17 09:30:35, Linus Torvalds wrote:
> On Mon, Jul 3, 2017 at 8:49 AM, Michal Hocko <mhocko@kernel.org> wrote:
> >
> > If you think this is worth pursuing in upstream, just let me know and I
> > can polish it, add a patch for the man page and other things.
> 
> Hmm. This doesn't look bad, except the bprm games there really look annoying.
> 
> Also, I'm wondering whether this should be per-thread - conceptually
> "expand_stack()" really is a thread thing. All callers are using
> "current", although it's not always obvious.
> 
> So I'm wondering if a slightly larger patch that simply made the
> "limit" be an _argument_ to expand_stack() would clean up both of
> these issues. The execve() use would simply pass in the stack limit,
> and the fault users would pass in "current->expand_stack_limit".
> 
> Again, I'm not sure how many people really use multiple GROW_DOWN
> stacks for threading, but it's conceptually the right thing to do, so
> I think conceptually this should be per-thread. And the fact that it
> might clean up the execve() thing makes me think it's the right thing
> to do.
> 
> What do you think?

I am not sure about the per-thread vs. per mm part. If for nothing else,
MAP_GROWSDOWN can be something else than the main thread stack which can be
modified by all threads and then the semantic would be quite surprising
if different threads had a different idea about the expansion. No?

But an additional argument to expand_stack would surely clean things
up a bit and will get rid of the ugly bprm part as well. I will think
about it some more and then post the patch to linux-api to have a larger
audience.

Thanks!
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2017-07-03 16:54 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-06-28 17:52 [PATCH 0/1] expand_downwards: don't require the gap if !vm_prev Oleg Nesterov
2017-06-28 17:52 ` [PATCH 1/1] " Oleg Nesterov
2017-06-30 13:16   ` Michal Hocko
2017-06-28 23:26 ` [PATCH 0/1] " Linus Torvalds
2017-06-29 15:19   ` Oleg Nesterov
2017-06-29 18:21     ` Linus Torvalds
2017-06-29 18:55       ` Oleg Nesterov
2017-06-29 19:00         ` Linus Torvalds
2017-06-30 13:24   ` Michal Hocko
2017-06-30 17:08     ` Linus Torvalds
2017-06-30 17:26       ` Michal Hocko
2017-06-30 17:48         ` Linus Torvalds
2017-07-03 15:49           ` Michal Hocko
2017-07-03 16:30             ` Linus Torvalds
2017-07-03 16:54               ` Michal Hocko

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).