linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Regression: commit da029c11e6b1 broke toybox xargs.
@ 2017-11-01 23:34 Rob Landley
  2017-11-02  3:30 ` Kees Cook
       [not found] ` <CA+55aFyw74DcPygS=SB0d-Fufz3j73zTVp2UXUUOUt4=1_He=Q@mail.gmail.com>
  0 siblings, 2 replies; 15+ messages in thread
From: Rob Landley @ 2017-11-01 23:34 UTC (permalink / raw)
  To: Kees Cook, linux-kernel, toybox

Toybox has been trying to figure out how big an xargs is allowed to be
for a while:


http://lists.landley.net/pipermail/toybox-landley.net/2017-October/009186.html

We're trying to avoid the case where you can run something from the
command line, but not through xargs. In theory this limit is
sysconf(_SC_ARG_MAX) which on bionic and glibc returns 1/4 RLIMIT_STACK
(in accordance with <strike>the prophecy</strike> fs/exec.c function
get_arg_page()), but that turns out to be too simple. There's also a
131071 byte limit on each _individual_ argument, which I think I've
tracked down to fs/exec.c function setup_arg_pages() doing:

        stack_expand = 131072UL; /* randomly 32*4k (or 2*64k) pages *

And then it worked under ubuntu 14.04 but not current kernels. Why?
Because the above commit from Kees Cook broke it, by taking this:

include/uapi/linux/resource.h:
/*
 * Limit the stack by to some sane default: root can always
 * increase this limit if needed..  8MB seems reasonable.
 */
#define _STK_LIM        (8*1024*1024)

And hardwiring in a random adjustment as a "640k ought to be enough for
anybody" constant on TOP of the existing RLIMIT_STACK/4 check. Without
even adjusting the "oh of course root can make this bigger, this is just
a default value" comment where it's #defined.

Look, if you want to cap RLIMIT_STACK for suid binaries, go for it. The
existing code will notice and adapt. But this new commit is crazy and
arbitrary and introduces more random version dependencies (how is
sysconf() supposed to know the value, an #if/else staircase based on
kernel version in every libc)?

Please revert it,

Rob

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Regression: commit da029c11e6b1 broke toybox xargs.
  2017-11-01 23:34 Regression: commit da029c11e6b1 broke toybox xargs Rob Landley
@ 2017-11-02  3:30 ` Kees Cook
       [not found] ` <CA+55aFyw74DcPygS=SB0d-Fufz3j73zTVp2UXUUOUt4=1_He=Q@mail.gmail.com>
  1 sibling, 0 replies; 15+ messages in thread
From: Kees Cook @ 2017-11-02  3:30 UTC (permalink / raw)
  To: Linus Torvalds, Rob Landley; +Cc: linux-kernel, toybox, Andy Lutomirski

On Wed, Nov 1, 2017 at 4:34 PM, Rob Landley <rob@landley.net> wrote:
> Toybox has been trying to figure out how big an xargs is allowed to be
> for a while:
>
> http://lists.landley.net/pipermail/toybox-landley.net/2017-October/009186.html
>
> We're trying to avoid the case where you can run something from the
> command line, but not through xargs. In theory this limit is
> sysconf(_SC_ARG_MAX) which on bionic and glibc returns 1/4 RLIMIT_STACK
> (in accordance with <strike>the prophecy</strike> fs/exec.c function
> get_arg_page()), but that turns out to be too simple. There's also a
> 131071 byte limit on each _individual_ argument, which I think I've
> tracked down to fs/exec.c function setup_arg_pages() doing:
>
>         stack_expand = 131072UL; /* randomly 32*4k (or 2*64k) pages *
>
> And then it worked under ubuntu 14.04 but not current kernels. Why?
> Because the above commit from Kees Cook broke it, by taking this:
>
> include/uapi/linux/resource.h:
> /*
>  * Limit the stack by to some sane default: root can always
>  * increase this limit if needed..  8MB seems reasonable.
>  */
> #define _STK_LIM        (8*1024*1024)
>
> And hardwiring in a random adjustment as a "640k ought to be enough for
> anybody" constant on TOP of the existing RLIMIT_STACK/4 check. Without
> even adjusting the "oh of course root can make this bigger, this is just
> a default value" comment where it's #defined.
>
> Look, if you want to cap RLIMIT_STACK for suid binaries, go for it. The
> existing code will notice and adapt. But this new commit is crazy and
> arbitrary and introduces more random version dependencies (how is
> sysconf() supposed to know the value, an #if/else staircase based on
> kernel version in every libc)?
>
> Please revert it,

Hi Linus,

This is a report of userspace breakage due to:

commit da029c11e6b1 ("exec: Limit arg stack to at most 75% of _STK_LIM")

As a reminder to earlier discussions[1], it had been suggested that
this be setuid only, but you had asked that this be globally applied:

On Mon, Jul 10, 2017 at 11:24 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> But honestly, a security limit that isn't tested in normal working is
> not a security limit at all, it's just theory and likely bullshit. So
> I'd much rather *not* make it suid-specific if at all possible. That
> way it has some chance in hell of actually getting tested.

We're going to need to revisit this. One alternative that was
suggested by Andy was to do a late "how much stack space was used?"
check after arg processing was finished. This could be attached to a
secureexec test to limit the checks for pathological conditions only
to setuid processes.

Rob, thanks for the report! Can you confirm that reverting the above
commit fixes the problem? There is also

commit 98da7d08850f ("fs/exec.c: account for argv/envp pointers")

which changes the calculation slightly too. If _SC_ARG_MAX is
hardcoded in bionic and glibc as 1/4 RLIMIT_STACK, we may need to
adjust this commit as well, since it will be a problem for giant
argument lists of very short strings.

Thanks,

-Kees

[1] https://lkml.org/lkml/2017/7/10/633

-- 
Kees Cook
Pixel Security

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Regression: commit da029c11e6b1 broke toybox xargs.
       [not found] ` <CA+55aFyw74DcPygS=SB0d-Fufz3j73zTVp2UXUUOUt4=1_He=Q@mail.gmail.com>
@ 2017-11-02 15:40   ` Linus Torvalds
  2017-11-03 23:58     ` Rob Landley
  0 siblings, 1 reply; 15+ messages in thread
From: Linus Torvalds @ 2017-11-02 15:40 UTC (permalink / raw)
  To: Rob Landley; +Cc: Kees Cook, Linux Kernel Mailing List, toybox

On Wed, Nov 1, 2017 at 9:28 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Behavior changed. Things that test particular limits will get different
> results. That's not breakage.
>
> Did an actual user application or script break?

Ahh. I should have read that email more carefully. If xargs broke,
that _will_ break actual scripts, yes. Do you actually set the stack
limit to insane values? Anybody using toybox really shouldn't be doing
32MB stacks.

So I still do wonder if this actually breaks anything real, or just a
test-suite or something?

                   Linus

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Regression: commit da029c11e6b1 broke toybox xargs.
  2017-11-02 15:40   ` Linus Torvalds
@ 2017-11-03 23:58     ` Rob Landley
  2017-11-04  0:03       ` [Toybox] " enh
                         ` (2 more replies)
  0 siblings, 3 replies; 15+ messages in thread
From: Rob Landley @ 2017-11-03 23:58 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Kees Cook, Linux Kernel Mailing List, toybox, enh

On 11/02/2017 10:40 AM, Linus Torvalds wrote:
> On Wed, Nov 1, 2017 at 9:28 PM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>>
>> Behavior changed. Things that test particular limits will get different
>> results. That's not breakage.
>>
>> Did an actual user application or script break?

Only due to getting the limit wrong. The actual failure's in the android
internal bugzilla I've never been able to read:

http://lists.landley.net/pipermail/toybox-landley.net/2017-September/009167.html

But it boils down to "got the limit wrong, the exec failed after the
fork(), dynamic recovery from which is awkward so I'm trying to figure
out the right limit".

> Ahh. I should have read that email more carefully. If xargs broke,
> that _will_ break actual scripts, yes. Do you actually set the stack
> limit to insane values? Anybody using toybox really shouldn't be doing
> 32MB stacks.

Toybox is the default command line of android since M, which went 64 bit
in L, and the Pixel 2 phone has 4 gigs of ram. My goal with toybox is to
turn android into a self-hosting development environment no longer
cross-compiled from a PC (http://landley.net/talks/celf-2013.txt) so I'm
trying to implement a command line that can run the entire AOSP build.

I.E. I have no idea what people will do with it, and try not to get in
their way.

My problem here is it's hard to figure out what exec size the limit
_is_. There's a sysconf(_SC_ARG_MAX) which bionic and glibc are
currently returning as stack_limit/4, which is now too big and exec()
will error out after the fork. Musl is returning the 131072 limit from
2011-ish, meaning "/bin/echo $(printf '%0*d' 131071)" works but
"printf '%0*d' 131071 | xargs" fails, an inconsistency I was trying to
avoid. Maybe I don't have that luxury...

Each argument has its own limit separate from the argv+envp total limit,
but there's only one "size" you can query through sysconf, so the
querying API is insufficient at the design level.

Meanwhile under bash you can allocate and dirty 256 megabytes from the
command line with:

  echo $(printf '%0*d' $((1<<28)))

Because it's a shell builtin so there's no actual exec. (And if
https://sourceware.org/bugzilla/show_bug.cgi?id=17829 ever gets fixed
it'll go back to allowing INT_MAX.)

Posix is its usual helpful self, read conservatively
http://pubs.opengroup.org/onlinepubs/9699919799/utilities/xargs.html
says to break the line at 2048 bytes.

> So I still do wonder if this actually breaks anything real, or just a
> test-suite or something?

I've cc'd Elliott, who would know. (He's the Android base os userspace
maintainer, he knows everything. Or can at least decode
http://b/65818597 .)

But this just broke my _fix_, not the earlier deployed stuff. I removed
the size measuring code when the 131072 limit went away, the bug was
there's a new limit I need to not hit, I tried to figure out what the
limit is now, confirmed that the various libc implementations don't
agree, then the actual kernel limit changed again while I was looking at it.

>                    Linus

Should I just go back to hardwiring in 131072? It's no _less_ arbitrary
than 10 megs, and it sounds like getting it _right_ is unachievable.

Thanks,

Rob

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [Toybox] Regression: commit da029c11e6b1 broke toybox xargs.
  2017-11-03 23:58     ` Rob Landley
@ 2017-11-04  0:03       ` enh
  2017-11-04  0:42       ` Kees Cook
  2017-11-04  1:07       ` Linus Torvalds
  2 siblings, 0 replies; 15+ messages in thread
From: enh @ 2017-11-04  0:03 UTC (permalink / raw)
  To: Rob Landley
  Cc: Linus Torvalds, enh, Kees Cook, toybox, Linux Kernel Mailing List

On Fri, Nov 3, 2017 at 4:58 PM, Rob Landley <rob@landley.net> wrote:
> On 11/02/2017 10:40 AM, Linus Torvalds wrote:
>> On Wed, Nov 1, 2017 at 9:28 PM, Linus Torvalds
>> <torvalds@linux-foundation.org> wrote:
>>>
>>> Behavior changed. Things that test particular limits will get different
>>> results. That's not breakage.
>>>
>>> Did an actual user application or script break?
>
> Only due to getting the limit wrong. The actual failure's in the android
> internal bugzilla I've never been able to read:
>
> http://lists.landley.net/pipermail/toybox-landley.net/2017-September/009167.html

there was nothing secret or interesting in the original bug report.
just someone trying to use find/xargs on the whole file system. here's
a copy & paste:

marlin:/ # find / /data -xdev -type f ! -name \*.jpg -exec grep -Fl
belkin.a {} +
find: exec grep: Argument list too long

marlin:/ # find / /data -xdev -type f ! -name \*.jpg -print | xargs
grep -Fl belkin.a
xargs: exec grep: Argument list too long

> But it boils down to "got the limit wrong, the exec failed after the
> fork(), dynamic recovery from which is awkward so I'm trying to figure
> out the right limit".
>
>> Ahh. I should have read that email more carefully. If xargs broke,
>> that _will_ break actual scripts, yes. Do you actually set the stack
>> limit to insane values? Anybody using toybox really shouldn't be doing
>> 32MB stacks.
>
> Toybox is the default command line of android since M, which went 64 bit
> in L, and the Pixel 2 phone has 4 gigs of ram. My goal with toybox is to
> turn android into a self-hosting development environment no longer
> cross-compiled from a PC (http://landley.net/talks/celf-2013.txt) so I'm
> trying to implement a command line that can run the entire AOSP build.
>
> I.E. I have no idea what people will do with it, and try not to get in
> their way.
>
> My problem here is it's hard to figure out what exec size the limit
> _is_. There's a sysconf(_SC_ARG_MAX) which bionic and glibc are
> currently returning as stack_limit/4, which is now too big and exec()
> will error out after the fork. Musl is returning the 131072 limit from
> 2011-ish, meaning "/bin/echo $(printf '%0*d' 131071)" works but
> "printf '%0*d' 131071 | xargs" fails, an inconsistency I was trying to
> avoid. Maybe I don't have that luxury...
>
> Each argument has its own limit separate from the argv+envp total limit,
> but there's only one "size" you can query through sysconf, so the
> querying API is insufficient at the design level.
>
> Meanwhile under bash you can allocate and dirty 256 megabytes from the
> command line with:
>
>   echo $(printf '%0*d' $((1<<28)))
>
> Because it's a shell builtin so there's no actual exec. (And if
> https://sourceware.org/bugzilla/show_bug.cgi?id=17829 ever gets fixed
> it'll go back to allowing INT_MAX.)
>
> Posix is its usual helpful self, read conservatively
> http://pubs.opengroup.org/onlinepubs/9699919799/utilities/xargs.html
> says to break the line at 2048 bytes.
>
>> So I still do wonder if this actually breaks anything real, or just a
>> test-suite or something?
>
> I've cc'd Elliott, who would know. (He's the Android base os userspace
> maintainer, he knows everything. Or can at least decode
> http://b/65818597 .)
>
> But this just broke my _fix_, not the earlier deployed stuff. I removed
> the size measuring code when the 131072 limit went away, the bug was
> there's a new limit I need to not hit, I tried to figure out what the
> limit is now, confirmed that the various libc implementations don't
> agree, then the actual kernel limit changed again while I was looking at it.
>
>>                    Linus
>
> Should I just go back to hardwiring in 131072? It's no _less_ arbitrary
> than 10 megs, and it sounds like getting it _right_ is unachievable.
>
> Thanks,
>
> Rob
>
>
> _______________________________________________
> Toybox mailing list
> Toybox@lists.landley.net
> http://lists.landley.net/listinfo.cgi/toybox-landley.net



-- 
Elliott Hughes - http://who/enh - http://jessies.org/~enh/
Android native code/tools questions? Mail me/drop by/add me as a reviewer.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Regression: commit da029c11e6b1 broke toybox xargs.
  2017-11-03 23:58     ` Rob Landley
  2017-11-04  0:03       ` [Toybox] " enh
@ 2017-11-04  0:42       ` Kees Cook
  2017-11-04  1:22         ` Linus Torvalds
  2017-11-04  1:07       ` Linus Torvalds
  2 siblings, 1 reply; 15+ messages in thread
From: Kees Cook @ 2017-11-04  0:42 UTC (permalink / raw)
  To: Rob Landley; +Cc: Linus Torvalds, Linux Kernel Mailing List, toybox, enh

On Fri, Nov 3, 2017 at 4:58 PM, Rob Landley <rob@landley.net> wrote:
> But this just broke my _fix_, not the earlier deployed stuff. I removed
> the size measuring code when the 131072 limit went away, the bug was
> there's a new limit I need to not hit, I tried to figure out what the
> limit is now, confirmed that the various libc implementations don't
> agree, then the actual kernel limit changed again while I was looking at it.

In the fix you landed (to include env in size calculations -- which
didn't change recently), you also included the pointer size itself in
the calculation, so

commit 98da7d08850f ("fs/exec.c: account for argv/envp pointers")

should be handled.

If we didn't do the "but no more than 75% of _STK_LIM", and moved to
something like "check stack utilization after loading the binary", we
end up in the position where the kernel is past the point of no return
(so instead of E2BIG, the execve()ing process just SEGVs), which is
much harder to debug or recover from (i.e. there's no process left to
return from the execve() from). We could, however, limit that behavior
to setuid processes? I'm open to whatever Linus says here.

FWIW, I have a lightly tested alternative here (should be visible shortly):
https://git.kernel.org/pub/scm/linux/kernel/git/kees/linux.git/log/?h=kspp/stack-size

-Kees

-- 
Kees Cook
Pixel Security

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Regression: commit da029c11e6b1 broke toybox xargs.
  2017-11-03 23:58     ` Rob Landley
  2017-11-04  0:03       ` [Toybox] " enh
  2017-11-04  0:42       ` Kees Cook
@ 2017-11-04  1:07       ` Linus Torvalds
  2017-11-05  0:39         ` Rob Landley
  2 siblings, 1 reply; 15+ messages in thread
From: Linus Torvalds @ 2017-11-04  1:07 UTC (permalink / raw)
  To: Rob Landley; +Cc: Kees Cook, Linux Kernel Mailing List, toybox, enh

On Fri, Nov 3, 2017 at 4:58 PM, Rob Landley <rob@landley.net> wrote:
> On 11/02/2017 10:40 AM, Linus Torvalds wrote:
>
> But it boils down to "got the limit wrong, the exec failed after the
> fork(), dynamic recovery from which is awkward so I'm trying to figure
> out the right limit".

Well, the thing is, you would only get the limit wrong if your
RLIMIT_STACK is set to some insane value.

>> Ahh. I should have read that email more carefully. If xargs broke,
>> that _will_ break actual scripts, yes. Do you actually set the stack
>> limit to insane values? Anybody using toybox really shouldn't be doing
>> 32MB stacks.
>
> Toybox is the default command line of android since M, which went 64 bit
> in L, and the Pixel 2 phone has 4 gigs of ram.

So?

My desktop has 32GB of ram, and is running a distro that sets the
kernel configuration to MAXSMP because the distro people don't want to
have multiple kernels, and some peoople run it on big hardware with
terabytes of RAM and thousands of cores.

And yet, on that distro, I do:

    [torvalds@i7 linux]$ ulimit -s
    8192

ie the stack limit hasn't been increased from the default 8MB.

So that whole "let's make the stack limit crazy" is actually the core
problem in your particular equation.

If you have a sane stack limit (anything less than 24MB), you'd not
have seen the xargs issue.

That said, _SC_ARG_MAX really is badly defined. In many ways, 128k is
still the correct limit, not because it's the historical one, but
because it is MAX_ARG_STRLEN.

It's the biggest single string we allow (although strictly speaking,
it's not really 128kB, it's 32*PAGE_SIZE, for historical reasons.

So there simply isn't a single limit, and never has been. The
traditional value is 128kB, then for a while we didn't have any limit
at all, then we did that RLIMIT_STACK/4 (but only for the strings),
then we did RLIMIT_STACK/4 (but taking the pointers into account too),
and then we did that "limit it to at most three quarters of
_RLIM_STK")

I suspect we _do_ have to raise that limit, because clearly this is a
regression, but I absolutely _detest_ the fact that a stupid
_embedded_ OS thinks that it should have a bigger stack limit than
stuff that runs on supercomputers.

That just makes me go "there's something seriously wrong".

> My problem here is it's hard to figure out what exec size the limit
> _is_. There's a sysconf(_SC_ARG_MAX) which bionic and glibc are
> currently returning as stack_limit/4, which is now too big and exec()
> will error out after the fork. Musl is returning the 131072 limit from
> 2011-ish, meaning "/bin/echo $(printf '%0*d' 131071)" works but
> "printf '%0*d' 131071 | xargs" fails, an inconsistency I was trying to
> avoid. Maybe I don't have that luxury...

Honestly, lots of the POSIX SC limits are questionable.

In this case, _SC_ARG_MAX is garbage because it's simply not even
well-defined. It really is 32*PAGE_SIZE if all you have is one single
long argument, because that's the largest single string we accept.
Make it one byte bigger, and we'll return E2BIG, as you found out.

But at the same time, it can clearly also be 6MB, since that's what we
accept if the stack limit are big enough, and yes, it used to be even
bigger.

For something like "xargs", I'm actually really saddened by the stupid
decision to think it's a single value. The whole and *only* reason for
xargs to exist is to just get it right, and the natural thing for
xargs to do would be to not ask, but simply try to do the whole thing,
and if you get E2BIG, you decide to split it in half or something
until it works. That kind of approach would just make it work
_without_ depending on some magic value.

The fact that apparently xargs is too stupid to do that, and instead
requires _SC_ARG_MAX to magically give it the "One True Value(tm)" is
just all kinds of crap.

Oh well. Enough ranting.

What _is_ the stack limit when using toybox? Is it just entirely unlimited?

> Should I just go back to hardwiring in 131072? It's no _less_ arbitrary
> than 10 megs, and it sounds like getting it _right_ is unachievable.

So in a perfect world, nobody should use that value.

But we can certainly change the kernel behavior back too.

But you realize that then we still would limit suid binaries, and now
your "xargs" would suddenly work with normal binaries, but break if
it's a suid binary?

So it would certainly just be nicer if toybox had a sane stack limit
and none of this would matter.

                Linus

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Regression: commit da029c11e6b1 broke toybox xargs.
  2017-11-04  0:42       ` Kees Cook
@ 2017-11-04  1:22         ` Linus Torvalds
  2017-11-04  1:37           ` Kees Cook
  0 siblings, 1 reply; 15+ messages in thread
From: Linus Torvalds @ 2017-11-04  1:22 UTC (permalink / raw)
  To: Kees Cook; +Cc: Rob Landley, Linux Kernel Mailing List, toybox, enh

On Fri, Nov 3, 2017 at 5:42 PM, Kees Cook <keescook@chromium.org> wrote:
>
> If we didn't do the "but no more than 75% of _STK_LIM", and moved to
> something like "check stack utilization after loading the binary", we
> end up in the position where the kernel is past the point of no return
> (so instead of E2BIG, the execve()ing process just SEGVs), which is
> much harder to debug or recover from (i.e. there's no process left to
> return from the execve() from).

Yeah, we've had that problem in the past, and it's the worst of all worlds.

You can still trigger it (set RLIMIT_DATA to something much too small,
for example, and then generate more than that by just repeating the
same argument multiple times so that the execve() user doesn't trigger
the limit, but the newly executed process does).

But it should really be something that you need to be truly insane to trigger.

I think we still don't know whether we're going to be suid at the time
we copy the arguments, do we?

So it's pretty painful to make the limits different for suid and
non-suid binaries.

                 Linus

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Regression: commit da029c11e6b1 broke toybox xargs.
  2017-11-04  1:22         ` Linus Torvalds
@ 2017-11-04  1:37           ` Kees Cook
  2017-11-05  1:10             ` Rob Landley
  0 siblings, 1 reply; 15+ messages in thread
From: Kees Cook @ 2017-11-04  1:37 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Rob Landley, Linux Kernel Mailing List, toybox, enh

On Fri, Nov 3, 2017 at 6:22 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Fri, Nov 3, 2017 at 5:42 PM, Kees Cook <keescook@chromium.org> wrote:
>>
>> If we didn't do the "but no more than 75% of _STK_LIM", and moved to
>> something like "check stack utilization after loading the binary", we
>> end up in the position where the kernel is past the point of no return
>> (so instead of E2BIG, the execve()ing process just SEGVs), which is
>> much harder to debug or recover from (i.e. there's no process left to
>> return from the execve() from).
>
> Yeah, we've had that problem in the past, and it's the worst of all worlds.
>
> You can still trigger it (set RLIMIT_DATA to something much too small,
> for example, and then generate more than that by just repeating the
> same argument multiple times so that the execve() user doesn't trigger
> the limit, but the newly executed process does).
>
> But it should really be something that you need to be truly insane to trigger.
>
> I think we still don't know whether we're going to be suid at the time
> we copy the arguments, do we?

We don't. (In fact, arg copying happens before we've even figured out
which binfmt is involved.) I lifted it to just before the point of no
return, but moving it before arg copying looks very hard (which
contributed to why we went with the implementation we did).

> So it's pretty painful to make the limits different for suid and
> non-suid binaries.

I would agree.

-Kees

-- 
Kees Cook
Pixel Security

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Regression: commit da029c11e6b1 broke toybox xargs.
  2017-11-04  1:07       ` Linus Torvalds
@ 2017-11-05  0:39         ` Rob Landley
  2017-11-05 20:46           ` Linus Torvalds
  2017-11-15 21:12           ` Pavel Machek
  0 siblings, 2 replies; 15+ messages in thread
From: Rob Landley @ 2017-11-05  0:39 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Kees Cook, Linux Kernel Mailing List, toybox, enh

Correcting Elliot's email to google, not gmail. (Sorry, I'm in Tokyo for
work this month, almost over the jetlag...)

On 11/03/2017 08:07 PM, Linus Torvalds wrote:
> On Fri, Nov 3, 2017 at 4:58 PM, Rob Landley <rob@landley.net> wrote:
>> On 11/02/2017 10:40 AM, Linus Torvalds wrote:
>>
>> But it boils down to "got the limit wrong, the exec failed after the
>> fork(), dynamic recovery from which is awkward so I'm trying to figure
>> out the right limit".

Sounds later like dynamic recovery is what you recommend. (Awkward
doesn't mean I can't do it.)

> I suspect we _do_ have to raise that limit, because clearly this is a
> regression, but I absolutely _detest_ the fact that a stupid
> _embedded_ OS thinks that it should have a bigger stack limit than
> stuff that runs on supercomputers.
> 
> That just makes me go "there's something seriously wrong".

This was me trying not to assume what other people will do, I think
android's default is still 8mb (it was in M) but my test systems for
this are literally on the other side of the planet right now.

Google's internal frame of reference is very different from mine. I got
pointed at a podcast (Android Developers Backstage #53) where Elliott
and another android dev talked about toybox for a few minutes in the
second half, they they shared a chuckle over my complaint that
downloading AOSP takes 150 gigabytes _before_ it tries to build
anything, and only the largest machine I own can build it at all (and
that very slowly). It was just so alien to them that this would be a
_problem_...

> For something like "xargs", I'm actually really saddened by the stupid
> decision to think it's a single value. The whole and *only* reason for
> xargs to exist is to just get it right,

Which is what I was trying very hard to do. :(

> and the natural thing for
> xargs to do would be to not ask, but simply try to do the whole thing,
> and if you get E2BIG, you decide to split it in half or something
> until it works. That kind of approach would just make it work
> _without_ depending on some magic value.
> 
> The fact that apparently xargs is too stupid to do that, and instead
> requires _SC_ARG_MAX to magically give it the "One True Value(tm)" is
> just all kinds of crap.

I'm writing this xargs, I can _make_ it do that, it just requires a pipe
back from the forked child to return status and is either slow (remove
one argument at a time) or inaccurate (cut it in half, result coulda
been longer). Either way xargs still needs an internal limit or "yes |
xargs" will try to fill all memory before ever calling exec().

The reason I wanted to support "exactly as big as possible" is that
calling a command as one invocation vs multiple invocations can change
behavior. Once you've decided to split, how BIG you split is much less
important, so falling back to an arbitrary limit would be fine except
I'd still have to check the stack size to see if it's _lower_ than that
arbitrary limit. (If you set the stack ulimit to 128k, which nommu
systems may wanna do, then the exec limit is 32k. It can be _anything_.)

And this limit is shared with environment variables so the problem might
be that your environment's pathological and you can't run this command
line with even one argument because envp ate all the space, but that's
another story and the user can wash it through env -i to make it work.
Except:

  $ env -i {A,B,C,D,E,F,G,H,I,J,K,L,M,N,O,P}=$(printf '%0*d' 130657) \
      env | wc -c

Says 2090560 (of 2097152), but 130658 says argument list too long when
it's only 16 more bytes of the ~6k we should have left (envp[]=17*8,
argc=2*8, argv[0]=4...) argc and it sounds like you're saying I should
just stop _trying_ to figure out exact up-front measurements.

So stacksize /4, then split in half each time, and if it strips down to
one argument that can't run, have an error message for that. Ok.

> Oh well. Enough ranting.
> 
> What _is_ the stack limit when using toybox? Is it just entirely unlimited?

Answer to second question on ubuntu 14.04:

  landley@driftwood:~/linux/linux/fs$ ulimit -s 999999999
  landley@driftwood:~/linux/linux/fs$ ulimit -s
  999999999

Anybody can call ulimit to expand it as a normal user, so effectively
yes it is unlimited. I have no IDEA what my users are gonna do. (If they
do something stupid it's their fault, but I don't necessarily get to say
what stupid is from here.)

Answer to first: the default is whatever I inherited from the Android
fork du jour it's running on.

The google developers seem to be drinking from a firehose of
contributions from the half-dozen phone companies trying to get code
upstream. Elliott presumably says no to what he can but they're hugely
outnumbered and there's politics I'm only dimly aware of (never having
worked for google and only having met Elliott for lunch once a couple
years ago when I was in town for ELC anyway, this is all just the
impression of an interested outside party).

Then there's more companies in China and such (like Xiaomi,
https://www.youtube.com/watch?v=fR6K1l3sfm8#t=1m38s) that probably don't
even send code back to Google (because Great Firewall) but still use
their own forks of android infrastructure which they modify the hell out of.

I can theoretically test the vanilla AOSP build du jour under qemu, but
thats like the vanilla kernel: what winds up in the hands of 99% of end
users is a fork of a fork.

Mostly I trust Elliott to point out when I've violated an Android
assumption (often via the patch comment).

(And then there's the Android Native Development Kit, which _almost_
builds toybox outside of AOSP, but will never spray it down with the
full selinux environment ala
http://lists.landley.net/pipermail/toybox-landley.net/2016-December/008772.html
so it'll build a _partial_ toybox at best, by design. I think we left
off around
http://lists.landley.net/pipermail/toybox-landley.net/2017-April/008970.html
and I should poke at it again when I get time...)

I got _into_ this trying to transplant the android build to run under
android natively (http://landley.net/toybox/#21-03-2013) and the
discussions I had with Elliott about that on the toybox list involved
creating a "posix container" under minijail. (Each android app runs
under its own UID, but a build system sort of needs a UID range, so he
went off to teach android apps about this new concept and I got
distracted by $DAYJOB and we haven't gotten back to it since...) Such a
container would presumably have its own environment setup, and "is 8
megs the right stack size for that" is question I never asked. Right now
AOSP can expand it as needed deep in some nested Ninja file unless they
added an selinux rule to stop it.

tl;dr You ask hard questions, I have no idea.

>> Should I just go back to hardwiring in 131072? It's no _less_ arbitrary
>> than 10 megs, and it sounds like getting it _right_ is unachievable.
> 
> So in a perfect world, nobody should use that value.
> 
> But we can certainly change the kernel behavior back too.
> 
> But you realize that then we still would limit suid binaries, and now
> your "xargs" would suddenly work with normal binaries, but break if
> it's a suid binary?

It sounds like "probe and recover" is my best option.

> So it would certainly just be nicer if toybox had a sane stack limit
> and none of this would matter.

That's like saying coreutils should have a sane stack limit. It's
command line utilities, not an OS. I inherit prefs and live with them.

Android has its own init subsystem that sprays the world down with
selinux rules before anything else gets to run, meaning the AOSP build
annotates the toybox binary it installs with enough extended attributes
to choke a cow in order for things like "ps" to work. All the other
android builds are forks off AOSP (the Android Open Source Project),
which is the "upstream vanilla" for that distro. It is a GIANT HAIRBALL
and dismantling it enough to figure out what it actually does is a todo
item for me, and I haven't even found time to watch all of
https://www.youtube.com/watch?v=dEKYZUgorWQ yet...

But I support standard Linux _too_, so right now I mostly develop on
vanilla and then Elliott tells me when I screwed up. :)

>                 Linus
Rob

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Regression: commit da029c11e6b1 broke toybox xargs.
  2017-11-04  1:37           ` Kees Cook
@ 2017-11-05  1:10             ` Rob Landley
  0 siblings, 0 replies; 15+ messages in thread
From: Rob Landley @ 2017-11-05  1:10 UTC (permalink / raw)
  To: Kees Cook, Linus Torvalds; +Cc: Linux Kernel Mailing List, toybox, enh

On 11/03/2017 08:37 PM, Kees Cook wrote:
> We don't. (In fact, arg copying happens before we've even figured out
> which binfmt is involved.) I lifted it to just before the point of no
> return, but moving it before arg copying looks very hard (which
> contributed to why we went with the implementation we did).
> 
>> So it's pretty painful to make the limits different for suid and
>> non-suid binaries.
> 
> I would agree.

I think I know what to implement for toybox now: xargs should trust
libc's sysconf() to provide the common-case starting limit (subtracting
env space) then implement the fallback pipe-from-child thing to
iteratively try half the argument list when that fails.

Elliott's even cc'd so he can update bionic's sysconf for the new 10 meg
thing from the title commit. :)

Rob

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Regression: commit da029c11e6b1 broke toybox xargs.
  2017-11-05  0:39         ` Rob Landley
@ 2017-11-05 20:46           ` Linus Torvalds
  2017-11-15 22:10             ` enh
  2017-11-15 21:12           ` Pavel Machek
  1 sibling, 1 reply; 15+ messages in thread
From: Linus Torvalds @ 2017-11-05 20:46 UTC (permalink / raw)
  To: Rob Landley; +Cc: Kees Cook, Linux Kernel Mailing List, toybox, enh

On Sat, Nov 4, 2017 at 5:39 PM, Rob Landley <rob@landley.net> wrote:
> On 11/03/2017 08:07 PM, Linus Torvalds wrote:
>>>
>>> But it boils down to "got the limit wrong, the exec failed after the
>>> fork(), dynamic recovery from which is awkward so I'm trying to figure
>>> out the right limit".
>
> Sounds later like dynamic recovery is what you recommend. (Awkward
> doesn't mean I can't do it.)

So my preferred solution would certainly be for xargs to realize that
_SC_ARG_MAX is at most an "informed guess" rather than some absolute
value.

Or, alternatively, simply use a _SC_ARG_MAX that is small enough that
we can say "yes, that's what we've always supported". That, in
practice, is the 128kB value.

It's probably also big enough that nobody cares any more. Sure, you
*could* feed bigger arguments, and we do end up basically supporting
it because people who don't write proper scripts will simply do

    grep some-value $(find . -name '*.c')

and that works pretty well if your code-base isn't humongous.

It still works for the kernel, for example, but there's no question
that it's still not something we *guarantee* works.

It doesn't work if you don't limit it to *.c files:

  [torvalds@i7 linux]$ grep some-value $(find .)
  -bash: /usr/bin/grep: Argument list too long

so when the kernel expanded the argument list from the traditional
128kB, it was for _convenience_, it was not meant for people who do
proper scripting and use xargs.

See the difference?

>> What _is_ the stack limit when using toybox? Is it just entirely unlimited?
>
> Answer to second question on ubuntu 14.04:
>
>   landley@driftwood:~/linux/linux/fs$ ulimit -s 999999999
>   landley@driftwood:~/linux/linux/fs$ ulimit -s
>   999999999

Oh, I can do that on Fedora too.

But the thing is, nobody actually seems to do that.

And our regression rule has never been "behavior doesn't change".
That would mean that we could never make any changes at all.

For example, we do things like add new error handling etc all the
time, which we then sometimes even add tests for in our kselftest
directory.

So clearly behavior changes all the time and we don't consider that a
regression per se.

The rule for a regression for the kernel is that some real user
workflow breaks. Not some test. Not a "look, I used to be able to do
X, now I can't".

So that's why this whole "is this a test or a real workload" question
is important to me, and what odd RLIMIT_STACK people actually use.

I'm not interested in "can you reproduce it", because that's simply
not the issue for us from a regression standpoint.

That said, I tried this under Fedora:

  ulimit -s unlimited
  find / -print0 2> /dev/null | xargs -0 /usr/bin/echo | wc

and it reported 991 lines. That would match just using 128kB as the
breaking point for xargs.

So apparently at least Fedora doesn't seem to use that "stack limit /
4" thing, but the traditional 128kB value.

I don't know if that is because of 'xargs', or because of library
differences - I didn't check.

And I have a hard time judging whether the regression report is a
"look, this behavior changed", or a "we actually have a real
regression visible to actual users".

See above why that matters to me.

> It sounds like "probe and recover" is my best option.

I actually suspect "just use 128kB" is the actual best option in practice.

Sure, "probe and recover" is the best option in theory, but it's a lot
more complex, and there doesn't actually seem to be a lot of upside.

So to take my silly example of piping "find /" to xargs: I can't
imagine that anybody sane should care really whether it results in
900+ execve's (for that 128kB limit) or just 20 (for some "optimal"
6MB limit).

And there is actually a somewhat real fear with the
"probe-and-recover" model: the already mentioned nasty case "sometimes
we don't return E2BIG, we might just succeed the execve, but then kill
the process very early due to some _other_ limit being triggered".

That nasty case really shouldn't happen, but it was the case we
considered (and rejected) for suid binaries.

So that's the real reason the kernel behavior changed: we had a
security issue with the "we allow basically unlimited stack growth"
where a huge argv/envp could be used to grow the stack into some other
segment.

Realistically, it was only a security issue with suid binaries, but as
mentioned in this thread, the problem is that we do the argument
copying even before we've decided whether the execve is going to be a
suid execve.

So then - exactly so that we do *not* have that nasty case of "execve
succeeds, but we kill the process immediately if it turns into a
potential security issue", we do that "limit the stack growth for
everybody".

                Linus

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Regression: commit da029c11e6b1 broke toybox xargs.
  2017-11-05  0:39         ` Rob Landley
  2017-11-05 20:46           ` Linus Torvalds
@ 2017-11-15 21:12           ` Pavel Machek
  1 sibling, 0 replies; 15+ messages in thread
From: Pavel Machek @ 2017-11-15 21:12 UTC (permalink / raw)
  To: Rob Landley
  Cc: Linus Torvalds, Kees Cook, Linux Kernel Mailing List, toybox, enh

[-- Attachment #1: Type: text/plain, Size: 838 bytes --]

Hi!

> Google's internal frame of reference is very different from mine. I got
> pointed at a podcast (Android Developers Backstage #53) where Elliott
> and another android dev talked about toybox for a few minutes in the
> second half, they they shared a chuckle over my complaint that
> downloading AOSP takes 150 gigabytes _before_ it tries to build
> anything, and only the largest machine I own can build it at all (and
> that very slowly). It was just so alien to them that this would be a
> _problem_...

Heh. Some time ago, I tried to build AOSP... not realizing it would be
a problem. (Its an open source project, right?) Fortunately I gave up
quite soon...

									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Regression: commit da029c11e6b1 broke toybox xargs.
  2017-11-05 20:46           ` Linus Torvalds
@ 2017-11-15 22:10             ` enh
  2017-11-15 22:45               ` Linus Torvalds
  0 siblings, 1 reply; 15+ messages in thread
From: enh @ 2017-11-15 22:10 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Rob Landley, Kees Cook, Linux Kernel Mailing List, toybox

On Sun, Nov 5, 2017 at 12:46 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Sat, Nov 4, 2017 at 5:39 PM, Rob Landley <rob@landley.net> wrote:
>> On 11/03/2017 08:07 PM, Linus Torvalds wrote:
>>>>
>>>> But it boils down to "got the limit wrong, the exec failed after the
>>>> fork(), dynamic recovery from which is awkward so I'm trying to figure
>>>> out the right limit".
>>
>> Sounds later like dynamic recovery is what you recommend. (Awkward
>> doesn't mean I can't do it.)
>
> So my preferred solution would certainly be for xargs to realize that
> _SC_ARG_MAX is at most an "informed guess" rather than some absolute
> value.
>
> Or, alternatively, simply use a _SC_ARG_MAX that is small enough that
> we can say "yes, that's what we've always supported". That, in
> practice, is the 128kB value.
>
> It's probably also big enough that nobody cares any more. Sure, you
> *could* feed bigger arguments, and we do end up basically supporting
> it because people who don't write proper scripts will simply do
>
>     grep some-value $(find . -name '*.c')
>
> and that works pretty well if your code-base isn't humongous.
>
> It still works for the kernel, for example, but there's no question
> that it's still not something we *guarantee* works.
>
> It doesn't work if you don't limit it to *.c files:
>
>   [torvalds@i7 linux]$ grep some-value $(find .)
>   -bash: /usr/bin/grep: Argument list too long
>
> so when the kernel expanded the argument list from the traditional
> 128kB, it was for _convenience_, it was not meant for people who do
> proper scripting and use xargs.
>
> See the difference?
>
>>> What _is_ the stack limit when using toybox? Is it just entirely unlimited?
>>
>> Answer to second question on ubuntu 14.04:
>>
>>   landley@driftwood:~/linux/linux/fs$ ulimit -s 999999999
>>   landley@driftwood:~/linux/linux/fs$ ulimit -s
>>   999999999
>
> Oh, I can do that on Fedora too.
>
> But the thing is, nobody actually seems to do that.
>
> And our regression rule has never been "behavior doesn't change".
> That would mean that we could never make any changes at all.
>
> For example, we do things like add new error handling etc all the
> time, which we then sometimes even add tests for in our kselftest
> directory.
>
> So clearly behavior changes all the time and we don't consider that a
> regression per se.
>
> The rule for a regression for the kernel is that some real user
> workflow breaks. Not some test. Not a "look, I used to be able to do
> X, now I can't".
>
> So that's why this whole "is this a test or a real workload" question
> is important to me, and what odd RLIMIT_STACK people actually use.

Android's default RLIMIT_STACK is 8192.

the specific bug report was from an interactive shell user. i'm not
aware of any part of the platform itself that relies on this, nor any
third-party app.

> I'm not interested in "can you reproduce it", because that's simply
> not the issue for us from a regression standpoint.
>
> That said, I tried this under Fedora:
>
>   ulimit -s unlimited
>   find / -print0 2> /dev/null | xargs -0 /usr/bin/echo | wc
>
> and it reported 991 lines. That would match just using 128kB as the
> breaking point for xargs.
>
> So apparently at least Fedora doesn't seem to use that "stack limit /
> 4" thing, but the traditional 128kB value.
>
> I don't know if that is because of 'xargs', or because of library
> differences - I didn't check.
>
> And I have a hard time judging whether the regression report is a
> "look, this behavior changed", or a "we actually have a real
> regression visible to actual users".
>
> See above why that matters to me.
>
>> It sounds like "probe and recover" is my best option.
>
> I actually suspect "just use 128kB" is the actual best option in practice.

for libc's sysconf(_SC_ARG_MAX) too? i'm fine changing bionic back to
reporting 128KiB if there's an lkml "Linus says" mail that i can link
to in the comment. it certainly seems like an overly-conservative
choice is better than the current situation...

> Sure, "probe and recover" is the best option in theory, but it's a lot
> more complex, and there doesn't actually seem to be a lot of upside.
>
> So to take my silly example of piping "find /" to xargs: I can't
> imagine that anybody sane should care really whether it results in
> 900+ execve's (for that 128kB limit) or just 20 (for some "optimal"
> 6MB limit).
>
> And there is actually a somewhat real fear with the
> "probe-and-recover" model: the already mentioned nasty case "sometimes
> we don't return E2BIG, we might just succeed the execve, but then kill
> the process very early due to some _other_ limit being triggered".
>
> That nasty case really shouldn't happen, but it was the case we
> considered (and rejected) for suid binaries.
>
> So that's the real reason the kernel behavior changed: we had a
> security issue with the "we allow basically unlimited stack growth"
> where a huge argv/envp could be used to grow the stack into some other
> segment.
>
> Realistically, it was only a security issue with suid binaries, but as
> mentioned in this thread, the problem is that we do the argument
> copying even before we've decided whether the execve is going to be a
> suid execve.
>
> So then - exactly so that we do *not* have that nasty case of "execve
> succeeds, but we kill the process immediately if it turns into a
> potential security issue", we do that "limit the stack growth for
> everybody".
>
>                 Linus



-- 
Elliott Hughes - http://who/enh - http://jessies.org/~enh/
Android native code/tools questions? Mail me/drop by/add me as a reviewer.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Regression: commit da029c11e6b1 broke toybox xargs.
  2017-11-15 22:10             ` enh
@ 2017-11-15 22:45               ` Linus Torvalds
  0 siblings, 0 replies; 15+ messages in thread
From: Linus Torvalds @ 2017-11-15 22:45 UTC (permalink / raw)
  To: enh; +Cc: Rob Landley, Kees Cook, Linux Kernel Mailing List, toybox

On Wed, Nov 15, 2017 at 2:10 PM, enh <enh@google.com> wrote:
>>
>> I actually suspect "just use 128kB" is the actual best option in practice.
>
> for libc's sysconf(_SC_ARG_MAX) too? i'm fine changing bionic back to
> reporting 128KiB if there's an lkml "Linus says" mail that i can link
> to in the comment. it certainly seems like an overly-conservative
> choice is better than the current situation...

I suspect a 128kB sysconf(_SC_ARG_MAX) is the sanest bet, simply
because of that "conservative is better than aggressive".

Especially since _technically_ we're still limiting things to that
128kB due to the single-string limit.

              Linus

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2017-11-15 22:45 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-11-01 23:34 Regression: commit da029c11e6b1 broke toybox xargs Rob Landley
2017-11-02  3:30 ` Kees Cook
     [not found] ` <CA+55aFyw74DcPygS=SB0d-Fufz3j73zTVp2UXUUOUt4=1_He=Q@mail.gmail.com>
2017-11-02 15:40   ` Linus Torvalds
2017-11-03 23:58     ` Rob Landley
2017-11-04  0:03       ` [Toybox] " enh
2017-11-04  0:42       ` Kees Cook
2017-11-04  1:22         ` Linus Torvalds
2017-11-04  1:37           ` Kees Cook
2017-11-05  1:10             ` Rob Landley
2017-11-04  1:07       ` Linus Torvalds
2017-11-05  0:39         ` Rob Landley
2017-11-05 20:46           ` Linus Torvalds
2017-11-15 22:10             ` enh
2017-11-15 22:45               ` Linus Torvalds
2017-11-15 21:12           ` Pavel Machek

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).