Re: Regression: commit da029c11e6b1 broke toybox xargs.

From: Linus Torvalds <torvalds@linux-foundation.org>
To: Rob Landley <rob@landley.net>
Cc: Kees Cook <keescook@chromium.org>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	toybox@lists.landley.net, enh@gmail.com
Subject: Re: Regression: commit da029c11e6b1 broke toybox xargs.
Date: Fri, 3 Nov 2017 18:07:39 -0700	[thread overview]
Message-ID: <CA+55aFyvL3Y3XE4t6O0-HVNTuzFwqAvZUW-6=HX7m4bf+qpJ1w@mail.gmail.com> (raw)
In-Reply-To: <0b3a9bd0-3046-cdab-cfee-0ca45ee64e8d@landley.net>

On Fri, Nov 3, 2017 at 4:58 PM, Rob Landley <rob@landley.net> wrote:
> On 11/02/2017 10:40 AM, Linus Torvalds wrote:
>
> But it boils down to "got the limit wrong, the exec failed after the
> fork(), dynamic recovery from which is awkward so I'm trying to figure
> out the right limit".

Well, the thing is, you would only get the limit wrong if your
RLIMIT_STACK is set to some insane value.

>> Ahh. I should have read that email more carefully. If xargs broke,
>> that _will_ break actual scripts, yes. Do you actually set the stack
>> limit to insane values? Anybody using toybox really shouldn't be doing
>> 32MB stacks.
>
> Toybox is the default command line of android since M, which went 64 bit
> in L, and the Pixel 2 phone has 4 gigs of ram.

So?

My desktop has 32GB of ram, and is running a distro that sets the
kernel configuration to MAXSMP because the distro people don't want to
have multiple kernels, and some peoople run it on big hardware with
terabytes of RAM and thousands of cores.

And yet, on that distro, I do:

    [torvalds@i7 linux]$ ulimit -s
    8192

ie the stack limit hasn't been increased from the default 8MB.

So that whole "let's make the stack limit crazy" is actually the core
problem in your particular equation.

If you have a sane stack limit (anything less than 24MB), you'd not
have seen the xargs issue.

That said, _SC_ARG_MAX really is badly defined. In many ways, 128k is
still the correct limit, not because it's the historical one, but
because it is MAX_ARG_STRLEN.

It's the biggest single string we allow (although strictly speaking,
it's not really 128kB, it's 32*PAGE_SIZE, for historical reasons.

So there simply isn't a single limit, and never has been. The
traditional value is 128kB, then for a while we didn't have any limit
at all, then we did that RLIMIT_STACK/4 (but only for the strings),
then we did RLIMIT_STACK/4 (but taking the pointers into account too),
and then we did that "limit it to at most three quarters of
_RLIM_STK")

I suspect we _do_ have to raise that limit, because clearly this is a
regression, but I absolutely _detest_ the fact that a stupid
_embedded_ OS thinks that it should have a bigger stack limit than
stuff that runs on supercomputers.

That just makes me go "there's something seriously wrong".

> My problem here is it's hard to figure out what exec size the limit
> _is_. There's a sysconf(_SC_ARG_MAX) which bionic and glibc are
> currently returning as stack_limit/4, which is now too big and exec()
> will error out after the fork. Musl is returning the 131072 limit from
> 2011-ish, meaning "/bin/echo $(printf '%0*d' 131071)" works but
> "printf '%0*d' 131071 | xargs" fails, an inconsistency I was trying to
> avoid. Maybe I don't have that luxury...

Honestly, lots of the POSIX SC limits are questionable.

In this case, _SC_ARG_MAX is garbage because it's simply not even
well-defined. It really is 32*PAGE_SIZE if all you have is one single
long argument, because that's the largest single string we accept.
Make it one byte bigger, and we'll return E2BIG, as you found out.

But at the same time, it can clearly also be 6MB, since that's what we
accept if the stack limit are big enough, and yes, it used to be even
bigger.

For something like "xargs", I'm actually really saddened by the stupid
decision to think it's a single value. The whole and *only* reason for
xargs to exist is to just get it right, and the natural thing for
xargs to do would be to not ask, but simply try to do the whole thing,
and if you get E2BIG, you decide to split it in half or something
until it works. That kind of approach would just make it work
_without_ depending on some magic value.

The fact that apparently xargs is too stupid to do that, and instead
requires _SC_ARG_MAX to magically give it the "One True Value(tm)" is
just all kinds of crap.

Oh well. Enough ranting.

What _is_ the stack limit when using toybox? Is it just entirely unlimited?

> Should I just go back to hardwiring in 131072? It's no _less_ arbitrary
> than 10 megs, and it sounds like getting it _right_ is unachievable.

So in a perfect world, nobody should use that value.

But we can certainly change the kernel behavior back too.

But you realize that then we still would limit suid binaries, and now
your "xargs" would suddenly work with normal binaries, but break if
it's a suid binary?

So it would certainly just be nicer if toybox had a sane stack limit
and none of this would matter.

                Linus