Re: Regression: commit da029c11e6b1 broke toybox xargs.

From: Linus Torvalds <torvalds@linux-foundation.org>
To: Rob Landley <rob@landley.net>
Cc: Kees Cook <keescook@chromium.org>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	toybox@lists.landley.net, enh@google.com
Subject: Re: Regression: commit da029c11e6b1 broke toybox xargs.
Date: Sun, 5 Nov 2017 12:46:35 -0800	[thread overview]
Message-ID: <CA+55aFzUvbGjD8nQ-+3oiMBx14c_6zOj2n7KLN3UsJ-qsd4Dcw@mail.gmail.com> (raw)
In-Reply-To: <59f9380b-fc9b-c6a5-998a-a603ef828d1d@landley.net>

On Sat, Nov 4, 2017 at 5:39 PM, Rob Landley <rob@landley.net> wrote:
> On 11/03/2017 08:07 PM, Linus Torvalds wrote:
>>>
>>> But it boils down to "got the limit wrong, the exec failed after the
>>> fork(), dynamic recovery from which is awkward so I'm trying to figure
>>> out the right limit".
>
> Sounds later like dynamic recovery is what you recommend. (Awkward
> doesn't mean I can't do it.)

So my preferred solution would certainly be for xargs to realize that
_SC_ARG_MAX is at most an "informed guess" rather than some absolute
value.

Or, alternatively, simply use a _SC_ARG_MAX that is small enough that
we can say "yes, that's what we've always supported". That, in
practice, is the 128kB value.

It's probably also big enough that nobody cares any more. Sure, you
*could* feed bigger arguments, and we do end up basically supporting
it because people who don't write proper scripts will simply do

    grep some-value $(find . -name '*.c')

and that works pretty well if your code-base isn't humongous.

It still works for the kernel, for example, but there's no question
that it's still not something we *guarantee* works.

It doesn't work if you don't limit it to *.c files:

  [torvalds@i7 linux]$ grep some-value $(find .)
  -bash: /usr/bin/grep: Argument list too long

so when the kernel expanded the argument list from the traditional
128kB, it was for _convenience_, it was not meant for people who do
proper scripting and use xargs.

See the difference?

>> What _is_ the stack limit when using toybox? Is it just entirely unlimited?
>
> Answer to second question on ubuntu 14.04:
>
>   landley@driftwood:~/linux/linux/fs$ ulimit -s 999999999
>   landley@driftwood:~/linux/linux/fs$ ulimit -s
>   999999999

Oh, I can do that on Fedora too.

But the thing is, nobody actually seems to do that.

And our regression rule has never been "behavior doesn't change".
That would mean that we could never make any changes at all.

For example, we do things like add new error handling etc all the
time, which we then sometimes even add tests for in our kselftest
directory.

So clearly behavior changes all the time and we don't consider that a
regression per se.

The rule for a regression for the kernel is that some real user
workflow breaks. Not some test. Not a "look, I used to be able to do
X, now I can't".

So that's why this whole "is this a test or a real workload" question
is important to me, and what odd RLIMIT_STACK people actually use.

I'm not interested in "can you reproduce it", because that's simply
not the issue for us from a regression standpoint.

That said, I tried this under Fedora:

  ulimit -s unlimited
  find / -print0 2> /dev/null | xargs -0 /usr/bin/echo | wc

and it reported 991 lines. That would match just using 128kB as the
breaking point for xargs.

So apparently at least Fedora doesn't seem to use that "stack limit /
4" thing, but the traditional 128kB value.

I don't know if that is because of 'xargs', or because of library
differences - I didn't check.

And I have a hard time judging whether the regression report is a
"look, this behavior changed", or a "we actually have a real
regression visible to actual users".

See above why that matters to me.

> It sounds like "probe and recover" is my best option.

I actually suspect "just use 128kB" is the actual best option in practice.

Sure, "probe and recover" is the best option in theory, but it's a lot
more complex, and there doesn't actually seem to be a lot of upside.

So to take my silly example of piping "find /" to xargs: I can't
imagine that anybody sane should care really whether it results in
900+ execve's (for that 128kB limit) or just 20 (for some "optimal"
6MB limit).

And there is actually a somewhat real fear with the
"probe-and-recover" model: the already mentioned nasty case "sometimes
we don't return E2BIG, we might just succeed the execve, but then kill
the process very early due to some _other_ limit being triggered".

That nasty case really shouldn't happen, but it was the case we
considered (and rejected) for suid binaries.

So that's the real reason the kernel behavior changed: we had a
security issue with the "we allow basically unlimited stack growth"
where a huge argv/envp could be used to grow the stack into some other
segment.

Realistically, it was only a security issue with suid binaries, but as
mentioned in this thread, the problem is that we do the argument
copying even before we've decided whether the execve is going to be a
suid execve.

So then - exactly so that we do *not* have that nasty case of "execve
succeeds, but we kill the process immediately if it turns into a
potential security issue", we do that "limit the stack growth for
everybody".

                Linus