Re: data loss when doing ls-remote and piped to command

From: Linus Torvalds <torvalds@linux-foundation.org>
To: Rolf Eike Beer <eb@emlix.com>
Cc: Git List Mailing <git@vger.kernel.org>,
	Tobias Ulmer <tu@emlix.com>, Junio C Hamano <gitster@pobox.com>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>
Subject: Re: data loss when doing ls-remote and piped to command
Date: Thu, 16 Sep 2021 10:11:22 -0700	[thread overview]
Message-ID: <CAHk-=wgyk0mwYcMRC8HakzoAKL2Y3gwzD433tqKYYhV+r1PLnA@mail.gmail.com> (raw)
In-Reply-To: <2677927.DK6gFqPMyL@devpool47>

On Thu, Sep 16, 2021 at 5:17 AM Rolf Eike Beer <eb@emlix.com> wrote:
>
> Am Donnerstag, 16. September 2021, 12:12:48 CEST schrieb Tobias Ulmer:
> > > The redirection seems to be an important part of it. I now did:
> > >
> > > git ... 2>&1 | sha256sum
> >
> > I've tried to reproduce this since yesterday, but couldn't until now:
> >
> > 2>&1 made all the difference, took less than a minute.

So if that redirection is what matters, and what causes problems, I
can almost guarantee that the reason is very simple:

Your git repository (or more likely your upstream) has some problem,
it's getting reported on stderr, and because you mix stdout and stderr
with that '2>&1', you get randomly mixed output.

Then it depends on timing where the mixing happens.

Or rather, it depends on various different factors, like the buffering
done internally by stdio (where stdout generally will be
block-buffered, while stderr is usually line-buffered, which is why
you get odd mixing of the two).

But timing can be an effect particularly with "git ls-remote" and
friends, because you may get errors from the transport asynchronously.

So the different buffering ends up causing the effect of mixing things
in the middle of lines, while the timing differences due to the
asynchronous nature of the remote access pipeline will likely then
cause that odd mixing to be different.

End result: corrupted lines, and different sha256sum every time.

> > Running the same on Archlinux (5.13.13-arch1-1, 2.33.0) doesn't show the
> > problem.
> > This may well turn out not to be git, but a kernel issue.

Much more likely that the other box just doesn't have the error situation.

> since you have been hacking around in pipe.c recently, I fear this isn't
> entirely impossible. Have you any idea?

Almost certainly not the kernel. Kernel - and other - differences
could affect timing, of course, but the whole "2>&1" really is
fundamentally bogus.

If you don't have any errors, then the "2>&1" doesn't matter.

And if you *do* have errors, then by definition the "2>&1" will mix in
the errors with the output randomly and piping them together is
senseless.

Either way, it's wrong.

So what I'd suggest Tobias should do is

    git ... 2> err | sha256sum

which will send the errors to the "err" file. Take a look at that file
afterwards and see what is in it.

Basically, '2&>1" is almost never the right thing to do, unless you
explicitly don't care about the output and just want to suppress it.

So "2&>1 > /dev/null" is common and natural.

Of course, people also use it when they just want to eyeball the
errors mixed in, so doing that

   ... 2&>1 | less

thing isn't necessarily *wrong*, but it's somewhat dangerous and
confusing. Because when you do it you do need to be very aware of the
fact that the errors and output will be *mixed*. And the mixing will
not necessarily be at all sensible.

Finally: pipes on a low level guarantee certain atomicity constraints,
so if you do low-level "write()" calls of size PIPE_BUF or less, the
contents will not be interleaved randomly.  HOWEVER. That's only true
at that "write()" level. The moment you use <stdio> for your IO, you
have buffering inside of the standard IO libraries, and if your code
isn't explicitly very careful about it, using setbuf() and fflush()
and friends, you'll get that random mixing.

Anyway. That was a long email just to tell people it's almost
certainly user error, not the kernel.

            Linus