git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [BUG] clone of large repo fails when server closes idle SSH session during "Resolving deltas"
@ 2021-05-19  3:46 Greg Pflaum
  2021-05-19  9:34 ` Jeff King
  0 siblings, 1 reply; 4+ messages in thread
From: Greg Pflaum @ 2021-05-19  3:46 UTC (permalink / raw)
  To: git

Git's handling of the SSH session during "git clone" changed between Git
2.17.0 and 2.31.1, causing cloning of a large repo to fail when the server
closes the idle session during the "Resolving deltas" phase.

In the older version, git closes the SSH session after "Receiving objects"
and before "Resolving deltas". In the newer version, git doesn't close the
SSH session until after "Resolving deltas" finishes. In the newer version,
if "Resolving deltas" takes a long time (over 2 minutes for our large 
repo)
and the server closes the idle SSH session (60 seconds for our GitHub
Enterprise server), git displays the error message "Connection to
git.example.com closed by remote host", finishes resolving deltas, then 
exits
without checking out files.

Successful with Git 2.17.0:

> git clone git@git.example.com:org/bigrepo.git
Cloning into 'bigrepo'...
remote: Enumerating objects: 260, done.
remote: Counting objects: 100% (260/260), done.
remote: Compressing objects: 100% (170/170), done.
remote: Total 361839 (delta 74), reused 174 (delta 37), pack-reused 361579
Receiving objects: 100% (361839/361839), 4.47 GiB | 36.71 MiB/s, done.
Resolving deltas: 100% (252249/252249), done.
Checking out files: 100% (59466/59466), done.

Fails with Git 2.31.1:

git clone git@git.example.com:org/bigrepo.git
Cloning into 'bigrepo'...
remote: Enumerating objects: 363, done.
remote: Counting objects: 100% (363/363), done.
remote: Compressing objects: 100% (221/221), done.
Receiving objects: 100% (361930/361930), 4.47 GiB | 36.37 MiB/s, done.
RConnection to git.example.com closed by remote host.esolving deltas:  39% 
(98729/252244)
Resolving deltas: 100% (252244/252244), done.

What did you do before the bug happened? (Steps to reproduce your issue)

Ran "git clone git@git.example.com:org/bigrepo.git"

What did you expect to happen? (Expected behavior)

Expected successful creation of a cloned repo.

What happened instead? (Actual behavior)

Repo was partially created, but files were not checked out. Branches and 
tags
also seem to be missing from the repo.

What's different between what you expected and what actually happened?

With the old Git version, cloning the repo is successful. With the newer
version it fails.

Anything else you want to add:

A workaround is to configure the SSH client to prevent the server from
seeing an idle session, by adding these lines to .ssh/config:

Host git.example.com
    # Send keep-alive message to server every 30 seconds
    # so the server won't close an idle session.
    ServerAliveInterval 30


[System Info]
git version:
git version 2.31.1.windows.1
cpu: x86_64
built from commit: c5f0be26a7e3846e3b6268d1c6c4800d838c6bbb
sizeof-long: 4
sizeof-size_t: 8
shell-path: /bin/sh
feature: fsmonitor--daemon
uname: Windows 10.0 14393
compiler info: gnuc: 10.2
libc info: no libc information available
$SHELL (typically, interactive shell): <unset>


[Enabled Hooks]
not run from a git repository - no hooks to show


^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [BUG] clone of large repo fails when server closes idle SSH session during "Resolving deltas"
  2021-05-19  3:46 [BUG] clone of large repo fails when server closes idle SSH session during "Resolving deltas" Greg Pflaum
@ 2021-05-19  9:34 ` Jeff King
  2021-05-19  9:57   ` Jeff King
  2021-05-19 16:11   ` [PATCH] fetch-pack: signal v2 server that we are done making requests Jeff King
  0 siblings, 2 replies; 4+ messages in thread
From: Jeff King @ 2021-05-19  9:34 UTC (permalink / raw)
  To: Greg Pflaum; +Cc: git

On Tue, May 18, 2021 at 11:46:17PM -0400, Greg Pflaum wrote:

> Git's handling of the SSH session during "git clone" changed between Git
> 2.17.0 and 2.31.1, causing cloning of a large repo to fail when the server
> closes the idle session during the "Resolving deltas" phase.

Interesting find.

During that phase, all communication with the server is finished. We're
not expecting it to say anything else, and I'd have actually expected us
to have hung up the connection.

But I can definitely reproduce the issue by fetching even a moderate
sized repository like git.git, and observing that ssh is still running
when we hit the "Resolving deltas" phase. And indeed, killing it at that
step shows the problem:

  $ git clone --bare git@github.com:git/git foo.git
  Cloning into bare repository 'foo.git'...
  remote: Enumerating objects: 307777, done.
  remote: Counting objects: 100% (114/114), done.
  remote: Compressing objects: 100% (48/48), done.
  remote: Total 307777 (delta 72), reused 94 (delta 66), pack-reused 307663
  Receiving objects: 100% (307777/307777), 159.14 MiB | 30.71 MiB/s, done.
  Resolving deltas: 100% (229729/229729), done.
  error: ssh died of signal 9

In another terminal I waited for it to hit "resolving" and ran:

  kill -9 $(ps ax | grep ssh | grep github | awk '{print $1}')

You can see that we did completely successfully receive the incoming
pack. It's just that we then do an over-eager check of ssh's exit code
and complain when disconnecting the transport.

I had a hunch that this was related to the v2 protocol (which became the
default between the two versions you mentioned). And indeed, running
"git clone -c protocol.version=0 clone ..." makes it go away.

So I'd guess that in the v0 protocol, we close the pipe going to ssh's
stdin (which does a half-duplex shutdown, and then when the server side
closes its pipe, ssh exits completely).

But in v2, we presumably don't. Which is not too surprising; v2's view
of ssh is much more as a transport over which it will make several
request/response pairs. So the caller would have to explicitly indicate
that this is the final request, and the transport can be terminated
after that. That doesn't seem too complex conceptually, but I worry
implementing it will run into conflicts with how the v2 code works.

Another side issue is that once the protocol conversation has finished,
I'm not sure if it's really useful for us to detect and complain about
ssh's exit code. We know the other side completed the conversation
successfully, and we have nothing left to ask it. So a fix for your
immediate pain would be to stop noticing that. I think the root issue is
still worth addressing, though; we are tying up network and local
resources with a useless to-be-closed ssh connection.

-Peff

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [BUG] clone of large repo fails when server closes idle SSH session during "Resolving deltas"
  2021-05-19  9:34 ` Jeff King
@ 2021-05-19  9:57   ` Jeff King
  2021-05-19 16:11   ` [PATCH] fetch-pack: signal v2 server that we are done making requests Jeff King
  1 sibling, 0 replies; 4+ messages in thread
From: Jeff King @ 2021-05-19  9:57 UTC (permalink / raw)
  To: Greg Pflaum; +Cc: Taylor Blau, git

On Wed, May 19, 2021 at 05:34:40AM -0400, Jeff King wrote:

> Another side issue is that once the protocol conversation has finished,
> I'm not sure if it's really useful for us to detect and complain about
> ssh's exit code. We know the other side completed the conversation
> successfully, and we have nothing left to ask it. So a fix for your
> immediate pain would be to stop noticing that. I think the root issue is
> still worth addressing, though; we are tying up network and local
> resources with a useless to-be-closed ssh connection.

By the way, there's an interesting subtlety / bug related to this. While
"git clone" does return a failed exit code in this case, it leaves the
repository directory in place! And because no real error occurred with
the clone, you can use it as usual (though I think if it's a non-bare
clone, you'd need to run "git checkout" fill in the working tree).

Propagating the error code comes from aab179d937 (builtin/clone.c: don't
ignore transport_fetch_refs() errors, 2020-12-03). So prior to Git
v2.30.0, your case would kind-of work.

But:

  - I think that is just "clone"; a "git fetch" would always have
    propagated the error from transport_fetch_refs()

  - that commit was right to start propagating the error code from
    transport_fetch_refs(). While in this specific case, we happened to
    produce a useful repository directory, most other errors would not.

  - there _is_ a bug in aab179d937, though. When it sees the error it
    should clean up the repo directory. And that even happens
    automatically via an atexit() handler. But because rather than
    calling die() it jumps to cleanup code, it mistakenly sets the flag
    for "leave the directory in place".

  - any logic to ignore errors would have to be inside the transport
    code (i.e., it realizes that ssh exiting non-zero isn't a big deal
    anymore, and then transport_fetch_refs() still returns success).

-Peff

^ permalink raw reply	[flat|nested] 4+ messages in thread

* [PATCH] fetch-pack: signal v2 server that we are done making requests
  2021-05-19  9:34 ` Jeff King
  2021-05-19  9:57   ` Jeff King
@ 2021-05-19 16:11   ` Jeff King
  1 sibling, 0 replies; 4+ messages in thread
From: Jeff King @ 2021-05-19 16:11 UTC (permalink / raw)
  To: Greg Pflaum; +Cc: git

On Wed, May 19, 2021 at 05:34:40AM -0400, Jeff King wrote:

> On Tue, May 18, 2021 at 11:46:17PM -0400, Greg Pflaum wrote:
> 
> > Git's handling of the SSH session during "git clone" changed between Git
> > 2.17.0 and 2.31.1, causing cloning of a large repo to fail when the server
> > closes the idle session during the "Resolving deltas" phase.
> 
> Interesting find.
> 
> During that phase, all communication with the server is finished. We're
> not expecting it to say anything else, and I'd have actually expected us
> to have hung up the connection.

OK, here's the fix I came up with. It's a two-line code change, of
course, but I tried to lay out all of my thinking about why this is the
right thing to do. Apologies in advance for the length. :)

-- >8 --
Subject: [PATCH] fetch-pack: signal v2 server that we are done making requests

When fetching with the v0 protocol over ssh (or a local upload-pack with
pipes), the server closes the connection as soon as it is finished
sending the pack. So even though the client may still be operating on
the data via index-pack (e.g., resolving deltas, checking connectivity,
etc), the server has released all resources.

With the v2 protocol, however, the server considers the ssh session only
as a transport, with individual requests coming over it. After sending
the pack, it goes back to its main loop, waiting for another request to
come from the client. As a result, the ssh session hangs around until
the client process ends, which may be much later (because resolving
deltas, etc, may consume a lot of CPU).

This is bad for two reasons:

  - it's consuming resources on the server to leave open a connection
    that won't see any more use

  - if something bad happens to the ssh connection in the meantime (say,
    it gets killed by the network because it's idle, as happened in a
    real-world report), then ssh will exit non-zero, and we'll propagate
    the error up the stack.

The server is correct here not to hang up after serving the pack. The v2
protocol's design is meant to allow multiple requests like this, and
hanging up would be the wrong thing for a hypothetical client which was
planning to make more requests (though in practice, the git.git client
never would, and I doubt any other implementations would either).

The right thing is instead for the client to signal to the server that
it's not interested in making more requests. We can do that by closing
the pipe descriptor we use to write to ssh. This will propagate to the
server upload-pack as an EOF when it tries to read the next request (and
then it will close its half, and the whole connection will go away).

It's important to do this "half duplex" shutdown, because we have to do
it _before_ we actually receive the pack. This is an artifact of the way
fetch-pack and index-pack (or unpack-objects) interact. We hand the
connection off to index-pack (really, a sideband demuxer which feeds
it), and then wait until it returns. And it doesn't do that until it has
resolved all of the deltas in the pack, even though it was done reading
from the server long before.

So just closing the connection fully after index-pack returns would be
too late; we'd have held it open much longer than was necessary. And
teaching index-pack to close the connection is awkward. It's not even
seeing the whole conversation (the sideband demuxer is, but it doesn't
actually know what's in the packets, or when the end comes).

Note that this close() is happening deep within the transport code. It's
possible that a caller would want to perform other operations over the
same ssh transport after receiving the pack. But as of the current code,
none of the callers do, and there haven't been discussions of any plans
to change this. If we need to support that later, we can probably do so
by passing down a flag for "you're the last request on the transport;
it's OK to close" instead of the code just assuming that's true.

The description above all discusses v2 ssh, so it's worth thinking about
how this interacts with other protocols:

  - in v0 protocols, we could do the same half-duplex shutdown (it just
    goes into the v0 do_fetch_pack() instead). This does work, but since
    it doesn't have the same persistence problem in the first place,
    there's little reason to change it at this point.

  - local fetches against git-upload-pack on the same machine will
    behave the same as ssh (they are talking over two pipes, and see EOF
    on their input pipe)

  - fetches against git-daemon will run this same code, and close one of
    the descriptors. In practice, this won't do anything, since there
    our two descriptors are dups of each other, and not part of a
    half-duplex pair. The right thing would probably be to call
    shutdown(SHUT_WR) on it. I didn't bother with that here. It doesn't
    face the same error-code problem (since it's just a TCP connection),
    so it's really only an optimization problem. And git:// is not that
    widely used these days, and has less impact on server resources than
    an ssh termination.

  - v2 http doesn't suffer from this problem in the first place, as our
    pipes terminate at a local git-remote-https, which is passing data
    along as individual requests via curl. Probably curl is keeping the
    TCP/TLS connection open for more requests, and we might be able to
    tell it manually "hey, we are done making requests now". But I think
    that's much less important. It again doesn't suffer from the
    error-code problem, and HTTP keepalive is pretty well understood
    (importantly, the timeouts can be set low, because clients like curl
    know how to reconnect for subsequent requests if necessary). So it's
    probably not worth figuring out how to tell curl that we're done
    (though if we do, this patch is probably the first step anyway;
    fetch-pack closes the pipe back to remote-https, which would be the
    signal that it should tell curl we're done).

The code is pretty straightforward. We close the pipe at the right
moment, and set it to -1 to mark it as invalid. I modified the later
cleanup code to avoid calling close(-1). That's not strictly necessary,
since close(-1) is a noop, but hopefully makes things a bit more obvious
to a reader.

I suspect that trying to call more transport functions after the close()
(e.g., calling transport_fetch_refs() again) would fail, as it's not
smart enough to realize we need to re-open the ssh connection. But
that's already true when v0 is in use. And no current callers want to do
that (and again, the solution is probably a flag in the transport code
to keep things open, which can be added later).

There's no test here, as the situation it covers is inherently racy (the
question is when upload-pack exits, compared to when index-pack finishes
resolving deltas and exits). The rather gross shell snippet below does
recreate the problematic situation; when run on a sufficiently-large
repository (git.git works fine), it kills an "idle" upload-pack while
the client is resolving deltas, leading to a failed clone.

-- >8 --
(
	git clone --no-local --progress . foo.git 2>&1
	echo >&2 "clone exit code=$?"
) |
tr '\r' '\n' |
while read line
do
	case "$done,$line" in
	,Resolving*)
		echo "hit resolving deltas; killing upload-pack"
		killall -9 git-upload-pack
		done=t
		;;
	esac
done
-- >8 --

Reported-by: Greg Pflaum <greg.pflaum@pnp-hcl.com>
Signed-off-by: Jeff King <peff@peff.net>
---
 fetch-pack.c | 9 +++++++++
 transport.c  | 6 ++++--
 2 files changed, 13 insertions(+), 2 deletions(-)

diff --git a/fetch-pack.c b/fetch-pack.c
index c135635e34..b0c7be717c 100644
--- a/fetch-pack.c
+++ b/fetch-pack.c
@@ -1645,6 +1645,15 @@ static struct ref *do_fetch_pack_v2(struct fetch_pack_args *args,
 			if (process_section_header(&reader, "packfile-uris", 1))
 				receive_packfile_uris(&reader, &packfile_uris);
 			process_section_header(&reader, "packfile", 0);
+
+			/*
+			 * this is the final request we'll make of the server;
+			 * do a half-duplex shutdown to indicate that they can
+			 * hang up as soon as the pack is sent.
+			 */
+			close(fd[1]);
+			fd[1] = -1;
+
 			if (get_pack(args, fd, pack_lockfiles,
 				     packfile_uris.nr ? &index_pack_args : NULL,
 				     sought, nr_sought, &fsck_options.gitmodules_found))
diff --git a/transport.c b/transport.c
index 6cf3da19eb..50f5830eb6 100644
--- a/transport.c
+++ b/transport.c
@@ -427,7 +427,8 @@ static int fetch_refs_via_pack(struct transport *transport,
 
 cleanup:
 	close(data->fd[0]);
-	close(data->fd[1]);
+	if (data->fd[1] >= 0)
+		close(data->fd[1]);
 	if (finish_connect(data->conn))
 		ret = -1;
 	data->conn = NULL;
@@ -869,7 +870,8 @@ static int disconnect_git(struct transport *transport)
 		if (data->got_remote_heads && !transport->stateless_rpc)
 			packet_flush(data->fd[1]);
 		close(data->fd[0]);
-		close(data->fd[1]);
+		if (data->fd[1] >= 0)
+			close(data->fd[1]);
 		finish_connect(data->conn);
 	}
 
-- 
2.32.0.rc0.424.g95d3bde94f


^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2021-05-19 16:11 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-05-19  3:46 [BUG] clone of large repo fails when server closes idle SSH session during "Resolving deltas" Greg Pflaum
2021-05-19  9:34 ` Jeff King
2021-05-19  9:57   ` Jeff King
2021-05-19 16:11   ` [PATCH] fetch-pack: signal v2 server that we are done making requests Jeff King

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).