All of lore.kernel.org
 help / color / mirror / Atom feed
* t0301-credential-cache test failure on cygwin
@ 2022-07-07  1:50 Ramsay Jones
  2022-07-07  6:15 ` Junio C Hamano
  2022-07-07 18:12 ` Jeff King
  0 siblings, 2 replies; 15+ messages in thread
From: Ramsay Jones @ 2022-07-07  1:50 UTC (permalink / raw)
  To: GIT Mailing-list, Adam Dinwoodie; +Cc: Johannes Schindelin, Junio C Hamano


During the v2.34.0 development cycle, between -rc1 and -rc2, as a result of an
update to cygwin, test t0301-credential-cache started failing on cygwin (it had
never failed before then). This was noted in [1] and later Adam (cygwin git
maintainer) confirmed this was due to a change in cygwin (there had been many
changes to the 'pipe' code).

Given that 'ssh-agent' provides all of my 'password' keeping needs, I have to
admit that, at the time, I was not overly concerned about the failure. :(
(Also, it was a cygwin failure, so no need to fix git).

However, I had some time to kill tonight, so I decided to take a _quick_ look
to see if there was something that could be done ... (famous last words).

  $ cd t
  $ ./t0301-credential-cache.sh -i -v -x

  ...

  expecting success of 0301.13 'socket defaults to ~/.cache/git/credential/socket':
          test_when_finished "
                  git credential-cache exit &&
                  rmdir -p .cache/git/credential/
          " &&
          test_path_is_missing "$HOME/.git-credential-cache" &&
          test_path_is_socket "$HOME/.cache/git/credential/socket"
  
  ++ test_when_finished '
                  git credential-cache exit &&
                  rmdir -p .cache/git/credential/
          '
  ++ test 0 = 0
  ++ test_cleanup='{
                  git credential-cache exit &&
                  rmdir -p .cache/git/credential/
  
                  } && (exit "$eval_ret"); eval_ret=$?; :'
  ++ test_path_is_missing '/home/ramsay/git/t/trash directory.t0301-credential-cache/.git-credential-cache'
  ++ test 1 -ne 1
  ++ test -e '/home/ramsay/git/t/trash directory.t0301-credential-cache/.git-credential-cache'
  ++ test_path_is_socket '/home/ramsay/git/t/trash directory.t0301-credential-cache/.cache/git/credential/socket'
  ++ test -S '/home/ramsay/git/t/trash directory.t0301-credential-cache/.cache/git/credential/socket'
  ++ git credential-cache exit
  fatal: read error from cache daemon: Software caused connection abort
  ++ eval_ret=128
  ++ :
  not ok 13 - socket defaults to ~/.cache/git/credential/socket
  #
  #               test_when_finished "
  #                       git credential-cache exit &&
  #                       rmdir -p .cache/git/credential/
  #               " &&
  #               test_path_is_missing "$HOME/.git-credential-cache" &&
  #               test_path_is_socket "$HOME/.cache/git/credential/socket"
  #
  1..13
  ++ git credential-cache exit
  ++ exit 128
  ++ eval_ret=128
  ++ :
  $

Around the time of v2.34.0-rc2, problems with the cygwin 'unix-stream-socket'
emulation caused some t0052-simple-ipc tests to fail (indeed the mailing-list
thread containing [1] was actually related to that failure). These problems
were fixed in commit 974ef7ced2 (simple-ipc: work around issues with Cygwin's
Unix socket emulation, 2021-11-10).

So, I was half expecting the failure to be noted at the 'test_path_is_socket'
line in the test. As you can see, it actually fails in the 'test_when_finished'
block with the 'git credential-cache exit', and the failure given as:

  fatal: read error from cache daemon: Software caused connection abort

looking at builtin/credential-cache.c, lines 40-66, we see:

    40	static int send_request(const char *socket, const struct strbuf *out)
    41	{
    42		int got_data = 0;
    43		int fd = unix_stream_connect(socket, 0);
    44	
    45		if (fd < 0)
    46			return -1;
    47	
    48		if (write_in_full(fd, out->buf, out->len) < 0)
    49			die_errno("unable to write to cache daemon");
    50		shutdown(fd, SHUT_WR);
    51	
    52		while (1) {
    53			char in[1024];
    54			int r;
    55	
    56			r = read_in_full(fd, in, sizeof(in));
    57			if (r == 0 || (r < 0 && connection_closed(errno)))
    58				break;
    59			if (r < 0)
    60				die_errno("read error from cache daemon");
    61			write_or_die(1, in, r);
    62			got_data = 1;
    63		}
    64		close(fd);
    65		return got_data;
    66	}

So, the read_in_full() call is returning an error, the error code is not
one that connection_closed() recognised and die_errno() is called at
line #60 (with errno=113).

Looking in the <errno.h> header file, we see:

  #define ECONNABORTED 113        /* Software caused connection abort */

Now, I find this a little odd, since most descriptions of ECONNABORTED
indicate that it is an error caused by the client but returned to the
server from the accept() call. This is definitely not what is happening
here. :)

Anyway, the somewhat loose meanings assigned to error codes aside, one
simple solution would be:

  diff --git a/builtin/credential-cache.c b/builtin/credential-cache.c
  index 78c02ad531..84fd513c62 100644
  --- a/builtin/credential-cache.c
  +++ b/builtin/credential-cache.c
  @@ -27,7 +27,7 @@ static int connection_fatally_broken(int error)
   
   static int connection_closed(int error)
   {
  -	return (error == ECONNRESET);
  +	return (error == ECONNRESET) || (error == ECONNABORTED);
   }
   
   static int connection_fatally_broken(int error)
  --

.. which does, indeed, 'fix' the problem. (Well, it side-steps the problem,
really).

Having deleted the above patch, I now had a look at the server side. Tracing
out the server execution showed no surprises - everything progressed as one
would expect and it 'exit(0)'-ed correctly! The relevant part of the code to
process a client request (in the serve_one_client() function, lines 132-142
in builtin/credential-cache--daemon.c) looks like:

	else if (!strcmp(action.buf, "exit")) {
		/*
		 * It's important that we clean up our socket first, and then
		 * signal the client only once we have finished the cleanup.
		 * Calling exit() directly does this, because we clean up in
		 * our atexit() handler, and then signal the client when our
		 * process actually ends, which closes the socket and gives
		 * them EOF.
		 */
		exit(0);
	}

Now, the comment doesn't make clear to me why "it's important that we clean
up our socket first" and, indeed, whether 'socket' refers to the socket
descriptor or the socket file. In the past, all of my unix-stream-socket
servers have closed the socket descriptor and then unlink()-ed the socket
file before exit(), with no 'atexit' calls in sight (lightly cribbed from a
30+ years old Unix Network programming book by Stevens - or was it the Comer
book - or maybe the Comer and Stevens book - I forget!).

The C standard (see C11 7.22.4.4) says that the 'exit' function first calls
all functions registered by 'atexit', then all open streams with unwritten
buffered data are flushed, all open streams are closed, and all files created
by the 'tmpfile' function are removed.

The C standard can only refer to 'FILE *' streams, but POSIX says practically
the same, but in addition (see 'Consequences of Process Termination'), that
"All of the file descriptors, directory streams, conversion descriptors,
and message catalog descriptors open in the calling process shall be closed."

Note that the above descriptions do not nail down the _exact_ ordering of
certain cleanup operations. (eg do the file descriptors get closed in a
low->high, high->low or random order)? So, if the cleanup operations can
fail, depending on the order of operations, then the results could be
undefined.

Anyway, I started playing around with flushing/closing of 'FILE *' streams
before the 'exit' call, to change the order, relative to the socket-file
deletion in the 'atexit' function (or the closing of the listen-socket
descriptor, come to that). In particular, I found that if I were to close
the 'in'put stream, then the client would receive an EOF and exit normally
(ie no error return from read_in_full() above).

[fclose(in); fclose(out) also works, but fclose(out) on its own does not.
fflush() in various combinations did not work at all].

So, the following patch also provides a 'fix' for this issue (although it
would require a change to the comment for a _real_ patch):

  diff --git a/builtin/credential-cache--daemon.c b/builtin/credential-cache--daemon.c
  index 4c6c89ab0d..556393498f 100644
  --- a/builtin/credential-cache--daemon.c
  +++ b/builtin/credential-cache--daemon.c
  @@ -138,6 +138,7 @@ static void serve_one_client(FILE *in, FILE *out)
   		 * process actually ends, which closes the socket and gives
   		 * them EOF.
   		 */
  +		fclose(in);
   		exit(0);
   	}
   	else if (!strcmp(action.buf, "erase"))
  --

Having noticed that the 'timeout' test was not failing, I decided to try
making the 'action=exit' code-path behave more like the timeout code, as
far as exiting the server is concerned. Indeed, you might ask why the
timeout code doesn't just 'exit(0)' as well ...

Anyway, the following patch does that, and it also provides a 'fix' for this
issue!

  diff --git a/builtin/credential-cache--daemon.c b/builtin/credential-cache--daemon.c
  index 4c6c89ab0d..fb9b1e04a6 100644
  --- a/builtin/credential-cache--daemon.c
  +++ b/builtin/credential-cache--daemon.c
  @@ -114,11 +114,12 @@ static int read_request(FILE *fh, struct credential *c,
   	return 0;
   }
   
  -static void serve_one_client(FILE *in, FILE *out)
  +static int serve_one_client(FILE *in, FILE *out)
   {
   	struct credential c = CREDENTIAL_INIT;
   	struct strbuf action = STRBUF_INIT;
   	int timeout = -1;
  +	int serve = 1; /* ie continue to serve clients */
   
   	if (read_request(in, &c, &action, &timeout) < 0)
   		/* ignore error */ ;
  @@ -130,15 +131,8 @@ static void serve_one_client(FILE *in, FILE *out)
   		}
   	}
   	else if (!strcmp(action.buf, "exit")) {
  -		/*
  -		 * It's important that we clean up our socket first, and then
  -		 * signal the client only once we have finished the cleanup.
  -		 * Calling exit() directly does this, because we clean up in
  -		 * our atexit() handler, and then signal the client when our
  -		 * process actually ends, which closes the socket and gives
  -		 * them EOF.
  -		 */
  -		exit(0);
  +		/* stop serving clients */
  +		serve = 0;
   	}
   	else if (!strcmp(action.buf, "erase"))
   		remove_credential(&c);
  @@ -157,6 +151,7 @@ static void serve_one_client(FILE *in, FILE *out)
   
   	credential_clear(&c);
   	strbuf_release(&action);
  +	return serve;
   }
   
   static int serve_cache_loop(int fd)
  @@ -179,6 +174,7 @@ static int serve_cache_loop(int fd)
   	if (pfd.revents & POLLIN) {
   		int client, client2;
   		FILE *in, *out;
  +		int serve;
   
   		client = accept(fd, NULL, NULL);
   		if (client < 0) {
  @@ -194,9 +190,10 @@ static int serve_cache_loop(int fd)
   
   		in = xfdopen(client, "r");
   		out = xfdopen(client2, "w");
  -		serve_one_client(in, out);
  +		serve = serve_one_client(in, out);
   		fclose(in);
   		fclose(out);
  +		return serve;
   	}
   	return 1;
   }
  -- 

[Note the 'serve' variable was the least objectionable name I could come up
with, but it could probably be improved!]

So, we now have three patches which 'fix' the issue. What does this tell us?
Well, not an awful lot! ;-)

I repeat, until I updated to v3.3.2 of cygwin, none of these patches were
required, since the test did not fail. In order to progress, someone needs
to bisect the commits in the cygwin git repository to find the commit which
changed the behaviour of the unix stream socket emulation.

I had a look at the cygwin gitweb site [2], and started reading the commit
messages from 'cygwin-3_3_2-release' tag, but nothing stood out. Also, I did
not note (in [1]) which version of cygwin I updated _from_, so I don't know
where to stop! (Sometimes I update often, sometimes not for 6-9 months!
I have a vague feeling that I was in a reasonably often update tempo, so
there may be no need to go too far back).

[Also, I spent loads of time reading the code [3] to the 'fhandler_socket_unix'
class, only to notice that it is conditional on the '__WITH_AF_UNIX'
pre-processor variable. I can't find any code which #defines it, so ...]

[Also, in [1] I noted that the v3.3.2 update added an hour to the runtime
of the testsuite. If I run the test something like:

  $ export GIT_TEST_CHAIN_LINT=0
  $ export TEST_NO_MALLOC_CHECK=yes
  $ make test >test-out-2-37-rel 2>&1

.. then I can get back about 45min of that hour!

(I also have DEFAULT_TEST_TARGET=prove and GIT_PROVE_OPTS='--timer' in my
config.mak)]

Unfortunately, I can't be the one to investigate the change to cygwin, so I
will have to leave that to someone with the necessary knowledge/skill with
the cygwin codebase. (The last time I built/debugged the cygwin .dll, the
repo was in cvs! Yes, many, many years ago. If I remember correctly, building
the .dll wasn't really the problem - it was the testing that was the *huge*
PITA, especially if you have services running).

It seems unlikely, but it is possible (despite what I said earlier), that the
fault lies with git - it could be relying on implementation behaviour. Until
we know how cygwin changed, we will not be able to determine a suitable fix.

Hmm, I will send the three RFC patches to the list, just to make it slightly
easier for someone to try them out, if they wish.

Sorry for not looking at this earlier and for not finding a suitable fix.

It's getting late ...


ATB,
Ramsay Jones

[1] https://lore.kernel.org/git/02b6cfa8-bd0a-275d-fce0-a3a9f316ded3@ramsayjones.plus.com/
[2] https://cygwin.com/git/newlib-cygwin.git
[3] https://cygwin.com/git/?p=newlib-cygwin.git;a=blob;f=winsup/cygwin/fhandler_socket_unix.cc;h=8abb581b999e79ad71ddd19a1557e2be0fd9d757;hb=5cc4b92c25af1af9250e4498e60943fcc82b7487



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: t0301-credential-cache test failure on cygwin
  2022-07-07  1:50 t0301-credential-cache test failure on cygwin Ramsay Jones
@ 2022-07-07  6:15 ` Junio C Hamano
  2022-07-07 15:17   ` Ramsay Jones
  2022-07-07 18:12 ` Jeff King
  1 sibling, 1 reply; 15+ messages in thread
From: Junio C Hamano @ 2022-07-07  6:15 UTC (permalink / raw)
  To: Ramsay Jones; +Cc: GIT Mailing-list, Adam Dinwoodie, Johannes Schindelin

Ramsay Jones <ramsay@ramsayjones.plus.com> writes:

> However, I had some time to kill tonight, so I decided to take a _quick_ look
> to see if there was something that could be done ... (famous last words).
> ...
>   diff --git a/builtin/credential-cache.c b/builtin/credential-cache.c
>   index 78c02ad531..84fd513c62 100644
>   --- a/builtin/credential-cache.c
>   +++ b/builtin/credential-cache.c
>   @@ -27,7 +27,7 @@ static int connection_fatally_broken(int error)
>    
>    static int connection_closed(int error)
>    {
>   -	return (error == ECONNRESET);
>   +	return (error == ECONNRESET) || (error == ECONNABORTED);
>    }

This feels like papering over the problem.

> Having noticed that the 'timeout' test was not failing, I decided to try
> making the 'action=exit' code-path behave more like the timeout code, as
> far as exiting the server is concerned. Indeed, you might ask why the
> timeout code doesn't just 'exit(0)' as well ...
>
> Anyway, the following patch does that, and it also provides a 'fix' for this
> issue!

If this codepath was written like this (i.e. [PATCH 1C]) from the
beginning, I would have found it very sensible (i.e. instead of
caling exit() in the middle of the infinite client serving loop,
exiting the loop cleanly is easier to follow and maintain), even if
we didn't know the issue on Cygwin you investigated.


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: t0301-credential-cache test failure on cygwin
  2022-07-07  6:15 ` Junio C Hamano
@ 2022-07-07 15:17   ` Ramsay Jones
  2022-07-07 18:19     ` Jeff King
                       ` (2 more replies)
  0 siblings, 3 replies; 15+ messages in thread
From: Ramsay Jones @ 2022-07-07 15:17 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: GIT Mailing-list, Adam Dinwoodie, Johannes Schindelin, Jeff King


[
Jeff: sorry for not CC:-ing you on the original email - I had intended
to do just that, but forgot! :(

See: https://lore.kernel.org/git/9dc3e85f-a532-6cff-de11-1dfb2e4bc6b6@ramsayjones.plus.com/
]

On 07/07/2022 07:15, Junio C Hamano wrote:
> Ramsay Jones <ramsay@ramsayjones.plus.com> writes:
> 
>> However, I had some time to kill tonight, so I decided to take a _quick_ look
>> to see if there was something that could be done ... (famous last words).
>> ...
>>   diff --git a/builtin/credential-cache.c b/builtin/credential-cache.c
>>   index 78c02ad531..84fd513c62 100644
>>   --- a/builtin/credential-cache.c
>>   +++ b/builtin/credential-cache.c
>>   @@ -27,7 +27,7 @@ static int connection_fatally_broken(int error)
>>    
>>    static int connection_closed(int error)
>>    {
>>   -	return (error == ECONNRESET);
>>   +	return (error == ECONNRESET) || (error == ECONNABORTED);
>>    }
> 
> This feels like papering over the problem.

Agreed, ... which is what I really meant by "(Well, it side-steps the
problem, really)."

>> Having noticed that the 'timeout' test was not failing, I decided to try
>> making the 'action=exit' code-path behave more like the timeout code, as
>> far as exiting the server is concerned. Indeed, you might ask why the
>> timeout code doesn't just 'exit(0)' as well ...
>>
>> Anyway, the following patch does that, and it also provides a 'fix' for this
>> issue!
> 
> If this codepath was written like this (i.e. [PATCH 1C]) from the
> beginning, I would have found it very sensible (i.e. instead of
> caling exit() in the middle of the infinite client serving loop,
> exiting the loop cleanly is easier to follow and maintain), even if
> we didn't know the issue on Cygwin you investigated.

Yep, apart from the variable name, I quite like the approach taken by
the 1C patch.

All three of these patches were really just "showing my working" and
allowing anyone to "follow along" without the hassle of trying to
scrape the diffs from the email.

As I said, I don't think we can determine a suitable fix without first
finding the cygwin commit which caused this test failure. But if we
can't determine this, for whatever reason, then I would favour a patch
to git based on the 1C patch. (Writing the commit message to justify the
change, without mentioning this cygwin issue, may be more challenging! :)

Also, I would like to understand why the code is written as it is
currently. I'm sure there must be a good reason - I just don't know
what it is! I suspect (ie I'm guessing), it has something to do with
operating in a high contention context [TOCTOU on socket?] ... dunno. ;-)

ATB,
Ramsay Jones



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: t0301-credential-cache test failure on cygwin
  2022-07-07  1:50 t0301-credential-cache test failure on cygwin Ramsay Jones
  2022-07-07  6:15 ` Junio C Hamano
@ 2022-07-07 18:12 ` Jeff King
  2022-07-07 18:26   ` Junio C Hamano
  2022-07-07 19:59   ` Ramsay Jones
  1 sibling, 2 replies; 15+ messages in thread
From: Jeff King @ 2022-07-07 18:12 UTC (permalink / raw)
  To: Ramsay Jones
  Cc: GIT Mailing-list, Adam Dinwoodie, Johannes Schindelin, Junio C Hamano

On Thu, Jul 07, 2022 at 02:50:21AM +0100, Ramsay Jones wrote:

> Having deleted the above patch, I now had a look at the server side. Tracing
> out the server execution showed no surprises - everything progressed as one
> would expect and it 'exit(0)'-ed correctly! The relevant part of the code to
> process a client request (in the serve_one_client() function, lines 132-142
> in builtin/credential-cache--daemon.c) looks like:
> 
> 	else if (!strcmp(action.buf, "exit")) {
> 		/*
> 		 * It's important that we clean up our socket first, and then
> 		 * signal the client only once we have finished the cleanup.
> 		 * Calling exit() directly does this, because we clean up in
> 		 * our atexit() handler, and then signal the client when our
> 		 * process actually ends, which closes the socket and gives
> 		 * them EOF.
> 		 */
> 		exit(0);
> 	}
> 
> Now, the comment doesn't make clear to me why "it's important that we clean
> up our socket first" and, indeed, whether 'socket' refers to the socket
> descriptor or the socket file. In the past, all of my unix-stream-socket
> servers have closed the socket descriptor and then unlink()-ed the socket
> file before exit(), with no 'atexit' calls in sight (lightly cribbed from a
> 30+ years old Unix Network programming book by Stevens - or was it the Comer
> book - or maybe the Comer and Stevens book - I forget!).

That comment refers to the socket file. If we close the handle to the
client before we clean up the socket file, then the client may finish
while the socket file is still there. So anybody expecting that:

  git credential-cache exit

is a sequencing operation will be fooled. One obvious thing is:

  git credential-cache exit
  git credential-cache store <some-cred

which is now racy; the second command may try to contact the socket for
the exiting daemon. It might actually handle that gracefully (because
the server wouldn't actually accept()) but I didn't check. But another
example, and the one that motivated that comment is:

  git credential-cache exit
  test_path_is_missing $HOME/.git-credential-cache/socket

which is exactly what our tests do. ;) See the discussion around here:

  https://lore.kernel.org/git/20160318061201.GA28102@sigill.intra.peff.net/

A few messages up we even discuss adding an fclose(). :)

> The C standard (see C11 7.22.4.4) says that the 'exit' function first calls
> all functions registered by 'atexit', then all open streams with unwritten
> buffered data are flushed, all open streams are closed, and all files created
> by the 'tmpfile' function are removed.

Right. That's exactly the order we're relying on: our atexit cleans up
the socket and _then_ the descriptor to the client is closed.

I think we could actually just call delete_tempfile() explicitly. Back
when the code was originally written, that would require making sure our
atexit() handler didn't do a double-cleanup, but these days it uses the
tempfile code, which I believe does the right thing.

I think that still wouldn't solve your cygwin problem, though; from the
client perspective the behavior would be the same.

> Anyway, I started playing around with flushing/closing of 'FILE *' streams
> before the 'exit' call, to change the order, relative to the socket-file
> deletion in the 'atexit' function (or the closing of the listen-socket
> descriptor, come to that). In particular, I found that if I were to close
> the 'in'put stream, then the client would receive an EOF and exit normally
> (ie no error return from read_in_full() above).

Right, but now you've introduced the race discussed above.

> Having noticed that the 'timeout' test was not failing, I decided to try
> making the 'action=exit' code-path behave more like the timeout code, as
> far as exiting the server is concerned.

I think this introduces a similar race.

> Indeed, you might ask why the timeout code doesn't just 'exit(0)' as well ...

It doesn't matter there, because there's no client expecting us to exit.
So the sequence of cleaning up our socket vs hanging up on the client
doesn't matter. And indeed, it's unlikely to even _have_ a client here,
as they'd have to connect and do their business right as our final
credential was hitting its expiration.

> So, we now have three patches which 'fix' the issue. What does this tell us?
> Well, not an awful lot! ;-)

Of the three, I actually like the client-side one to check errno the
best. The client is mostly "best effort". If it can't talk to the daemon
for whatever reason, then it becomes a noop (there is nothing it can
retrieve from the cache, and if it's trying to write, then oh well, the
cached value was immediately expired!).

So one could argue that _every_ read error should be silently ignored.
Calling die_errno() is mostly a nicety for debugging a broken setup, but
in normal use, the outcome is the same either way (and Git will
certainly ignore the exit code credential-cache anyway). I prefer the
"ignore known harmless errors" approach, possibly because I am often the
one debugging. ;) If ECONNABORTED is a harmless error we see in
practice, I don't mind adding it to the list (under the same rationale
as the current ECONNRESET that is there).

-Peff

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: t0301-credential-cache test failure on cygwin
  2022-07-07 15:17   ` Ramsay Jones
@ 2022-07-07 18:19     ` Jeff King
  2022-07-07 18:29     ` Jeff King
  2022-07-11  7:49     ` Adam Dinwoodie
  2 siblings, 0 replies; 15+ messages in thread
From: Jeff King @ 2022-07-07 18:19 UTC (permalink / raw)
  To: Ramsay Jones
  Cc: Junio C Hamano, GIT Mailing-list, Adam Dinwoodie, Johannes Schindelin

On Thu, Jul 07, 2022 at 04:17:08PM +0100, Ramsay Jones wrote:

> > If this codepath was written like this (i.e. [PATCH 1C]) from the
> > beginning, I would have found it very sensible (i.e. instead of
> > caling exit() in the middle of the infinite client serving loop,
> > exiting the loop cleanly is easier to follow and maintain), even if
> > we didn't know the issue on Cygwin you investigated.
> 
> Yep, apart from the variable name, I quite like the approach taken by
> the 1C patch.
> [...]
> Also, I would like to understand why the code is written as it is
> currently. I'm sure there must be a good reason - I just don't know
> what it is! I suspect (ie I'm guessing), it has something to do with
> operating in a high contention context [TOCTOU on socket?] ... dunno. ;-)

I wrote a longer reply in the thread, but just to be clear here: your 1C
will indeed introduce a race.

IMHO it is not worth switching away from the current code which calls
exit() to return up the stock. But if you wanted to do so without
introducing a race, I think you could call delete_tempfile() before
closing any streams, like this (on top of your 1C):

diff --git a/builtin/credential-cache--daemon.c b/builtin/credential-cache--daemon.c
index fb9b1e04a6..9b9cc1b70e 100644
--- a/builtin/credential-cache--daemon.c
+++ b/builtin/credential-cache--daemon.c
@@ -131,6 +131,7 @@ static int serve_one_client(FILE *in, FILE *out)
 		}
 	}
 	else if (!strcmp(action.buf, "exit")) {
+		delete_tempfile(&socket_file);
 		/* stop serving clients */
 		serve = 0;
 	}

but you'd have to make the socket_file struct globally available. And of
course it does not fix your cygwin problem, which I believe is mutually
exclusive with keeping the race-free ordering. ;)

-Peff

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: t0301-credential-cache test failure on cygwin
  2022-07-07 18:12 ` Jeff King
@ 2022-07-07 18:26   ` Junio C Hamano
  2022-07-07 19:59   ` Ramsay Jones
  1 sibling, 0 replies; 15+ messages in thread
From: Junio C Hamano @ 2022-07-07 18:26 UTC (permalink / raw)
  To: Jeff King
  Cc: Ramsay Jones, GIT Mailing-list, Adam Dinwoodie, Johannes Schindelin

Jeff King <peff@peff.net> writes:

> Of the three, I actually like the client-side one to check errno the
> best. The client is mostly "best effort". If it can't talk to the daemon
> for whatever reason, then it becomes a noop (there is nothing it can
> retrieve from the cache, and if it's trying to write, then oh well, the
> cached value was immediately expired!).
>
> So one could argue that _every_ read error should be silently ignored.
> Calling die_errno() is mostly a nicety for debugging a broken setup, but
> in normal use, the outcome is the same either way (and Git will
> certainly ignore the exit code credential-cache anyway). I prefer the
> "ignore known harmless errors" approach, possibly because I am often the
> one debugging. ;) If ECONNABORTED is a harmless error we see in
> practice, I don't mind adding it to the list (under the same rationale
> as the current ECONNRESET that is there).

With the issue of race clearly explained, I agree that this would be
the best solution for this issue.

Thanks.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: t0301-credential-cache test failure on cygwin
  2022-07-07 15:17   ` Ramsay Jones
  2022-07-07 18:19     ` Jeff King
@ 2022-07-07 18:29     ` Jeff King
  2022-07-07 19:14       ` Ramsay Jones
  2022-07-11  7:49     ` Adam Dinwoodie
  2 siblings, 1 reply; 15+ messages in thread
From: Jeff King @ 2022-07-07 18:29 UTC (permalink / raw)
  To: Ramsay Jones
  Cc: Junio C Hamano, GIT Mailing-list, Adam Dinwoodie, Johannes Schindelin

On Thu, Jul 07, 2022 at 04:17:08PM +0100, Ramsay Jones wrote:

> Also, I would like to understand why the code is written as it is
> currently. I'm sure there must be a good reason - I just don't know
> what it is! I suspect (ie I'm guessing), it has something to do with
> operating in a high contention context [TOCTOU on socket?] ... dunno. ;-)

By the way, I was slightly surprised you did not find the explanation in
the commit history. A blame[1] of credential-cache--daemon.c shows that
the comment was added by 7d5e9c9849 (credential-cache--daemon: clarify
"exit" action semantics, 2016-03-18) which mentions the race in the
tests. And then searching for that commit message in the list yields the
thread I linked earlier with more context[2].

I mention this not as a criticism, because your digging for backstory
was otherwise quite thorough. It's only _because_ it was so thorough
that I was surprised you didn't find that commit. ;) So I offer it only
as a suggestion for future digging.

-Peff

[1] Likewise this works:

       git log --follow -S'important that we clean up' builtin/credential-cache--daemon.c

    but the --follow is necessary because of the rename when it became a
    builtin. It's cool that "blame" handles this seamlessly. :)

[2] Finding the original patch on the list is my go-to trick when a
    commit message hasn't sufficiently explained things. I know Junio
    was for a while (is still?) kept a git-notes mapping of commits to
    emails. In practice, I usually just do a manual search for a few
    unique-looking phrases from the commit message. That should be
    pretty easy with lore these days, though I do it with a local
    archive indexed by notmuch.

-Peff

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: t0301-credential-cache test failure on cygwin
  2022-07-07 18:29     ` Jeff King
@ 2022-07-07 19:14       ` Ramsay Jones
  0 siblings, 0 replies; 15+ messages in thread
From: Ramsay Jones @ 2022-07-07 19:14 UTC (permalink / raw)
  To: Jeff King
  Cc: Junio C Hamano, GIT Mailing-list, Adam Dinwoodie, Johannes Schindelin



On 07/07/2022 19:29, Jeff King wrote:
> On Thu, Jul 07, 2022 at 04:17:08PM +0100, Ramsay Jones wrote:
> 
>> Also, I would like to understand why the code is written as it is
>> currently. I'm sure there must be a good reason - I just don't know
>> what it is! I suspect (ie I'm guessing), it has something to do with
>> operating in a high contention context [TOCTOU on socket?] ... dunno. ;-)
> 
> By the way, I was slightly surprised you did not find the explanation in
> the commit history. A blame[1] of credential-cache--daemon.c shows that
> the comment was added by 7d5e9c9849 (credential-cache--daemon: clarify
> "exit" action semantics, 2016-03-18) which mentions the race in the
> tests. And then searching for that commit message in the list yields the
> thread I linked earlier with more context[2].

Heh, just 10min before I read your previous email (er, actually your
previous, previous email) I found commit 7d5e9c9849 (credential-cache--daemon:
clarify "exit" action semantics, 2016-03-18). I hadn't had time to dig
any further yet (as usual I'm trying to do 3 things at the same time).

What can I say? It was nearly 3am and I wanted to go bed. It took nearly
50min just to write the email. This was just a _quick_ look remember! ;-)

Sorry!

ATB,
Ramsay Jones



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: t0301-credential-cache test failure on cygwin
  2022-07-07 18:12 ` Jeff King
  2022-07-07 18:26   ` Junio C Hamano
@ 2022-07-07 19:59   ` Ramsay Jones
  1 sibling, 0 replies; 15+ messages in thread
From: Ramsay Jones @ 2022-07-07 19:59 UTC (permalink / raw)
  To: Jeff King
  Cc: GIT Mailing-list, Adam Dinwoodie, Johannes Schindelin, Junio C Hamano



On 07/07/2022 19:12, Jeff King wrote:
> On Thu, Jul 07, 2022 at 02:50:21AM +0100, Ramsay Jones wrote:
> 
>> Having deleted the above patch, I now had a look at the server side. Tracing
>> out the server execution showed no surprises - everything progressed as one
>> would expect and it 'exit(0)'-ed correctly! The relevant part of the code to
>> process a client request (in the serve_one_client() function, lines 132-142
>> in builtin/credential-cache--daemon.c) looks like:
>>
>> 	else if (!strcmp(action.buf, "exit")) {
>> 		/*
>> 		 * It's important that we clean up our socket first, and then
>> 		 * signal the client only once we have finished the cleanup.
>> 		 * Calling exit() directly does this, because we clean up in
>> 		 * our atexit() handler, and then signal the client when our
>> 		 * process actually ends, which closes the socket and gives
>> 		 * them EOF.
>> 		 */
>> 		exit(0);
>> 	}
>>
>> Now, the comment doesn't make clear to me why "it's important that we clean
>> up our socket first" and, indeed, whether 'socket' refers to the socket
>> descriptor or the socket file. In the past, all of my unix-stream-socket
>> servers have closed the socket descriptor and then unlink()-ed the socket
>> file before exit(), with no 'atexit' calls in sight (lightly cribbed from a
>> 30+ years old Unix Network programming book by Stevens - or was it the Comer
>> book - or maybe the Comer and Stevens book - I forget!).
> 
> That comment refers to the socket file. If we close the handle to the
> client before we clean up the socket file, then the client may finish
> while the socket file is still there. So anybody expecting that:
> 
>   git credential-cache exit
> 
> is a sequencing operation will be fooled. One obvious thing is:
> 
>   git credential-cache exit
>   git credential-cache store <some-cred
> 
> which is now racy; the second command may try to contact the socket for
> the exiting daemon. It might actually handle that gracefully (because
> the server wouldn't actually accept()) but I didn't check. But another
> example, and the one that motivated that comment is:
> 
>   git credential-cache exit
>   test_path_is_missing $HOME/.git-credential-cache/socket
> 
> which is exactly what our tests do. ;) See the discussion around here:
> 
>   https://lore.kernel.org/git/20160318061201.GA28102@sigill.intra.peff.net/

I have now read (much) of that thread and it makes sense now. (Yes, I probably
should have postponed sending the email until after researching some more today;
lesson learned).

[snip]
>> So, we now have three patches which 'fix' the issue. What does this tell us?
>> Well, not an awful lot! ;-)
> 
> Of the three, I actually like the client-side one to check errno the
> best. The client is mostly "best effort". If it can't talk to the daemon
> for whatever reason, then it becomes a noop (there is nothing it can
> retrieve from the cache, and if it's trying to write, then oh well, the
> cached value was immediately expired!).
> 
> So one could argue that _every_ read error should be silently ignored.
> Calling die_errno() is mostly a nicety for debugging a broken setup, but
> in normal use, the outcome is the same either way (and Git will
> certainly ignore the exit code credential-cache anyway). I prefer the
> "ignore known harmless errors" approach, possibly because I am often the
> one debugging. ;) If ECONNABORTED is a harmless error we see in
> practice, I don't mind adding it to the list (under the same rationale
> as the current ECONNRESET that is there).

Yes, I was going to ask about ECONNRESET ... heh, no I'm kidding! :)

Yeah, if we can't determine the reason for cygwin changing behaviour
here (and fix it in cygwin), then this is probably the simplest solution.

ATB,
Ramsay Jones



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: t0301-credential-cache test failure on cygwin
  2022-07-07 15:17   ` Ramsay Jones
  2022-07-07 18:19     ` Jeff King
  2022-07-07 18:29     ` Jeff King
@ 2022-07-11  7:49     ` Adam Dinwoodie
  2022-07-11 13:39       ` Adam Dinwoodie
  2 siblings, 1 reply; 15+ messages in thread
From: Adam Dinwoodie @ 2022-07-11  7:49 UTC (permalink / raw)
  To: Ramsay Jones
  Cc: Junio C Hamano, GIT Mailing-list, Johannes Schindelin, Jeff King

On Thu, 7 Jul 2022 at 16:17, Ramsay Jones <ramsay@ramsayjones.plus.com> wrote:
> [
> Jeff: sorry for not CC:-ing you on the original email - I had intended
> to do just that, but forgot! :(
>
> See: https://lore.kernel.org/git/9dc3e85f-a532-6cff-de11-1dfb2e4bc6b6@ramsayjones.plus.com/
> ]
>
> On 07/07/2022 07:15, Junio C Hamano wrote:
> > Ramsay Jones <ramsay@ramsayjones.plus.com> writes:
> >
> >> However, I had some time to kill tonight, so I decided to take a _quick_ look
> >> to see if there was something that could be done ... (famous last words).
> >> ...
> >>   diff --git a/builtin/credential-cache.c b/builtin/credential-cache.c
> >>   index 78c02ad531..84fd513c62 100644
> >>   --- a/builtin/credential-cache.c
> >>   +++ b/builtin/credential-cache.c
> >>   @@ -27,7 +27,7 @@ static int connection_fatally_broken(int error)
> >>  
> >>    static int connection_closed(int error)
> >>    {
> >>   -  return (error == ECONNRESET);
> >>   +  return (error == ECONNRESET) || (error == ECONNABORTED);
> >>    }
> >
> > This feels like papering over the problem.
>
> Agreed, ... which is what I really meant by "(Well, it side-steps the
> problem, really)."
>
> >> Having noticed that the 'timeout' test was not failing, I decided to try
> >> making the 'action=exit' code-path behave more like the timeout code, as
> >> far as exiting the server is concerned. Indeed, you might ask why the
> >> timeout code doesn't just 'exit(0)' as well ...
> >>
> >> Anyway, the following patch does that, and it also provides a 'fix' for this
> >> issue!
> >
> > If this codepath was written like this (i.e. [PATCH 1C]) from the
> > beginning, I would have found it very sensible (i.e. instead of
> > caling exit() in the middle of the infinite client serving loop,
> > exiting the loop cleanly is easier to follow and maintain), even if
> > we didn't know the issue on Cygwin you investigated.
>
> Yep, apart from the variable name, I quite like the approach taken by
> the 1C patch.
>
> All three of these patches were really just "showing my working" and
> allowing anyone to "follow along" without the hassle of trying to
> scrape the diffs from the email.
>
> As I said, I don't think we can determine a suitable fix without first
> finding the cygwin commit which caused this test failure. But if we
> can't determine this, for whatever reason, then I would favour a patch
> to git based on the 1C patch. (Writing the commit message to justify the
> change, without mentioning this cygwin issue, may be more challenging! :)

I've been trying to dig into this; I've essentially never played with
the code for Cygwin itself until now, but I suspect I'm probably one of
the best-placed folks to actually do that investigation.

Unfortunately, I've gone as far back as 18 December 2020 in the code for
the Cygwin DLL itself, and I'm still seeing t0301 failing in exactly the
same way.  There's a few possible explanations for that, but my guess is
either (a) the issue isn't in the Cygwin DLL itself but in some other
library that was updated around the same time, or (b) I'm not managing
as clean a build as I'm aiming for, and my builds of the old Cygwin
commits are being polluted by something in my current environment.

Either way I think I can make progress: my next step is to (temporarily)
give up on bisecting by commit in the repository that tracks the Cygwin
DLL, and instead bisect by time using the Cygwin Time Machine, which
should let me get an entire Cygwin environment as it would have been at
some point in the past.

> Also, I would like to understand why the code is written as it is
> currently. I'm sure there must be a good reason - I just don't know
> what it is! I suspect (ie I'm guessing), it has something to do with
> operating in a high contention context [TOCTOU on socket?] ... dunno. ;-)
>
> ATB,
> Ramsay Jones

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: t0301-credential-cache test failure on cygwin
  2022-07-11  7:49     ` Adam Dinwoodie
@ 2022-07-11 13:39       ` Adam Dinwoodie
  2022-07-11 14:56         ` Ramsay Jones
  0 siblings, 1 reply; 15+ messages in thread
From: Adam Dinwoodie @ 2022-07-11 13:39 UTC (permalink / raw)
  To: Ramsay Jones
  Cc: Junio C Hamano, GIT Mailing-list, Johannes Schindelin, Jeff King

On Mon, 11 Jul 2022 at 08:50, Adam Dinwoodie <adam@dinwoodie.org> wrote:
>
> On Thu, 7 Jul 2022 at 16:17, Ramsay Jones <ramsay@ramsayjones.plus.com> wrote:
> > [
> > Jeff: sorry for not CC:-ing you on the original email - I had intended
> > to do just that, but forgot! :(
> >
> > See: https://lore.kernel.org/git/9dc3e85f-a532-6cff-de11-1dfb2e4bc6b6@ramsayjones.plus.com/
> > ]
> >
> > On 07/07/2022 07:15, Junio C Hamano wrote:
> > > Ramsay Jones <ramsay@ramsayjones.plus.com> writes:
> > >
> > >> However, I had some time to kill tonight, so I decided to take a _quick_ look
> > >> to see if there was something that could be done ... (famous last words).
> > >> ...
> > >>   diff --git a/builtin/credential-cache.c b/builtin/credential-cache.c
> > >>   index 78c02ad531..84fd513c62 100644
> > >>   --- a/builtin/credential-cache.c
> > >>   +++ b/builtin/credential-cache.c
> > >>   @@ -27,7 +27,7 @@ static int connection_fatally_broken(int error)
> > >>
> > >>    static int connection_closed(int error)
> > >>    {
> > >>   -  return (error == ECONNRESET);
> > >>   +  return (error == ECONNRESET) || (error == ECONNABORTED);
> > >>    }
> > >
> > > This feels like papering over the problem.
> >
> > Agreed, ... which is what I really meant by "(Well, it side-steps the
> > problem, really)."
> >
> > >> Having noticed that the 'timeout' test was not failing, I decided to try
> > >> making the 'action=exit' code-path behave more like the timeout code, as
> > >> far as exiting the server is concerned. Indeed, you might ask why the
> > >> timeout code doesn't just 'exit(0)' as well ...
> > >>
> > >> Anyway, the following patch does that, and it also provides a 'fix' for this
> > >> issue!
> > >
> > > If this codepath was written like this (i.e. [PATCH 1C]) from the
> > > beginning, I would have found it very sensible (i.e. instead of
> > > caling exit() in the middle of the infinite client serving loop,
> > > exiting the loop cleanly is easier to follow and maintain), even if
> > > we didn't know the issue on Cygwin you investigated.
> >
> > Yep, apart from the variable name, I quite like the approach taken by
> > the 1C patch.
> >
> > All three of these patches were really just "showing my working" and
> > allowing anyone to "follow along" without the hassle of trying to
> > scrape the diffs from the email.
> >
> > As I said, I don't think we can determine a suitable fix without first
> > finding the cygwin commit which caused this test failure. But if we
> > can't determine this, for whatever reason, then I would favour a patch
> > to git based on the 1C patch. (Writing the commit message to justify the
> > change, without mentioning this cygwin issue, may be more challenging! :)
>
> I've been trying to dig into this; I've essentially never played with
> the code for Cygwin itself until now, but I suspect I'm probably one of
> the best-placed folks to actually do that investigation.
>
> Unfortunately, I've gone as far back as 18 December 2020 in the code for
> the Cygwin DLL itself, and I'm still seeing t0301 failing in exactly the
> same way.  There's a few possible explanations for that, but my guess is
> either (a) the issue isn't in the Cygwin DLL itself but in some other
> library that was updated around the same time, or (b) I'm not managing
> as clean a build as I'm aiming for, and my builds of the old Cygwin
> commits are being polluted by something in my current environment.
>
> Either way I think I can make progress: my next step is to (temporarily)
> give up on bisecting by commit in the repository that tracks the Cygwin
> DLL, and instead bisect by time using the Cygwin Time Machine, which
> should let me get an entire Cygwin environment as it would have been at
> some point in the past.

Minor progress update: I've now confirmed the failure was introduced by
a change in the Cygwin library between the binaries for Cygwin versions
3.2.0-1 and 3.3.1-1. Specifically, the test passes with Cygwin from the
27 October 2021 package archive[0], and fails with Cygwin from the 28
October 2021 archive[1], and the only difference between the two that
has any chance of being relevant is that bump in the Cygwin release.

Having confirmed that, I'll go back to trying to get the Cygwin builds
to work for me, so I can bisect the commit history between those
releases.

[0]: http://ctm.crouchingtigerhiddenfruitbat.org/pub/cygwin/circa/64bit/2021/10/27/155642/index.html
[1]: http://ctm.crouchingtigerhiddenfruitbat.org/pub/cygwin/circa/64bit/2021/10/28/174906/index.html

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: t0301-credential-cache test failure on cygwin
  2022-07-11 13:39       ` Adam Dinwoodie
@ 2022-07-11 14:56         ` Ramsay Jones
  2022-07-13 14:42           ` Adam Dinwoodie
  0 siblings, 1 reply; 15+ messages in thread
From: Ramsay Jones @ 2022-07-11 14:56 UTC (permalink / raw)
  To: Adam Dinwoodie
  Cc: Junio C Hamano, GIT Mailing-list, Johannes Schindelin, Jeff King



On 11/07/2022 14:39, Adam Dinwoodie wrote:
[snip]

> 
> Minor progress update: I've now confirmed the failure was introduced by
> a change in the Cygwin library between the binaries for Cygwin versions
> 3.2.0-1 and 3.3.1-1. Specifically, the test passes with Cygwin from the
> 27 October 2021 package archive[0], and fails with Cygwin from the 28
> October 2021 archive[1], and the only difference between the two that
> has any chance of being relevant is that bump in the Cygwin release.

Heh, I was just about to email you with similar news! I had a look at
my setup.log to see what I actually updated (and from what) and the
only thing that seemed to make sense was an update of the cygwin .dll
from v3.2.0-1 to v3.3.2-1 (I will add below an extract from my setup.log
for that day, in case you see anything else of interest).

[The previous entry was for 25th October 2021 and had some more interesting
entries, like binutils, libcurl, etc., but I don't think it is relevant]

Sorry I didn't think to look sooner.

Happy Hunting!

ATB,
Ramsay Jones

--- >8 ---

[snip]
2021/11/09 02:52:01 Augmented Transaction List:
2021/11/09 02:52:01    0 install cygwin               3.3.2-1  
2021/11/09 02:52:01    1   erase cygwin               3.2.0-1  
2021/11/09 02:52:01    2 install cygwin-debuginfo     3.3.2-1  
2021/11/09 02:52:01    3   erase cygwin-debuginfo     3.2.0-1  
2021/11/09 02:52:01    4 install cygwin-devel         3.3.2-1  
2021/11/09 02:52:01    5   erase cygwin-devel         3.2.0-1  
2021/11/09 02:52:01    6 install cygwin-doc           3.3.2-1  
2021/11/09 02:52:01    7   erase cygwin-doc           3.2.0-1  
2021/11/09 02:52:01    8 install perl-DateTime-Locale 1.33-1   
2021/11/09 02:52:01    9   erase perl-DateTime-Locale 1.32-1   
2021/11/09 02:52:01   10 install perl-URI             5.10-1   
2021/11/09 02:52:01   11   erase perl-URI             5.09-1   
2021/11/09 02:52:01   12 install libpcre2_8_0         10.39-1  
2021/11/09 02:52:01   13   erase libpcre2_8_0         10.38-1  
2021/11/09 02:52:01   14 install libpcre2_16_0        10.39-1  
2021/11/09 02:52:01   15   erase libpcre2_16_0        10.38-1  
2021/11/09 02:52:01   16 install libopenldap2_5_0     2.5.9-1  
2021/11/09 02:52:01   17   erase libopenldap2_5_0     2.5.8-1  
2021/11/09 02:52:01   18 install libidn12             1.38-1   
2021/11/09 02:52:01   19 install libarchive13         3.5.2-2  
2021/11/09 02:52:01   20   erase libarchive13         3.5.2-1  
2021/11/09 02:52:01   21 install git                  2.33.1-1 
2021/11/09 02:52:01   22   erase git                  2.33.0-1 
2021/11/09 02:52:01   23 install gawk                 5.1.1-1  
2021/11/09 02:52:01   24   erase gawk                 5.1.0-1  
2021/11/09 02:52:01   25 install perl-libwww-perl     6.58-1   
2021/11/09 02:52:01   26   erase perl-libwww-perl     6.57-1   
2021/11/09 02:52:01   27 install libpcre2-posix3      10.39-1  
2021/11/09 02:52:01   28   erase libpcre2-posix3      10.38-1  
2021/11/09 02:52:01   29 install lynx                 2.8.9-13 
2021/11/09 02:52:01   30   erase lynx                 2.8.7-2  
2021/11/09 02:52:01   31 install gitk                 2.33.1-1 
2021/11/09 02:52:01   32   erase gitk                 2.33.0-1 
2021/11/09 02:52:01   33 install git-email            2.33.1-1 
2021/11/09 02:52:01   34   erase git-email            2.33.0-1 
2021/11/09 02:52:01   35 install git-gui              2.33.1-1 
2021/11/09 02:52:01   36   erase git-gui              2.33.0-1 
[snip]
2021/11/09 02:52:19 Uninstalling cygwin
2021/11/09 02:52:19 Uninstalling cygwin-debuginfo
2021/11/09 02:52:20 Uninstalling cygwin-devel
2021/11/09 02:52:20 Uninstalling cygwin-doc
2021/11/09 02:52:26 Uninstalling perl-DateTime-Locale
2021/11/09 02:52:28 Uninstalling perl-URI
2021/11/09 02:52:29 Uninstalling libpcre2_8_0
2021/11/09 02:52:29 Uninstalling libpcre2_16_0
2021/11/09 02:52:29 Uninstalling libopenldap2_5_0
2021/11/09 02:52:29 Uninstalling libarchive13
2021/11/09 02:52:29 Uninstalling git
2021/11/09 02:52:30 Uninstalling gawk
2021/11/09 02:52:30 Uninstalling perl-libwww-perl
2021/11/09 02:52:31 Uninstalling libpcre2-posix3
2021/11/09 02:52:31 Uninstalling lynx
2021/11/09 02:52:31 Uninstalling gitk
2021/11/09 02:52:31 Uninstalling git-email
2021/11/09 02:52:31 Uninstalling git-gui
[snip]
2021/11/09 02:55:50 Changing gid to Administrators
2021/11/09 02:55:54 note: In-use files have been replaced. You need to reboot as soon as possible to activate the new versions. Cygwin may operate incorrectly until you reboot.
2021/11/09 02:55:54 Ending cygwin install

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: t0301-credential-cache test failure on cygwin
  2022-07-11 14:56         ` Ramsay Jones
@ 2022-07-13 14:42           ` Adam Dinwoodie
  2022-07-13 19:16             ` Jeff King
  0 siblings, 1 reply; 15+ messages in thread
From: Adam Dinwoodie @ 2022-07-13 14:42 UTC (permalink / raw)
  To: Ramsay Jones
  Cc: Junio C Hamano, GIT Mailing-list, Johannes Schindelin, Jeff King

On Mon, 11 Jul 2022 at 15:56, Ramsay Jones <ramsay@ramsayjones.plus.com> wrote:
> On 11/07/2022 14:39, Adam Dinwoodie wrote:
> [snip]
>
> >
> > Minor progress update: I've now confirmed the failure was introduced by
> > a change in the Cygwin library between the binaries for Cygwin versions
> > 3.2.0-1 and 3.3.1-1. Specifically, the test passes with Cygwin from the
> > 27 October 2021 package archive[0], and fails with Cygwin from the 28
> > October 2021 archive[1], and the only difference between the two that
> > has any chance of being relevant is that bump in the Cygwin release.
>
> Heh, I was just about to email you with similar news! I had a look at
> my setup.log to see what I actually updated (and from what) and the
> only thing that seemed to make sense was an update of the cygwin .dll
> from v3.2.0-1 to v3.3.2-1 (I will add below an extract from my setup.log
> for that day, in case you see anything else of interest).

Having now spent far more time than I'd like wrangling the Cygwin build
infrastructure, I've found the change in the Cygwin code that introduced
the break. But I'm afraid progressing beyond this -- and in particular
making any sort of judgement about appropriate next steps -- is beyond
what I'm going to have time for in the foreseeable future.

(On the plus side, this has given me the kick to actually work out how
to do this sort of investigation, so if and when I get to investigating
the other test failures that also seem to be caused by changes in the
Cygwin environment, I now have a much better idea what I'm doing!)

Relevant commit is below:

https://cygwin.com/git/?p=newlib-cygwin.git;a=commitdiff;h=ef95c03522f65d5956a8dc82d869c6bc378ef3f9

commit ef95c03522f65d5956a8dc82d869c6bc378ef3f9 (HEAD, refs/bisect/bad)
Author: Corinna Vinschen <corinna@vinschen.de>
Date:   Tue Apr 6 21:35:43 2021 +0200

    Cygwin: select: Fix FD_CLOSE handling
    
    An FD_CLOSE event sets a socket descriptor ready for writing.
    This is incorrect if the FD_CLOSE is a result of shutdown(SHUT_RD).
    Only set the socket descriptor ready for writing if the FD_CLOSE
    is indicating an connection abort or reset error condition.
    
    This requires to tweak fhandler_socket_wsock::evaluate_events.
    FD_CLOSE in conjunction with FD_ACCEPT/FD_CONNECT special cases
    a shutdown condition by setting an error code.  This is correct
    for accept/connect, but not for select.  In this case, make sure
    to return with an error code only if FD_CLOSE indicates a
    connection error.
    
    Signed-off-by: Corinna Vinschen <corinna@vinschen.de>

diff --git a/winsup/cygwin/fhandler_socket_inet.cc b/winsup/cygwin/fhandler_socket_inet.cc
index bc08d3cf1..4ecb31a27 100644
--- a/winsup/cygwin/fhandler_socket_inet.cc
+++ b/winsup/cygwin/fhandler_socket_inet.cc
@@ -361,20 +361,30 @@ fhandler_socket_wsock::evaluate_events (const long event_mask, long &events,
 	  wsock_events->events |= FD_WRITE;
 	  wsock_events->connect_errorcode = 0;
 	}
-      /* This test makes accept/connect behave as on Linux when accept/connect
-         is called on a socket for which shutdown has been called.  The second
-	 half of this code is in the shutdown method. */
       if (events & FD_CLOSE)
 	{
-	  if ((event_mask & FD_ACCEPT) && saw_shutdown_read ())
+	  if (evts.iErrorCode[FD_CLOSE_BIT])
 	    {
-	      WSASetLastError (WSAEINVAL);
+	      WSASetLastError (evts.iErrorCode[FD_CLOSE_BIT]);
 	      ret = SOCKET_ERROR;
 	    }
-	  if (event_mask & FD_CONNECT)
+	  /* This test makes accept/connect behave as on Linux when accept/
+	     connect is called on a socket for which shutdown has been called.
+	     The second half of this code is in the shutdown method.  Note that
+	     we only do this when called from accept/connect, not from select.
+	     In this case erase == false, just as with read (MSG_PEEK). */
+	  if (erase)
 	    {
-	      WSASetLastError (WSAECONNRESET);
-	      ret = SOCKET_ERROR;
+	      if ((event_mask & FD_ACCEPT) && saw_shutdown_read ())
+		{
+		  WSASetLastError (WSAEINVAL);
+		  ret = SOCKET_ERROR;
+		}
+	      if (event_mask & FD_CONNECT)
+		{
+		  WSASetLastError (WSAECONNRESET);
+		  ret = SOCKET_ERROR;
+		}
 	    }
 	}
       if (erase)
diff --git a/winsup/cygwin/select.cc b/winsup/cygwin/select.cc
index 956cd9bc1..b493ccc11 100644
--- a/winsup/cygwin/select.cc
+++ b/winsup/cygwin/select.cc
@@ -1709,15 +1709,18 @@ peek_socket (select_record *me, bool)
   fhandler_socket_wsock *fh = (fhandler_socket_wsock *) me->fh;
   long events;
   /* Don't play with the settings again, unless having taken a deep look into
-     Richard W. Stevens Network Programming book.  Thank you. */
+     Richard W. Stevens Network Programming book and how these flags are
+     defined in Winsock.  Thank you. */
   long evt_mask = (me->read_selected ? (FD_READ | FD_ACCEPT | FD_CLOSE) : 0)
 		| (me->write_selected ? (FD_WRITE | FD_CONNECT | FD_CLOSE) : 0)
 		| (me->except_selected ? FD_OOB : 0);
   int ret = fh->evaluate_events (evt_mask, events, false);
   if (me->read_selected)
     me->read_ready |= ret || !!(events & (FD_READ | FD_ACCEPT | FD_CLOSE));
   if (me->write_selected)
-    me->write_ready |= ret || !!(events & (FD_WRITE | FD_CONNECT | FD_CLOSE));
+    /* Don't check for FD_CLOSE here.  Only an error case (ret == -1)
+       will set ready for writing. */
+    me->write_ready |= ret || !!(events & (FD_WRITE | FD_CONNECT));
   if (me->except_selected)
     me->except_ready |= !!(events & FD_OOB);
 

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: t0301-credential-cache test failure on cygwin
  2022-07-13 14:42           ` Adam Dinwoodie
@ 2022-07-13 19:16             ` Jeff King
  2022-07-13 20:35               ` Ramsay Jones
  0 siblings, 1 reply; 15+ messages in thread
From: Jeff King @ 2022-07-13 19:16 UTC (permalink / raw)
  To: Adam Dinwoodie
  Cc: Ramsay Jones, Junio C Hamano, GIT Mailing-list, Johannes Schindelin

On Wed, Jul 13, 2022 at 03:42:38PM +0100, Adam Dinwoodie wrote:

> Having now spent far more time than I'd like wrangling the Cygwin build
> infrastructure, I've found the change in the Cygwin code that introduced
> the break. But I'm afraid progressing beyond this -- and in particular
> making any sort of judgement about appropriate next steps -- is beyond
> what I'm going to have time for in the foreseeable future.

Hmm. So I had assumed that the problem was unlink()ing the socket path
while the client was still trying to read(). If that's the case, then I
_think_ the minimal reproduction below should also trigger the problem.
That might give you something more useful to show to Cygwin folks.

But...

> commit ef95c03522f65d5956a8dc82d869c6bc378ef3f9 (HEAD, refs/bisect/bad)
> Author: Corinna Vinschen <corinna@vinschen.de>
> Date:   Tue Apr 6 21:35:43 2021 +0200
> 
>     Cygwin: select: Fix FD_CLOSE handling
>     
>     An FD_CLOSE event sets a socket descriptor ready for writing.
>     This is incorrect if the FD_CLOSE is a result of shutdown(SHUT_RD).
>     Only set the socket descriptor ready for writing if the FD_CLOSE
>     is indicating an connection abort or reset error condition.
>     
>     This requires to tweak fhandler_socket_wsock::evaluate_events.
>     FD_CLOSE in conjunction with FD_ACCEPT/FD_CONNECT special cases
>     a shutdown condition by setting an error code.  This is correct
>     for accept/connect, but not for select.  In this case, make sure
>     to return with an error code only if FD_CLOSE indicates a
>     connection error.

...this is about select outcomes. Which makes me think two things
(though these are really pretty blind guesses, as I'm not at all
familiar with Cygwin's socket code):

  - it may be that this situation was always ECONNABORTED (it's not on
    Linux, but perhaps due to details of the underlying socket code on
    Windows, it's different there). And this patch is just surfacing
    that state to the caller better.

  - it could be unrelated to the unlink() entirely. The socket code in
    Git dups the client descriptor and opens two FILE* handles, one for
    reading and one for writing. Could it be that it's important to
    fclose() one before the other (and the implicit closing done by
    exit() does the wrong order)? It seems like a stretch, but this
    commit message is talking about a shutdown(SHUT_RD), which is kind
    of what you get by fclose-ing one side (though again, not on Linux,
    because the "FILE*" are aware that they are read/write, but the
    underlying descriptors aren't).

Or maybe those are just totally off track. I know you said you don't
have time to dig further, and that's fine. But if you or anybody has a
chance to try the program below, it might be interesting to see the
result (it would confirm whether it's the unlink() that's the problem).

-- >8 --
#include <stdio.h>
#include <stdlib.h>

#include <unistd.h>
#include <sys/socket.h>
#include <sys/un.h>

#define SOCKET_PATH "mysocket"

static void diesys(const char *syscall)
{
	perror(syscall);
	exit(1);
}

static int create_server(void)
{
	int fd;
	struct sockaddr_un sa = {
		.sun_family = AF_UNIX,
		.sun_path = SOCKET_PATH,
	};

	fd = socket(AF_UNIX, SOCK_STREAM, 0);
	if (fd < 0)
		diesys("socket");
	if (bind(fd, (struct sockaddr *)&sa, sizeof(sa)) < 0)
		diesys("bind");
	if (listen(fd, 5) < 0)
		diesys("listen");

	return fd;
}

static void do_server(int listen_fd)
{
	int client_fd = accept(listen_fd, NULL, NULL);
	if (client_fd < 0)
		diesys("accept");
	if (unlink(SOCKET_PATH) < 0)
		diesys("unlink");
	if (close(client_fd) < 0)
		diesys("close(client)");
	close(listen_fd);
}

static void do_client(void)
{
	int fd;
	char buf[64];
	struct sockaddr_un sa = {
		.sun_family = AF_UNIX,
		.sun_path = SOCKET_PATH,
	};

	fd = socket(AF_UNIX, SOCK_STREAM, 0);
	if (fd < 0)
		diesys("socket");
	if (connect(fd, (struct sockaddr *)&sa, sizeof(sa)) < 0)
		diesys("connect");
	if (read(fd, buf, sizeof(buf)) < 0)
		diesys("read");
	close(fd);
}

int main(void)
{
	int fd = create_server();

	if (fork()) {
		do_server(fd);
	} else {
		close(fd);
		do_client();
	}
	return 0;
}

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: t0301-credential-cache test failure on cygwin
  2022-07-13 19:16             ` Jeff King
@ 2022-07-13 20:35               ` Ramsay Jones
  0 siblings, 0 replies; 15+ messages in thread
From: Ramsay Jones @ 2022-07-13 20:35 UTC (permalink / raw)
  To: Jeff King, Adam Dinwoodie
  Cc: Junio C Hamano, GIT Mailing-list, Johannes Schindelin



On 13/07/2022 20:16, Jeff King wrote:
> On Wed, Jul 13, 2022 at 03:42:38PM +0100, Adam Dinwoodie wrote:
> 
> Hmm. So I had assumed that the problem was unlink()ing the socket path
> while the client was still trying to read(). If that's the case, then I
> _think_ the minimal reproduction below should also trigger the problem.
> That might give you something more useful to show to Cygwin folks.

Hmm, good find - I looked at this commit while searching the gitweb pages the
other night, but didn't think it was relevant! :( Oh, well ...

I don't have too much time tonight, but I gave your program a quick try:

  $ gcc -o test-unix-sock test-unix-sock.c
  $ ./test-unix-sock
  $ ls -l my*
  ls: cannot access 'my*': No such file or directory
  $ ./test-unix-sock
  $ echo $?
  0
  $ 

HTH.

ATB,
Ramsay Jones



^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2022-07-13 20:35 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-07-07  1:50 t0301-credential-cache test failure on cygwin Ramsay Jones
2022-07-07  6:15 ` Junio C Hamano
2022-07-07 15:17   ` Ramsay Jones
2022-07-07 18:19     ` Jeff King
2022-07-07 18:29     ` Jeff King
2022-07-07 19:14       ` Ramsay Jones
2022-07-11  7:49     ` Adam Dinwoodie
2022-07-11 13:39       ` Adam Dinwoodie
2022-07-11 14:56         ` Ramsay Jones
2022-07-13 14:42           ` Adam Dinwoodie
2022-07-13 19:16             ` Jeff King
2022-07-13 20:35               ` Ramsay Jones
2022-07-07 18:12 ` Jeff King
2022-07-07 18:26   ` Junio C Hamano
2022-07-07 19:59   ` Ramsay Jones

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.