linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* rsync over ssh on 2.4.2 to 2.2.18
@ 2001-02-27 10:02 Russell King
  2001-02-27 10:18 ` Rasmus Andersen
                   ` (4 more replies)
  0 siblings, 5 replies; 15+ messages in thread
From: Russell King @ 2001-02-27 10:02 UTC (permalink / raw)
  To: linux-kernel

Hi,

I'm seeing odd behaviour with rsync over ssh between two x86 machines -
one if the is an UP PIII (Katmai) running 2.4.2 (isdn-gw) and the other
is an UP Pentium 75-200 (pilt-gw) running 2.2.15pre13 with some custom
serial driver hacks (for running Amplicon cards with their ISA interrupt-
sharing scheme).

Please note: although I am using 2.2.15pre13, it is _not_ the cause of
this problem, and I don't particularly want to touch the 2.2.15pre13
machine at the present time, which has been happy for the past year or
so.

netstat on isdn-gw shows the following:

	Proto Recv-Q Send-Q Local Address           Foreign Address         State
	tcp    72868      0 isdn-gw.piltdown.a:1023 pilt-gw.piltdown.at:ssh ESTABLISHED

netstat on pilt-gw shows the following:

	Proto Recv-Q Send-Q Local Address           Foreign Address         State
	tcp        0  14372 pilt-gw.piltdown.at:ssh isdn-gw.piltdown.a:1023 ESTABLISHED

Looking on the network reveals:

	09:49:23.215852 P pilt-gw.ssh > isdn-gw.1023: . 1119640864:1119640864(0) ack 1175090336 win 31856 <nop,nop,timestamp 114834687 8389267> (DF) [tos 0x8]
	09:49:23.216095 P isdn-gw.1023 > pilt-gw.ssh: . 1:1(0) ack 1 win 0 <nop,nop,timestamp 8401267 114186976> (DF) [tos 0x8]

So isdn-gw is saying it can't accept more data, which ties up with what
netstat is indicating.  Note that this situation has persisted for around
12 hours with neither end moving forward.

When I straced the ssh process on isdn-gw this morning, things start moving,
and in this instance everything completed normally, even after I stopped
stracing ssh.  However, there are times when I have to leave strace running
to get the rsync to complete.

select(4, [3], [3], NULL, NULL)         = 2 (in [3], out [3])
read(3, "\0\0\20\t\215R\241\355\201_o\227\4\371\340+\333_\333\34"..., 8192) = 8192
write(3, "\370\325|\355\232\256\247Kf\202\v\272\334@\231\317]\322"..., 135940) = 24616

On this instance, it appears that sshd has gotten stuck in select().
Maybe there is a race condition or missing wakeup in the TCP code?

Note that this problem also exists on 2.4.0 (I upgraded from 2.4.0 to
2.4.2 yesterday to see if the behaviour has changed).

--
Russell King (rmk@arm.linux.org.uk)                The developer of ARM Linux
             http://www.arm.linux.org.uk/personal/aboutme.html


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: rsync over ssh on 2.4.2 to 2.2.18
  2001-02-27 10:02 rsync over ssh on 2.4.2 to 2.2.18 Russell King
@ 2001-02-27 10:18 ` Rasmus Andersen
  2001-02-27 10:58   ` bert hubert
  2001-02-27 10:49 ` David S. Miller
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 15+ messages in thread
From: Rasmus Andersen @ 2001-02-27 10:18 UTC (permalink / raw)
  To: Russell King; +Cc: linux-kernel

On Tue, Feb 27, 2001 at 10:02:11AM +0000, Russell King wrote:
> Hi,
> 
> I'm seeing odd behaviour with rsync over ssh between two x86 machines -
> one if the is an UP PIII (Katmai) running 2.4.2 (isdn-gw) and the other
> is an UP Pentium 75-200 (pilt-gw) running 2.2.15pre13 with some custom
> serial driver hacks (for running Amplicon cards with their ISA interrupt-
> sharing scheme).

[...]

I have seen behaviour similar to this (descibed in 
http://marc.theaimsgroup.com/?l=linux-kernel&m=98262067309185&w=2)
locally on a 2.4.X machine where the rsync completes but fails to
terminate, apparently because the child does not receive the KILLUSR1
(wild speculation)? Anyways, the parent process waits in wait4 and
the child loops, waiting for the signal. This is not reproducable
in 2.2.X (for me).

I have found that in order to reproduce this problem I need to
make rsync do a certain amount of work. Ie., rsync linux/drivers
new-dir will make rsync hang, but rsync linux/drivers/[a-p]* does
not. Does the same thing apply for you? stracing the rsync does
make it all work if I specify that rsync should trace the child,
too.


Regards,
  Rasmus

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: rsync over ssh on 2.4.2 to 2.2.18
  2001-02-27 10:02 rsync over ssh on 2.4.2 to 2.2.18 Russell King
  2001-02-27 10:18 ` Rasmus Andersen
@ 2001-02-27 10:49 ` David S. Miller
  2001-02-27 19:46   ` Russell King
  2001-02-27 13:18 ` Anton Blanchard
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 15+ messages in thread
From: David S. Miller @ 2001-02-27 10:49 UTC (permalink / raw)
  To: Russell King; +Cc: linux-kernel


Russell King writes:
 > Please note: although I am using 2.2.15pre13, it is _not_ the cause of
 > this problem

How do you know this?  There are so many deadly TCP bugs fixed
since 2.2.15pre13 I don't know how you can assert this.

Later,
David S. Miller
davem@redhat.com

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: rsync over ssh on 2.4.2 to 2.2.18
  2001-02-27 10:18 ` Rasmus Andersen
@ 2001-02-27 10:58   ` bert hubert
  0 siblings, 0 replies; 15+ messages in thread
From: bert hubert @ 2001-02-27 10:58 UTC (permalink / raw)
  To: linux-kernel

On Tue, Feb 27, 2001 at 11:18:39AM +0100, Rasmus Andersen wrote:

> terminate, apparently because the child does not receive the KILLUSR1
> (wild speculation)? Anyways, the parent process waits in wait4 and
> the child loops, waiting for the signal. This is not reproducable
> in 2.2.X (for me).

rsync problems of this type abound across several UNIX platforms - I had
problems like this on FreeBSD, but never under Linux. I suspect that rsync
bends or at least stretches certain limits.

Regards,

bert

-- 
http://www.PowerDNS.com      Versatile DNS Services  
Trilab                       The Technology People   
'SYN! .. SYN|ACK! .. ACK!' - the mating call of the internet

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: rsync over ssh on 2.4.2 to 2.2.18
  2001-02-27 10:02 rsync over ssh on 2.4.2 to 2.2.18 Russell King
  2001-02-27 10:18 ` Rasmus Andersen
  2001-02-27 10:49 ` David S. Miller
@ 2001-02-27 13:18 ` Anton Blanchard
  2001-02-27 16:48   ` Russell King
  2001-02-27 20:35 ` kuznet
  2001-02-27 23:01 ` Eugene Crosser
  4 siblings, 1 reply; 15+ messages in thread
From: Anton Blanchard @ 2001-02-27 13:18 UTC (permalink / raw)
  To: Russell King; +Cc: linux-kernel


Hi,

> I'm seeing odd behaviour with rsync over ssh between two x86 machines -
> one if the is an UP PIII (Katmai) running 2.4.2 (isdn-gw) and the other
> is an UP Pentium 75-200 (pilt-gw) running 2.2.15pre13 with some custom
> serial driver hacks (for running Amplicon cards with their ISA interrupt-
> sharing scheme).

What version of ssh are you using? Older versions would use blocking IO
which would result in deadlocks (and angry emails wrongly blaming rsync :)

Anton

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: rsync over ssh on 2.4.2 to 2.2.18
  2001-02-27 13:18 ` Anton Blanchard
@ 2001-02-27 16:48   ` Russell King
  0 siblings, 0 replies; 15+ messages in thread
From: Russell King @ 2001-02-27 16:48 UTC (permalink / raw)
  To: Anton Blanchard; +Cc: linux-kernel

Anton Blanchard writes:
> > I'm seeing odd behaviour with rsync over ssh between two x86 machines -
> > one if the is an UP PIII (Katmai) running 2.4.2 (isdn-gw) and the other
> > is an UP Pentium 75-200 (pilt-gw) running 2.2.15pre13 with some custom
> > serial driver hacks (for running Amplicon cards with their ISA interrupt-
> > sharing scheme).
> 
> What version of ssh are you using? Older versions would use blocking IO
> which would result in deadlocks (and angry emails wrongly blaming rsync :)

Note that I proved that it was sitting in select(), and therefore can't be
blocking.

--
Russell King (rmk@arm.linux.org.uk)                The developer of ARM Linux
             http://www.arm.linux.org.uk/personal/aboutme.html


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: rsync over ssh on 2.4.2 to 2.2.18
  2001-02-27 10:49 ` David S. Miller
@ 2001-02-27 19:46   ` Russell King
  0 siblings, 0 replies; 15+ messages in thread
From: Russell King @ 2001-02-27 19:46 UTC (permalink / raw)
  To: David S. Miller; +Cc: linux-kernel

David S. Miller writes:
> How do you know this?  There are so many deadly TCP bugs fixed
> since 2.2.15pre13 I don't know how you can assert this.

Through the evidence I gave.  

Firstly, if the receiving side is saying that it has a window of 0, then
it is not going to accept any more data.

Secondly, the receiving side has data waiting in the receive queue.

Thirdly, the receiving process is selecting on the socket, and dispite
there being data waiting, select is saying that there is no data waiting.

All the data is pointing at the 2.4 kernel as being the culprit.

I'm surprised at your response given the amount of hard evidence I gave that
supports my assertion.

--
Russell King (rmk@arm.linux.org.uk)                The developer of ARM Linux
             http://www.arm.linux.org.uk/personal/aboutme.html


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: rsync over ssh on 2.4.2 to 2.2.18
  2001-02-27 10:02 rsync over ssh on 2.4.2 to 2.2.18 Russell King
                   ` (2 preceding siblings ...)
  2001-02-27 13:18 ` Anton Blanchard
@ 2001-02-27 20:35 ` kuznet
  2001-02-27 20:43   ` Russell King
  2001-02-27 23:01 ` Eugene Crosser
  4 siblings, 1 reply; 15+ messages in thread
From: kuznet @ 2001-02-27 20:35 UTC (permalink / raw)
  To: Russell King; +Cc: linux-kernel

Hello!

> netstat on isdn-gw shows the following:
> 
> 	Proto Recv-Q Send-Q Local Address           Foreign Address         State
> 	tcp    72868      0 isdn-gw.piltdown.a:1023 pilt-gw.piltdown.at:ssh ESTABLISHED
plus
> select(4, [3], [3], NULL, NULL)         = 2 (in [3], out [3])


> Maybe there is a race condition or missing wakeup in the TCP code?

Moreover, even not _one_ wakeup is missing. At least two, because
wakeups in read and write are separate and you have stuck in both directions.
8)8)

Well, if it was one I would start to dig ground inside tcp instantly.
But as soon as two of them are missing, I have to suspect wake_up itself.
At least, we had such bugs there until 2.4.0.

Alexey

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: rsync over ssh on 2.4.2 to 2.2.18
  2001-02-27 20:35 ` kuznet
@ 2001-02-27 20:43   ` Russell King
  2001-02-28 20:27     ` kuznet
  0 siblings, 1 reply; 15+ messages in thread
From: Russell King @ 2001-02-27 20:43 UTC (permalink / raw)
  To: kuznet; +Cc: linux-kernel

kuznet@ms2.inr.ac.ru writes:
> Moreover, even not _one_ wakeup is missing. At least two, because
> wakeups in read and write are separate and you have stuck in both directions.
> 8)8)

I'll see if I can strace it from the start until it hangs tomorrow.

> Well, if it was one I would start to dig ground inside tcp instantly.
> But as soon as two of them are missing, I have to suspect wake_up itself.
> At least, we had such bugs there until 2.4.0.

I was running at one point a 2.4.0-test kernel, but I didn't see these
effects back then (but then I wasn't monitoring the system as closely as
I am at the moment).

--
Russell King (rmk@arm.linux.org.uk)                The developer of ARM Linux
             http://www.arm.linux.org.uk/personal/aboutme.html


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: rsync over ssh on 2.4.2 to 2.2.18
  2001-02-27 10:02 rsync over ssh on 2.4.2 to 2.2.18 Russell King
                   ` (3 preceding siblings ...)
  2001-02-27 20:35 ` kuznet
@ 2001-02-27 23:01 ` Eugene Crosser
  2001-02-28 20:20   ` kuznet
  4 siblings, 1 reply; 15+ messages in thread
From: Eugene Crosser @ 2001-02-27 23:01 UTC (permalink / raw)
  To: linux-kernel

In article <200102271002.f1RA2B408058@brick.arm.linux.org.uk>,
        Russell King <rmk@arm.linux.org.uk> writes:

> I'm seeing odd behaviour with rsync over ssh between two x86 machines -

I've seen hanging rsync over ssh more than once, while sending much data
from an x86 running Linux (late 2.3.x) to Sparc/Solaris2.5.1 over ssh
1.2.26.  I had a feeling that it was triggered by certain data patterns
because it often stopped at the same spot after restart (and therefore
I attributed it to a bug in rsync).  I did not investigate further.

Eugene

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: rsync over ssh on 2.4.2 to 2.2.18
  2001-02-27 23:01 ` Eugene Crosser
@ 2001-02-28 20:20   ` kuznet
  2001-03-19  7:37     ` Russell King
  0 siblings, 1 reply; 15+ messages in thread
From: kuznet @ 2001-02-28 20:20 UTC (permalink / raw)
  To: Eugene Crosser; +Cc: linux-kernel

Hello!

> I've seen hanging rsync over ssh more than once, while sending much data
> from an x86 running Linux (late 2.3.x) to Sparc/Solaris2.5.1

I remember this your report. However, recent news force to suspect
that the reason was in Solaris yet. Actually, if you send tcpdump of
failed session, this question can be answered.


Also REMEMBER!

rsync has __no__ chances to work, if one of sides is Linux-2.2 before 2.2.17
or Solaris of today (sunos-2.5.1 is enough old, probably, it still was right
unlike subsequent kernels). These stacks have conciding set of fatal bugs,
which prevent transmits with closed window.

To use rsync it is necessary to upgrade to >=2.2.17.

Alexey

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: rsync over ssh on 2.4.2 to 2.2.18
  2001-02-27 20:43   ` Russell King
@ 2001-02-28 20:27     ` kuznet
  0 siblings, 0 replies; 15+ messages in thread
From: kuznet @ 2001-02-28 20:27 UTC (permalink / raw)
  To: Russell King; +Cc: linux-kernel

Hello!

> I'll see if I can strace it from the start until it hangs tomorrow.

Please...

Also, try to make binary tcpdump.


> I was running at one point a 2.4.0-test kernel, but I didn't see these

Yes, it did not result in full stall. Lost wakeups were recovered
f.e. by any keyboard activity. 8)

Wow! 2.2.15? rsync surely does work with 2.2.15. The first 2.2 fixing
the same bugs, which solaris has now, was 2.2.17.

Alexey

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: rsync over ssh on 2.4.2 to 2.2.18
  2001-02-28 20:20   ` kuznet
@ 2001-03-19  7:37     ` Russell King
  2001-03-19 18:56       ` kuznet
  0 siblings, 1 reply; 15+ messages in thread
From: Russell King @ 2001-03-19  7:37 UTC (permalink / raw)
  To: kuznet; +Cc: Eugene Crosser, linux-kernel

On Wed, Feb 28, 2001 at 11:20:22PM +0300, kuznet@ms2.inr.ac.ru wrote:
> I remember this your report. However, recent news force to suspect
> that the reason was in Solaris yet. Actually, if you send tcpdump of
> failed session, this question can be answered.

Well, since I moved the rsync to 5pm, and then back to 9pm, I haven't
seen this problem - everything is again working as expected (touch wood)
with 2.2.15pre13 and 2.4.0.

This is odd, since it wasn't a one-off problem, but something that happened
each and every day of a particular week.  Anyway, if it starts happening
again, I'll get a tcpdump of the session.

--
Russell King (rmk@arm.linux.org.uk)                The developer of ARM Linux
             http://www.arm.linux.org.uk/personal/aboutme.html


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: rsync over ssh on 2.4.2 to 2.2.18
  2001-03-19  7:37     ` Russell King
@ 2001-03-19 18:56       ` kuznet
  2001-03-21 21:54         ` Rusty Russell
  0 siblings, 1 reply; 15+ messages in thread
From: kuznet @ 2001-03-19 18:56 UTC (permalink / raw)
  To: Russell King; +Cc: crosser, linux-kernel

Hello!

> Well, since I moved the rsync to 5pm, and then back to 9pm, I haven't
> seen this problem - everything is again working as expected (touch wood)
> with 2.2.15pre13 and 2.4.0.
> 
> This is odd, since it wasn't a one-off problem, but something that happened
> each and every day of a particular week.  Anyway, if it starts happening
> again, I'll get a tcpdump of the session.

Well... I can reproduce this not depending of day of week now. :-(

If I understood Andrew's mail correctly, rsync freezes when 
large amount of errors happen. Particularly, here ssh always freezes
trying to write to stdout (pipe linked to rsync) some chunk of data (>64K), 
consisting of reports sort of "stat(usr/X11R6/bin/xmbind) : No such file"
(my /usr has huge amount of stray symlinks...). rsync does not read
this pipe, trying to write some more data to pipe, which is
pipe linked to rsync stdin. Dead lock.

Note, that doing strace you do not see this write()! write() is interrupted
by strace and you see succesful select(). You can see real place of lockup
via ps axl or starting strace before lockup happened.

Andrew said about one more common case, when large amount of errors happens:
wrong permission on some target directory.

I have impression that long ago I observed the same affect,
when disk space at target exhausted.

Why do I tall this? Well, probably, something changed at fs,
wich you rsync and rsync generates less errors.

Alexey

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: rsync over ssh on 2.4.2 to 2.2.18
  2001-03-19 18:56       ` kuznet
@ 2001-03-21 21:54         ` Rusty Russell
  0 siblings, 0 replies; 15+ messages in thread
From: Rusty Russell @ 2001-03-21 21:54 UTC (permalink / raw)
  To: kuznet; +Cc: Russell King, crosser, linux-kernel

In message <200103191856.VAA01054@ms2.inr.ac.ru> you write:
> If I understood Andrew's mail correctly, rsync freezes when 
> large amount of errors happen. Particularly, here ssh always freezes

Known hard-to-fix bug in rsync; too many errors in the pipe, and it
locks up.

Rusty.
--
Premature optmztion is rt of all evl. --DK

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2001-03-23 12:42 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2001-02-27 10:02 rsync over ssh on 2.4.2 to 2.2.18 Russell King
2001-02-27 10:18 ` Rasmus Andersen
2001-02-27 10:58   ` bert hubert
2001-02-27 10:49 ` David S. Miller
2001-02-27 19:46   ` Russell King
2001-02-27 13:18 ` Anton Blanchard
2001-02-27 16:48   ` Russell King
2001-02-27 20:35 ` kuznet
2001-02-27 20:43   ` Russell King
2001-02-28 20:27     ` kuznet
2001-02-27 23:01 ` Eugene Crosser
2001-02-28 20:20   ` kuznet
2001-03-19  7:37     ` Russell King
2001-03-19 18:56       ` kuznet
2001-03-21 21:54         ` Rusty Russell

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).