linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Rsync SSH session hang, AGAIN  -  Help!  Deadlock debugging needed.
@ 2002-10-01 14:49 Stephen D. Williams
  2002-10-04  5:12 ` Rsync SSH session hang, AGAIN - cancel Stephen D. Williams
  0 siblings, 1 reply; 2+ messages in thread
From: Stephen D. Williams @ 2002-10-01 14:49 UTC (permalink / raw)
  To: linux-kernel

This has been a recurring problem for a couple years which I and others 
have experienced.  I was free from it for a while, but after upgrading 
OpenSSL/OpenSSH to avoid the recent exploit it is back and highly 
repeatable.  This has been persistant enough that I am going to start 
with the assumption that it may be a kernel bug, or at least probably 
debuggable definitively only by a proficient kernel developer.  We have 
got to squash this once and for all; SSH is used everywhere and it needs 
to be reliable.  Probably there is a race condition in ssh, as mentioned 
below, but it must be subtle.

rsync/ssh transfers from local system to local system work perfectly. 
 Between the systems, there is nearly always large delays at certain 
times and usually a complete hang.  After a long period, this often 
produces a timeout.  These sytems are on 100baseT on the same switch. 
 One system appears to be having mild packet loss (400 out of 400,000 on 
both send and receive as frame/carrier erros).  BTW, running a cpio 
through the SSH connections causes a nearly immediate hang, so it is 
unlikely to be a problem with rsync.

Both systems work find receiving rsync/ssh from my laptop over a 400Kb 
DSL connection with:
OpenSSH 3.1p1
openssl 0.9.6c
rsync 2.5.4
gcc 2.96
kernel 2.4.19

(systems are a combination of Suse and Redhat 7.3, upgraded variously by 
hand)

My standard rsync/ssh script looks like:

brsyncndz (backup rsync no delete or compression):
#!/bin/sh
if [ "$PORT" = "" ]; then PORT=22; fi
rsync -vv -HpogDtSxlra --partial --progress --stats -e "ssh -p $PORT" $*

On both sides:
OpenSSL-0.9.6g
Openssh-3.4p1
rsync-2.5.5

On 'old' system:
gcc 2.95.2
kernel 2.4.3

On 'new' system:
gcc 2.96
kernel 2.4.20-pre8


References to past discussions:  (Tried the TCP buffers tuning.)
http://lists.insecure.org/linux-kernel/2001/Mar/0374.html
http://lists.insecure.org/linux-kernel/2001/Mar/0380.html
http://lists.insecure.org/linux-kernel/2001/Mar/0400.html
Haven't tried this code yet:
http://lists.insecure.org/linux-kernel/2001/Mar/0652.html

Thanks!
sdw
-- 
sdw@lig.net http://sdw.st
Stephen D. Williams 43392 Wayside Cir,Ashburn,VA 20147-4622
703-724-0118W 703-995-0407Fax Dec2001




^ permalink raw reply	[flat|nested] 2+ messages in thread

* Re: Rsync SSH session hang, AGAIN  -  cancel
  2002-10-01 14:49 Rsync SSH session hang, AGAIN - Help! Deadlock debugging needed Stephen D. Williams
@ 2002-10-04  5:12 ` Stephen D. Williams
  0 siblings, 0 replies; 2+ messages in thread
From: Stephen D. Williams @ 2002-10-04  5:12 UTC (permalink / raw)
  To: Stephen D. Williams; +Cc: linux-kernel

I didn't think any of the tools that I know, trust, and have used for 
years could be failing me, but I hadn't seen quite this failure mode before.

Even though the 'old' system below was reachable via web, ssh command 
line sessions, etc., any connection between it and another box on the 
same network at the ISP failed after 100-200KB.  While there were a few 
ethernet errors, nothing seemed to increase much or at all with each ssh 
hang.  Nevertheless, it was either a cable or hub/switch at the ISP that 
was bad.  I now theorize that, having a noisy port, the hub/switch 
involved would sense an unacceptably bad port and shut it down, but only 
when burst transfers were between local boxes.  Everything remote was 
spaced enough to be below that threshold.  After a few seconds of no 
activity, the port was apparently unlocked.  I knew that switches had 
bad port sensing, but had never seen a system on the fence like this. 
 It didn't help that these machines were 500 miles away in a colo with 
only daytime staff.

Apologies for the premature post.
sdw

Stephen D. Williams wrote:

> This has been a recurring problem for a couple years which I and 
> others have experienced.  I was free from it for a while, but after 
> upgrading OpenSSL/OpenSSH to avoid the recent exploit it is back and 
> highly repeatable.  This has been persistant enough that I am going to 
> start with the assumption that it may be a kernel bug, or at least 
> probably debuggable definitively only by a proficient kernel 
> developer.  We have got to squash this once and for all; SSH is used 
> everywhere and it needs to be reliable.  Probably there is a race 
> condition in ssh, as mentioned below, but it must be subtle.
>
> rsync/ssh transfers from local system to local system work perfectly. 
> Between the systems, there is nearly always large delays at certain 
> times and usually a complete hang.  After a long period, this often 
> produces a timeout.  These sytems are on 100baseT on the same switch. 
> One system appears to be having mild packet loss (400 out of 400,000 
> on both send and receive as frame/carrier erros).  BTW, running a cpio 
> through the SSH connections causes a nearly immediate hang, so it is 
> unlikely to be a problem with rsync.
>
> Both systems work find receiving rsync/ssh from my laptop over a 400Kb 
> DSL connection with:
> OpenSSH 3.1p1
> openssl 0.9.6c
> rsync 2.5.4
> gcc 2.96
> kernel 2.4.19
>
> (systems are a combination of Suse and Redhat 7.3, upgraded variously 
> by hand)
>
> My standard rsync/ssh script looks like:
>
> brsyncndz (backup rsync no delete or compression):
> #!/bin/sh
> if [ "$PORT" = "" ]; then PORT=22; fi
> rsync -vv -HpogDtSxlra --partial --progress --stats -e "ssh -p $PORT" $*
>
> On both sides:
> OpenSSL-0.9.6g
> Openssh-3.4p1
> rsync-2.5.5
>
> On 'old' system:
> gcc 2.95.2
> kernel 2.4.3
>
> On 'new' system:
> gcc 2.96
> kernel 2.4.20-pre8
>
>
> References to past discussions:  (Tried the TCP buffers tuning.)
> http://lists.insecure.org/linux-kernel/2001/Mar/0374.html
> http://lists.insecure.org/linux-kernel/2001/Mar/0380.html
> http://lists.insecure.org/linux-kernel/2001/Mar/0400.html
> Haven't tried this code yet:
> http://lists.insecure.org/linux-kernel/2001/Mar/0652.html
>
> Thanks!
> sdw





^ permalink raw reply	[flat|nested] 2+ messages in thread

end of thread, other threads:[~2002-10-04  5:07 UTC | newest]

Thread overview: 2+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2002-10-01 14:49 Rsync SSH session hang, AGAIN - Help! Deadlock debugging needed Stephen D. Williams
2002-10-04  5:12 ` Rsync SSH session hang, AGAIN - cancel Stephen D. Williams

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).