linux-kernel.vger.kernel.org archive mirror
* Hour long timeout to ssh/telnet/ftp to down host?
@ 2001-06-12 21:02 Rob Landley
  2001-06-13  2:18 ` Ben Greear
  0 siblings, 1 reply; 5+ messages in thread
From: Rob Landley @ 2001-06-12 21:02 UTC
  To: linux-kernel

I have scripts that ssh into large numbers of boxes, which are sometimes 
down.  The timeout for figuring out the box is down is over an hour.  This is 
just insane.

Telnet and ftp behave similarly, or at least they lasted the 5 minutes I was 
willing to wait, anyway.  Basically anything that calls connect().  If the 
box doesn't respond in 15 seconds, I want to give up.

Is this a problem with the kernel or with glibc?  If it's the kernel, I'd 
expect a /proc entry where I can set this, but I can't seem to find one.  Is 
there one?  What would be involved in writing one?

If it's glibc I'm probably better off writing a wrapper to ping the 
destination before trying to connect, or killing the connection after a 
timeout with no traffic.  But both of those are really ugly solutions.
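
A minimal sketch of the "kill it after a timeout" wrapper, for illustration only.  It assumes Python's standard subprocess module and a hypothetical helper name, run_with_deadline.  Note that it is cruder than "a timeout with no traffic": the deadline covers the whole command, not just connection setup, so a slow but healthy remote command would be killed as well.

```python
import subprocess


def run_with_deadline(cmd, deadline=15.0):
    """Run cmd and return its exit status, or None if the deadline expired.

    `deadline` is in seconds and applies to the whole run, not only to
    the time spent establishing the connection.
    """
    try:
        return subprocess.run(cmd, timeout=deadline).returncode
    except subprocess.TimeoutExpired:
        # subprocess.run() has already killed the child at this point.
        return None
```

Something like run_with_deadline(["ssh", "somehost", "uptime"]) would then give up after 15 seconds ("somehost" is a placeholder, not a host from this thread).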

Anybody have any light to shed on the situation?

Rob

* Re: Hour long timeout to ssh/telnet/ftp to down host?
  2001-06-13  2:18 ` Ben Greear
@ 2001-06-12 21:47   ` Rob Landley
  2001-06-13  9:40   ` Luigi Genoni
  1 sibling, 0 replies; 5+ messages in thread
From: Rob Landley @ 2001-06-12 21:47 UTC
  To: Ben Greear, landley; +Cc: linux-kernel

>You can tune things by setting the tcp-timeout probably..I don't
>know exactly where to set this..

Aha, found it.  /proc/sys/net/ipv4/tcp_syn_retries

I am a victim of the exponential retry falloff, it would seem.  syn_retries 
of 1 takes a few seconds, 3 takes less than half a minute, and 5 takes 
several minutes.  The default value of 10 is what's giving me the problem 
(something like 20 minutes to time out, according to my earlier tests.)
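
The growth described above can be sketched with a simple model.  This is only an approximation under stated assumptions: the first SYN waits initial_rto seconds, every retransmission doubles the wait, and no single wait exceeds max_rto.  The 3-second and 120-second defaults, and the function name syn_timeout, are assumptions for illustration, not values taken from this thread.

```python
def syn_timeout(retries, initial_rto=3.0, max_rto=120.0):
    """Seconds until connect() gives up after `retries` SYN retransmissions.

    Assumed model: the wait starts at initial_rto, doubles after every
    unanswered SYN, and is capped at max_rto per wait.
    """
    total = 0.0
    rto = initial_rto
    for _ in range(retries + 1):  # the original SYN plus each retry
        total += min(rto, max_rto)
        rto *= 2
    return total


if __name__ == "__main__":
    for n in (1, 3, 5, 10):
        print(f"tcp_syn_retries={n}: {syn_timeout(n):.0f} s")
```

Under those assumed constants the model gives 9, 45, 189 and 789 seconds for 1, 3, 5 and 10 retries.  The measured figures quoted here differ somewhat, so treat the constants as illustrative; the point is how quickly the total grows with each extra retry.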

On top of that, ssh re-attempts the connection four times before actually 
failing, which is where my hour-and-change timeout came from.  ("ssh -v -v -v" 
comes in handy...)

Fun.

Can we change the default value for this to something more sane, like 5?  
Exponential falloff is not good when your order of magnitude hits double 
digits.

> You probably don't want all tcp to time out at 15 seconds anyway, so

Just connection initiation.  (If their ip stack hasn't replied to me by then, 
I doubt it's going to.)

> I'd suggest either using non-blocking connect (if you have the code
> that does the connect), or just set a timer (or use sigalarm) when you
> start the attempt, and fail the attempt if the timer or alarm signal
> goes off.

Except I'm using off-the-shelf ssh.  (I asked them about this problem a month 
ago, and there was some discussion of a workaround on their mailing list, but 
2.9 came out and still had the same behavior.  Apparently, if it's not a 
problem in OpenBSD, it's not a problem in OpenSSH...)

> > If it's glibc I'm probably better off writing a wrapper to ping the
> > destination before trying to connect, or killing the connection after a
> > timeout with no traffic.  But both of those are really ugly solutions.
>
> Ugly is relative, and don't use ping because there is still a race
> condition (ping worked, but by the time you try tcp, the box is down.)

Yeah, but it would eventually time out and recover.  I've got ten threads out 
querying boxes, so that's a really rare race condition.  And I already 
acknowledged it was ugly. :)

So the problem is just that tcp_syn_retries' default value of 10 is way too 
high due to the exponentially increasing gap between each retry.

Rob

* Re: Hour long timeout to ssh/telnet/ftp to down host?
  2001-06-12 21:02 Hour long timeout to ssh/telnet/ftp to down host? Rob Landley
@ 2001-06-13  2:18 ` Ben Greear
  2001-06-12 21:47   ` Rob Landley
  2001-06-13  9:40   ` Luigi Genoni
  0 siblings, 2 replies; 5+ messages in thread
From: Ben Greear @ 2001-06-13  2:18 UTC
  To: landley; +Cc: linux-kernel

Rob Landley wrote:
> 
> I have scripts that ssh into large numbers of boxes, which are sometimes
> down.  The timeout for figuring out the box is down is over an hour.  This is
> just insane.
> 
> Telnet and ftp behave similarly, or at least they lasted the 5 minutes I was
> willing to wait, anyway.  Basically anything that calls connect().  If the
> box doesn't respond in 15 seconds, I want to give up.
> 
> Is this a problem with the kernel or with glibc?  If it's the kernel, I'd
> expect a /proc entry where I can set this, but I can't seem to find one.  Is
> there one?  What would be involved in writing one?
> 

You can tune things by setting the tcp-timeout probably..I don't
know exactly where to set this..

You probably don't want all tcp to time out at 15 seconds anyway, so
I'd suggest either using non-blocking connect (if you have the code
that does the connect), or just set a timer (or use sigalarm) when you
start the attempt, and fail the attempt if the timer or alarm signal
goes off.
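
A minimal sketch of the non-blocking connect approach, assuming you control the code that calls connect().  It uses only Python's standard socket and select modules; the function name connect_with_timeout is hypothetical, the sketch is IPv4-only, and it assumes Linux-style EINPROGRESS behaviour.

```python
import errno
import os
import select
import socket


def connect_with_timeout(host, port, timeout=15.0):
    """Return a connected TCP socket, or raise OSError/TimeoutError."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setblocking(False)
    try:
        # A non-blocking connect normally returns EINPROGRESS at once.
        err = sock.connect_ex((host, port))
        if err not in (0, errno.EINPROGRESS):
            raise OSError(err, os.strerror(err))
        # The socket turns writable when the handshake succeeds or fails.
        _, writable, _ = select.select([], [sock], [], timeout)
        if not writable:
            raise TimeoutError(f"no reply from {host}:{port} in {timeout}s")
        # SO_ERROR tells the two outcomes apart (0 means connected).
        err = sock.getsockopt(socket.SOL_SOCKET, socket.SO_ERROR)
        if err:
            raise OSError(err, os.strerror(err))
    except BaseException:
        sock.close()
        raise
    sock.setblocking(True)
    return sock
```

This bounds only connection setup; once the socket is returned, established traffic keeps whatever timeouts it would normally have.  In Python the standard library's socket.create_connection((host, port), timeout=15) gives much the same effect in one call.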

> If it's glibc I'm probably better off writing a wrapper to ping the
> destination before trying to connect, or killing the connection after a
> timeout with no traffic.  But both of those are really ugly solutions.

Ugly is relative, and don't use ping because there is still a race condition
(ping worked, but by the time you try tcp, the box is down.)

> 
> Anybody have any light to shed on the situation?
> 
> Rob

-- 
Ben Greear <greearb@candelatech.com>          <Ben_Greear@excite.com>
President of Candela Technologies Inc      http://www.candelatech.com
ScryMUD:  http://scry.wanfear.com     http://scry.wanfear.com/~greear

* Re: Hour long timeout to ssh/telnet/ftp to down host?
  2001-06-13  2:18 ` Ben Greear
  2001-06-12 21:47   ` Rob Landley
@ 2001-06-13  9:40   ` Luigi Genoni
  2001-06-13 10:07     ` Rob Landley
  1 sibling, 1 reply; 5+ messages in thread
From: Luigi Genoni @ 2001-06-13  9:40 UTC
  To: Ben Greear; +Cc: landley, linux-kernel



On Tue, 12 Jun 2001, Ben Greear wrote:

> Rob Landley wrote:
> >
> > I have scripts that ssh into large numbers of boxes, which are sometimes
> > down.  The timeout for figuring out the box is down is over an hour.  This is
> > just insane.
> >
> > Telnet and ftp behave similarly, or at least they lasted the 5 minutes I was
> > willing to wait, anyway.  Basically anything that calls connect().  If the
> > box doesn't respond in 15 seconds, I want to give up.
> >
> > Is this a problem with the kernel or with glibc?  If it's the kernel, I'd
> > expect a /proc entry where I can set this, but I can't seem to find one.  Is
> > there one?  What would be involved in writing one?
> >
>
> You can tune things by setting the tcp-timeout probably..I don't
> know exactly where to set this..

/proc/sys/net/ipv4/tcp_fin_timeout

default is 60.



* Re: Hour long timeout to ssh/telnet/ftp to down host?
  2001-06-13  9:40   ` Luigi Genoni
@ 2001-06-13 10:07     ` Rob Landley
  0 siblings, 0 replies; 5+ messages in thread
From: Rob Landley @ 2001-06-13 10:07 UTC
  To: Luigi Genoni, Ben Greear; +Cc: landley, linux-kernel

On Wednesday 13 June 2001 05:40, Luigi Genoni wrote:
> On Tue, 12 Jun 2001, Ben Greear wrote:

> > You can tune things by setting the tcp-timeout probably..I don't
> > know exactly where to set this..
>
> /proc/sys/net/ipv4/tcp_fin_timeout
>
> default is 60.

Never got that far.  My problem was actually tcp_syn_retries. Remember, I was 
talking to a host that was unplugged.  (I wasn't even getting "host 
unreachable" messages, the packets were just disappearing.)  The default 
timeout in that case is ridiculous due to the exponentially increasing delays 
between retries.  10 retries wound up being something like 20 minutes.

I set it to 5 and everything works beautifully now.  ssh (which retries the 
connection 4 times, and used to take over an hour to time out) now takes just 
over 3 minutes, which I can live with.
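
For scripts that need to check or change the setting, a small sketch follows.  It assumes the /proc/sys/net/ipv4/tcp_syn_retries path named earlier in the thread; the helper names get_syn_retries and set_syn_retries are hypothetical.

```python
from pathlib import Path

# Path as given in this thread; the helpers accept another path for testing.
SYN_RETRIES = Path("/proc/sys/net/ipv4/tcp_syn_retries")


def get_syn_retries(path=SYN_RETRIES):
    """Return the current retry count as an integer."""
    return int(path.read_text().strip())


def set_syn_retries(value, path=SYN_RETRIES):
    """Write a new retry count.

    Writing the real /proc file needs root, applies system-wide, and
    does not survive a reboot.
    """
    path.write_text(f"{int(value)}\n")
```

Calling set_syn_retries(5) as root would correspond to the change described above.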

Rob
