* NFS auto-reconnect tuning.
@ 2014-09-24 15:39 Benjamin ESTRABAUD
  2014-09-25  1:44 ` NeilBrown
  0 siblings, 1 reply; 6+ messages in thread
From: Benjamin ESTRABAUD @ 2014-09-24 15:39 UTC (permalink / raw)
  To: linux-nfs

Hi!

I've got a scenario where I'm connected to a NFS share on a client, have 
a file descriptor open as read only (could also be write) on a file from 
that share, and I'm suddenly changing the IP address of that client.

Obviously, the NFS share will hang, so if I now try to read the file 
descriptor I've got open (here in Python), the "read" call will also hang.

However, the driver seems to attempt to do something (maybe 
save/determine whether the existing connection can be saved) and then, 
after about 20 minutes the driver transparently reconnects to the NFS 
share (which is what I wanted anyways) and the "read" call instantiated 
earlier simply finishes (I don't even have to re-open the file again or 
even call "read" again).

The dmesg prints I get are as follows:

[ 4424.500380] nfs: server 10.0.2.17 not responding, still trying <-- 
changed IP address and started reading the file.
[ 4451.560467] nfs: server 10.0.2.17 OK <--- The NFS share was 
reconnected, the "read" call completes successfully.

I would like to know if there was any way to tune this behaviour, 
telling the NFS driver to reconnect if a share is unavailable after say 
10 seconds.

I tried the following options without any success:

retry=0; hard/soft; timeo=3; retrans=1; bg/fg

I am running on a custom distro (homemade embedded distro, not based on 
anything in particular) running stock kernel 3.10.18 compiled for i686.

Would anyone know what I could do to force NFS into reconnecting a 
seemingly "dead" session sooner?

Thanks in advance for your help.

Regards,

Ben - MPSTOR.


* Re: NFS auto-reconnect tuning.
  2014-09-24 15:39 NFS auto-reconnect tuning Benjamin ESTRABAUD
@ 2014-09-25  1:44 ` NeilBrown
  2014-09-25  9:46   ` Benjamin ESTRABAUD
  0 siblings, 1 reply; 6+ messages in thread
From: NeilBrown @ 2014-09-25  1:44 UTC (permalink / raw)
  To: Benjamin ESTRABAUD; +Cc: linux-nfs


On Wed, 24 Sep 2014 16:39:55 +0100 Benjamin ESTRABAUD <be@mpstor.com> wrote:

> Hi!
> 
> I've got a scenario where I'm connected to a NFS share on a client, have 
> a file descriptor open as read only (could also be write) on a file from 
> that share, and I'm suddenly changing the IP address of that client.
> 
> Obviously, the NFS share will hang, so if I now try to read the file 
> descriptor I've got open (here in Python), the "read" call will also hang.
> 
> However, the driver seems to attempt to do something (maybe 
> save/determine whether the existing connection can be saved) and then, 
> after about 20 minutes the driver transparently reconnects to the NFS 
> share (which is what I wanted anyways) and the "read" call instantiated 
> earlier simply finishes (I don't even have to re-open the file again or 
> even call "read" again).
> 
> The dmesg prints I get are as follows:
> 
> [ 4424.500380] nfs: server 10.0.2.17 not responding, still trying <-- 
> changed IP address and started reading the file.
> [ 4451.560467] nfs: server 10.0.2.17 OK <--- The NFS share was 
> reconnected, the "read" call completes successfully.

The difference between these timestamps is 27 seconds, which is a lot less
than the "20 minutes" that you quote.  That seems odd.

If you adjust
   /proc/sys/net/ipv4/tcp_retries2

you can reduce the current timeout.
See Documentation/networking/ip-sysctl.txt for details on the setting.

https://www.kernel.org/doc/Documentation/networking/ip-sysctl.txt

It claims the default gives an effective timeout of 924 seconds or about 15
minutes.

I just tried and the timeout was 1047 seconds. This is probably the next
retry after 924 seconds.

If I reduce tcp_retries2 to '3' (well below the recommended minimum) I get
a timeout of 5 seconds.
You can possibly find a suitable number that isn't too small...
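The arithmetic behind these numbers can be sketched with a simple model (a back-of-the-envelope sketch assuming the usual 200 ms initial RTO doubling up to a 120 s cap; actual behaviour depends on the measured RTT of the connection):

```python
# Rough model of the effective TCP retransmission timeout that
# /proc/sys/net/ipv4/tcp_retries2 controls.
TCP_RTO_MIN = 0.2    # seconds: typical initial retransmission timeout
TCP_RTO_MAX = 120.0  # seconds: exponential backoff cap

def effective_timeout(tcp_retries2):
    """Approximate total time before TCP gives up on an unacknowledged
    segment: tcp_retries2 retransmissions with exponential backoff,
    plus the final wait before the connection is declared dead."""
    total, rto = 0.0, TCP_RTO_MIN
    for _ in range(tcp_retries2 + 1):
        total += rto
        rto = min(rto * 2, TCP_RTO_MAX)
    return total

print(effective_timeout(15))  # default setting: ~924.6 seconds
print(effective_timeout(3))   # a few seconds, in the ballpark measured above
```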

Alternately you could use NFSv4.  It will close the connection on a timeout.
In the default config I measure a 78 second timeout, which is probably more
acceptable.  This number would respond to the timeo mount option.
If I set that to 100, I get a 28 second timeout.
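As a concrete illustration, a hypothetical /etc/fstab entry forcing NFSv4 with a shorter timeo might look like this (the server address is the one from this thread; the export path, mount point, and retrans value are placeholders; timeo is in tenths of a second, so timeo=100 is a 10-second minor timeout):

```
10.0.2.17:/export  /mnt/share  nfs4  timeo=100,retrans=2  0  0
```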

The same effect could be provided for NFSv3 by setting:

           __set_bit(NFS_CS_DISCRTRY, &clp->cl_flags);

somewhere appropriate.  I wonder why that isn't being done for v3 already...
Probably some subtle protocol difference.

NeilBrown

 
> I would like to know if there was any way to tune this behaviour, 
> telling the NFS driver to reconnect if a share is unavailable after say 
> 10 seconds.
> 
> I tried the following options without any success:
> 
> retry=0; hard/soft; timeo=3; retrans=1; bg/fg
> 
> I am running on a custom distro (homemade embedded distro, not based on 
> anything in particular) running stock kernel 3.10.18 compiled for i686.
> 
> Would anyone know what I could do to force NFS into reconnecting a 
> seemingly "dead" session sooner?
> 
> Thanks in advance for your help.
> 
> Regards,
> 
> Ben - MPSTOR.




* Re: NFS auto-reconnect tuning.
  2014-09-25  1:44 ` NeilBrown
@ 2014-09-25  9:46   ` Benjamin ESTRABAUD
  2014-09-28 23:28     ` NeilBrown
  0 siblings, 1 reply; 6+ messages in thread
From: Benjamin ESTRABAUD @ 2014-09-25  9:46 UTC (permalink / raw)
  To: linux-nfs; +Cc: NeilBrown

On 25/09/14 02:44, NeilBrown wrote:
> On Wed, 24 Sep 2014 16:39:55 +0100 Benjamin ESTRABAUD <be@mpstor.com> wrote:
>
>> Hi!
>>
>> I've got a scenario where I'm connected to a NFS share on a client, have
>> a file descriptor open as read only (could also be write) on a file from
>> that share, and I'm suddenly changing the IP address of that client.
>>
>> Obviously, the NFS share will hang, so if I now try to read the file
>> descriptor I've got open (here in Python), the "read" call will also hang.
>>
>> However, the driver seems to attempt to do something (maybe
>> save/determine whether the existing connection can be saved) and then,
>> after about 20 minutes the driver transparently reconnects to the NFS
>> share (which is what I wanted anyways) and the "read" call instantiated
>> earlier simply finishes (I don't even have to re-open the file again or
>> even call "read" again).
>>
>> The dmesg prints I get are as follows:
>>
>> [ 4424.500380] nfs: server 10.0.2.17 not responding, still trying <--
>> changed IP address and started reading the file.
>> [ 4451.560467] nfs: server 10.0.2.17 OK <--- The NFS share was
>> reconnected, the "read" call completes successfully.
>
> The difference between these timestamps is 27 seconds, which is a lot less
> than the "20 minutes" that you quote.  That seems odd.
>
Hi Neil,

My bad, I had made several attempts and must have copied the wrong dmesg 
trace. The above happened when I manually reverted the IP config back to 
its original address (when doing so the driver reconnects immediately).

Here is what had happened:

[ 1663.940406] nfs: server 10.0.2.17 not responding, still trying
[ 2712.480325] nfs: server 10.0.2.17 OK

> If you adjust
>     /proc/sys/net/ipv4/tcp_retries2
>
> you can reduce the current timeout.
> See Documentation/networking/ip-sysctl.txt for details on the setting.
>
> https://www.kernel.org/doc/Documentation/networking/ip-sysctl.txt
>
> It claims the default gives an effective timeout of 924 seconds or about 15
> minutes.
>
> I just tried and the timeout was 1047 seconds. This is probably the next
> retry after 924 seconds.
>
> If I reduce tcp_retries2 to '3' (well below the recommended minimum) I get
> a timeout of 5 seconds.
> You can possibly find a suitable number that isn't too small...
>
That's very interesting! Thank you very much! However, I'm a bit wary of 
changing TCP-stack-wide settings: NFS is only one small chunk of a much 
bigger network storage box, so an NFS-specific alternative would probably 
be better. Also, I would need a very small timeout, on the order of 10-20 
seconds *max*, which would probably cause other issues elsewhere. But this 
is very interesting indeed.

> Alternately you could use NFSv4.  It will close the connection on a timeout.
> In the default config I measure a 78 second timeout, which is probably more
> acceptable.  This number would respond to the timeo mount option.
> If I set that to 100, I get a 28 second timeout.
>
This is great! I had no idea, I will definitely roll NFSv4 and try that. 
Thanks again for your help!

> The same effect could be provided for NFSv3 by setting:
>
>             __set_bit(NFS_CS_DISCRTRY, &clp->cl_flags);
>
> somewhere appropriate.  I wonder why that isn't being done for v3 already...
> Probably some subtle protocol difference.
If for some reason we can't stick to v4 we'll try that too, thanks.

>
> NeilBrown
>
>
Regards,

Ben - MPSTOR.

>> I would like to know if there was any way to tune this behaviour,
>> telling the NFS driver to reconnect if a share is unavailable after say
>> 10 seconds.
>>
>> I tried the following options without any success:
>>
>> retry=0; hard/soft; timeo=3; retrans=1; bg/fg
>>
>> I am running on a custom distro (homemade embedded distro, not based on
>> anything in particular) running stock kernel 3.10.18 compiled for i686.
>>
>> Would anyone know what I could do to force NFS into reconnecting a
>> seemingly "dead" session sooner?
>>
>> Thanks in advance for your help.
>>
>> Regards,
>>
>> Ben - MPSTOR.
>



* Re: NFS auto-reconnect tuning.
  2014-09-25  9:46   ` Benjamin ESTRABAUD
@ 2014-09-28 23:28     ` NeilBrown
  2014-09-29 10:06       ` Benjamin ESTRABAUD
  0 siblings, 1 reply; 6+ messages in thread
From: NeilBrown @ 2014-09-28 23:28 UTC (permalink / raw)
  To: Benjamin ESTRABAUD; +Cc: linux-nfs


On Thu, 25 Sep 2014 10:46:09 +0100 Benjamin ESTRABAUD <be@mpstor.com> wrote:

> On 25/09/14 02:44, NeilBrown wrote:
> > On Wed, 24 Sep 2014 16:39:55 +0100 Benjamin ESTRABAUD <be@mpstor.com> wrote:
> >
> >> Hi!
> >>
> >> I've got a scenario where I'm connected to a NFS share on a client, have
> >> a file descriptor open as read only (could also be write) on a file from
> >> that share, and I'm suddenly changing the IP address of that client.
> >>
> >> Obviously, the NFS share will hang, so if I now try to read the file
> >> descriptor I've got open (here in Python), the "read" call will also hang.
> >>
> >> However, the driver seems to attempt to do something (maybe
> >> save/determine whether the existing connection can be saved) and then,
> >> after about 20 minutes the driver transparently reconnects to the NFS
> >> share (which is what I wanted anyways) and the "read" call instantiated
> >> earlier simply finishes (I don't even have to re-open the file again or
> >> even call "read" again).
> >>
> >> The dmesg prints I get are as follows:
> >>
> >> [ 4424.500380] nfs: server 10.0.2.17 not responding, still trying <--
> >> changed IP address and started reading the file.
> >> [ 4451.560467] nfs: server 10.0.2.17 OK <--- The NFS share was
> >> reconnected, the "read" call completes successfully.
> >
> > The difference between these timestamps is 27 seconds, which is a lot less
> > than the "20 minutes" that you quote.  That seems odd.
> >
> Hi Neil,
> 
> My bad, I had made several attempts and must have copied the wrong dmesg 
> trace. The above happened when I manually reverted the IP config back to 
> its original address (when doing so the driver reconnects immediately).
> 
> Here is what had happened:
> 
> [ 1663.940406] nfs: server 10.0.2.17 not responding, still trying
> [ 2712.480325] nfs: server 10.0.2.17 OK
> 
> > If you adjust
> >     /proc/sys/net/ipv4/tcp_retries2
> >
> > you can reduce the current timeout.
> > See Documentation/networking/ip-sysctl.txt for details on the setting.
> >
> > https://www.kernel.org/doc/Documentation/networking/ip-sysctl.txt
> >
> > It claims the default gives an effective timeout of 924 seconds or about 15
> > minutes.
> >
> > I just tried and the timeout was 1047 seconds. This is probably the next
> > retry after 924 seconds.
> >
> > If I reduce tcp_retries2 to '3' (well below the recommended minimum) I get
> > a timeout of 5 seconds.
> > You can possibly find a suitable number that isn't too small...
> >
> That's very interesting! Thank you very much! However, I'm a bit wary of 
> changing TCP-stack-wide settings: NFS is only one small chunk of a much 
> bigger network storage box, so an NFS-specific alternative would probably 
> be better. Also, I would need a very small timeout, on the order of 10-20 
> seconds *max*, which would probably cause other issues elsewhere. But this 
> is very interesting indeed.
> 
> > Alternately you could use NFSv4.  It will close the connection on a timeout.
> > In the default config I measure a 78 second timeout, which is probably more
> > acceptable.  This number would respond to the timeo mount option.
> > If I set that to 100, I get a 28 second timeout.
> >
> This is great! I had no idea, I will definitely roll NFSv4 and try that. 
> Thanks again for your help!

Actually ... it turns out that NFSv4 shouldn't close the connection early
like that.  It happens due to a bug which is now being fixed :-)

Probably the real problem is that the TCP KEEPALIVE feature isn't working
properly.  NFS configures it so that keep-alives are sent at the 'timeout'
time and the connection should close if a reply is not seen fairly soon.

However, TCP does not send keepalives while there are packets in the queue
waiting to go out (which is appropriate), and it also doesn't check for
timeouts when the queue is full.

I'll post to net-dev asking if I've understood this correctly and will take
the liberty of cc:ing you.

NeilBrown





* Re: NFS auto-reconnect tuning.
  2014-09-28 23:28     ` NeilBrown
@ 2014-09-29 10:06       ` Benjamin ESTRABAUD
  2014-09-29 21:34         ` NeilBrown
  0 siblings, 1 reply; 6+ messages in thread
From: Benjamin ESTRABAUD @ 2014-09-29 10:06 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-nfs

On 29/09/14 00:28, NeilBrown wrote:
> On Thu, 25 Sep 2014 10:46:09 +0100 Benjamin ESTRABAUD <be@mpstor.com> wrote:
>
>> On 25/09/14 02:44, NeilBrown wrote:
>>> On Wed, 24 Sep 2014 16:39:55 +0100 Benjamin ESTRABAUD <be@mpstor.com> wrote:
>>>
>>>> Hi!
>>>>
>>>> I've got a scenario where I'm connected to a NFS share on a client, have
>>>> a file descriptor open as read only (could also be write) on a file from
>>>> that share, and I'm suddenly changing the IP address of that client.
>>>>
>>>> Obviously, the NFS share will hang, so if I now try to read the file
>>>> descriptor I've got open (here in Python), the "read" call will also hang.
>>>>
>>>> However, the driver seems to attempt to do something (maybe
>>>> save/determine whether the existing connection can be saved) and then,
>>>> after about 20 minutes the driver transparently reconnects to the NFS
>>>> share (which is what I wanted anyways) and the "read" call instantiated
>>>> earlier simply finishes (I don't even have to re-open the file again or
>>>> even call "read" again).
>>>>
>>>> The dmesg prints I get are as follows:
>>>>
>>>> [ 4424.500380] nfs: server 10.0.2.17 not responding, still trying <--
>>>> changed IP address and started reading the file.
>>>> [ 4451.560467] nfs: server 10.0.2.17 OK <--- The NFS share was
>>>> reconnected, the "read" call completes successfully.
>>>
>>> The difference between these timestamps is 27 seconds, which is a lot less
>>> than the "20 minutes" that you quote.  That seems odd.
>>>
>> Hi Neil,
>>
>> My bad, I had made several attempts and must have copied the wrong dmesg
>> trace. The above happened when I manually reverted the IP config back to
>> its original address (when doing so the driver reconnects immediately).
>>
>> Here is what had happened:
>>
>> [ 1663.940406] nfs: server 10.0.2.17 not responding, still trying
>> [ 2712.480325] nfs: server 10.0.2.17 OK
>>
>>> If you adjust
>>>      /proc/sys/net/ipv4/tcp_retries2
>>>
>>> you can reduce the current timeout.
>>> See Documentation/networking/ip-sysctl.txt for details on the setting.
>>>
>>> https://www.kernel.org/doc/Documentation/networking/ip-sysctl.txt
>>>
>>> It claims the default gives an effective timeout of 924 seconds or about 15
>>> minutes.
>>>
>>> I just tried and the timeout was 1047 seconds. This is probably the next
>>> retry after 924 seconds.
>>>
>>> If I reduce tcp_retries2 to '3' (well below the recommended minimum) I get
>>> a timeout of 5 seconds.
>>> You can possibly find a suitable number that isn't too small...
>>>
>> That's very interesting! Thank you very much! However, I'm a bit wary of
>> changing TCP-stack-wide settings: NFS is only one small chunk of a much
>> bigger network storage box, so an NFS-specific alternative would probably
>> be better. Also, I would need a very small timeout, on the order of 10-20
>> seconds *max*, which would probably cause other issues elsewhere. But this
>> is very interesting indeed.
>>
>>> Alternately you could use NFSv4.  It will close the connection on a timeout.
>>> In the default config I measure a 78 second timeout, which is probably more
>>> acceptable.  This number would respond to the timeo mount option.
>>> If I set that to 100, I get a 28 second timeout.
>>>
>> This is great! I had no idea, I will definitely roll NFSv4 and try that.
>> Thanks again for your help!
>
> Actually ... it turns out that NFSv4 shouldn't close the connection early
> like that.  It happens due to a bug which is now being fixed :-)
Well, maybe I could "patch" NFSv4 here for my purpose or use the patch 
you provided before for NFSv3, although I admit it would be easier to 
use a stock kernel if possible.
>
> Probably the real problem is that the TCP KEEPALIVE feature isn't working
> properly.  NFS configures it so that keep-alives are sent at the 'timeout'
> time and the connection should close if a reply is not seen fairly soon.
>
I wouldn't mind using TCP Keepalives but I am worried that I'd have to 
change a TCP wide setting, which other applications might rely on (I 
read that the TCP keepalive time for instance should be no less than 2 
hours). Could NFS just have a "custom" TCP keepalive and leave the 
global, default setting untouched?

> However, TCP does not send keepalives while there are packets in the queue
> waiting to go out (which is appropriate), and it also doesn't check for
> timeouts when the queue is full.
>
So if I understand correctly, keepalives are sent when the connection is 
completely idle, but if the connection broke during a transfer (queue not 
empty) then NFS would never find out, as it wouldn't send any more 
keepalives?

> I'll post to net-dev asking if I've understood this correctly and will take
> the liberty of cc:ing you.
Thank you very much for this, this will help.

>
> NeilBrown
>
>
Ben - MPSTOR



* Re: NFS auto-reconnect tuning.
  2014-09-29 10:06       ` Benjamin ESTRABAUD
@ 2014-09-29 21:34         ` NeilBrown
  0 siblings, 0 replies; 6+ messages in thread
From: NeilBrown @ 2014-09-29 21:34 UTC (permalink / raw)
  To: Benjamin ESTRABAUD; +Cc: linux-nfs


On Mon, 29 Sep 2014 11:06:26 +0100 Benjamin ESTRABAUD <be@mpstor.com> wrote:

> On 29/09/14 00:28, NeilBrown wrote:
> > On Thu, 25 Sep 2014 10:46:09 +0100 Benjamin ESTRABAUD <be@mpstor.com> wrote:
> >
> >> On 25/09/14 02:44, NeilBrown wrote:
> >>> On Wed, 24 Sep 2014 16:39:55 +0100 Benjamin ESTRABAUD <be@mpstor.com> wrote:
> >>>
> >>>> Hi!
> >>>>
> >>>> I've got a scenario where I'm connected to a NFS share on a client, have
> >>>> a file descriptor open as read only (could also be write) on a file from
> >>>> that share, and I'm suddenly changing the IP address of that client.
> >>>>
> >>>> Obviously, the NFS share will hang, so if I now try to read the file
> >>>> descriptor I've got open (here in Python), the "read" call will also hang.
> >>>>
> >>>> However, the driver seems to attempt to do something (maybe
> >>>> save/determine whether the existing connection can be saved) and then,
> >>>> after about 20 minutes the driver transparently reconnects to the NFS
> >>>> share (which is what I wanted anyways) and the "read" call instantiated
> >>>> earlier simply finishes (I don't even have to re-open the file again or
> >>>> even call "read" again).
> >>>>
> >>>> The dmesg prints I get are as follows:
> >>>>
> >>>> [ 4424.500380] nfs: server 10.0.2.17 not responding, still trying <--
> >>>> changed IP address and started reading the file.
> >>>> [ 4451.560467] nfs: server 10.0.2.17 OK <--- The NFS share was
> >>>> reconnected, the "read" call completes successfully.
> >>>
> >>> The difference between these timestamps is 27 seconds, which is a lot less
> >>> than the "20 minutes" that you quote.  That seems odd.
> >>>
> >> Hi Neil,
> >>
> >> My bad, I had made several attempts and must have copied the wrong dmesg
> >> trace. The above happened when I manually reverted the IP config back to
> >> its original address (when doing so the driver reconnects immediately).
> >>
> >> Here is what had happened:
> >>
> >> [ 1663.940406] nfs: server 10.0.2.17 not responding, still trying
> >> [ 2712.480325] nfs: server 10.0.2.17 OK
> >>
> >>> If you adjust
> >>>      /proc/sys/net/ipv4/tcp_retries2
> >>>
> >>> you can reduce the current timeout.
> >>> See Documentation/networking/ip-sysctl.txt for details on the setting.
> >>>
> >>> https://www.kernel.org/doc/Documentation/networking/ip-sysctl.txt
> >>>
> >>> It claims the default gives an effective timeout of 924 seconds or about 15
> >>> minutes.
> >>>
> >>> I just tried and the timeout was 1047 seconds. This is probably the next
> >>> retry after 924 seconds.
> >>>
> >>> If I reduce tcp_retries2 to '3' (well below the recommended minimum) I get
> >>> a timeout of 5 seconds.
> >>> You can possibly find a suitable number that isn't too small...
> >>>
> >> That's very interesting! Thank you very much! However, I'm a bit wary of
> >> changing TCP-stack-wide settings: NFS is only one small chunk of a much
> >> bigger network storage box, so an NFS-specific alternative would probably
> >> be better. Also, I would need a very small timeout, on the order of 10-20
> >> seconds *max*, which would probably cause other issues elsewhere. But this
> >> is very interesting indeed.
> >>
> >>> Alternately you could use NFSv4.  It will close the connection on a timeout.
> >>> In the default config I measure a 78 second timeout, which is probably more
> >>> acceptable.  This number would respond to the timeo mount option.
> >>> If I set that to 100, I get a 28 second timeout.
> >>>
> >> This is great! I had no idea, I will definitely roll NFSv4 and try that.
> >> Thanks again for your help!
> >
> > Actually ... it turns out that NFSv4 shouldn't close the connection early
> > like that.  It happens due to a bug which is now being fixed :-)
> Well, maybe I could "patch" NFSv4 here for my purpose or use the patch 
> you provided before for NFSv3, although I admit it would be easier to 
> use a stock kernel if possible.

You could.  It is certainly safer to stick with a stock kernel if possible
(and we appreciate the broader testing coverage!).

> >
> > Probably the real problem is that the TCP KEEPALIVE feature isn't working
> > properly.  NFS configures it so that keep-alives are sent at the 'timeout'
> > time and the connection should close if a reply is not seen fairly soon.
> >
> I wouldn't mind using TCP Keepalives but I am worried that I'd have to 
> change a TCP wide setting, which other applications might rely on (I 
> read that the TCP keepalive time for instance should be no less than 2 
> hours). Could NFS just have a "custom" TCP keepalive and leave the 
> global, default setting untouched?

That is exactly what NFS does - it sets the keep-alive settings just for the
TCP connection that NFS uses.
The problem is that TCP keep-alives don't quite work as required.
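For reference, those per-connection keepalive settings have a userspace analogue; a minimal sketch using Linux-specific socket options (the idle/interval/count values here are illustrative only, not the values NFS uses):

```python
import socket

def enable_keepalive(sock, idle=10, interval=5, count=3):
    # Override the system-wide keepalive defaults for this one
    # connection only -- analogous to what the in-kernel NFS client
    # does on its transport socket.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Seconds of idle time before the first probe:
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
    # Seconds between unanswered probes:
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
    # Unanswered probes before the connection is dropped:
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, count)

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
enable_keepalive(sock)
sock.close()
```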


> 
> > However, TCP does not send keepalives while there are packets in the queue
> > waiting to go out (which is appropriate), and it also doesn't check for
> > timeouts when the queue is full.
> >
> So if I understand correctly, keepalives are sent when the connection is 
> completely idle, but if the connection broke during a transfer (queue not 
> empty) then NFS would never find out, as it wouldn't send any more 
> keepalives?

Exactly.

NeilBrown



