* Failure to reconnect after cluster failover
From: Ross Lagerwall @ 2019-02-21 16:57 UTC
To: linux-cifs
Hi,
I have an issue with SMB cluster failover. There are two Windows 2012 R2
Datacenter servers in the cluster. If the primary server is turned off,
then the secondary server becomes the primary. However, when this
happens the kernel client is not able to recover the mount.
Here is the reconnection network trace:
Time       Source        Destination   Protocol  Length  Info
16.640530  10.71.217.53  10.71.217.50  SMB2      172     Negotiate Protocol Request
16.641723  10.71.217.50  10.71.217.53  SMB2      318     Negotiate Protocol Response
16.641799  10.71.217.53  10.71.217.50  SMB2      190     Session Setup Request, NTLMSSP_NEGOTIATE
16.642148  10.71.217.50  10.71.217.53  SMB2      442     Session Setup Response, Error: STATUS_MORE_PROCESSING_REQUIRED, NTLMSSP_CHALLENGE
16.642201  10.71.217.53  10.71.217.50  SMB2      562     Session Setup Request, NTLMSSP_AUTH, User: clusterad.local7337\Administrator
16.656407  10.71.217.50  10.71.217.53  SMB2      142     Session Setup Response
16.656492  10.71.217.53  10.71.217.50  SMB2      190     Tree Connect Request Tree: \\10.71.217.50\smbshare
16.656916  10.71.217.50  10.71.217.53  SMB2      143     Tree Connect Response, Error: STATUS_BAD_NETWORK_NAME
16.659249  10.71.217.53  10.71.217.50  SMB2      190     Tree Connect Request Tree: \\10.71.217.50\smbshare
16.659635  10.71.217.50  10.71.217.53  SMB2      143     Tree Connect Response, Error: STATUS_BAD_NETWORK_NAME
20.224591  10.71.217.53  10.71.217.50  SMB2      182     Tree Connect Request Tree: \\10.71.217.50\IPC$
20.225344  10.71.217.50  10.71.217.53  SMB2      150     Tree Connect Response
20.225449  10.71.217.53  10.71.217.50  SMB2      216     Ioctl Request FSCTL_VALIDATE_NEGOTIATE_INFO
20.225934  10.71.217.50  10.71.217.53  SMB2      206     Ioctl Response FSCTL_VALIDATE_NEGOTIATE_INFO
20.225975  10.71.217.53  10.71.217.50  SMB2      190     Tree Connect Request Tree: \\10.71.217.50\smbshare
20.226355  10.71.217.50  10.71.217.53  SMB2      143     Tree Connect Response, Error: STATUS_BAD_NETWORK_NAME
22.240595  10.71.217.53  10.71.217.50  SMB2      190     Tree Connect Request Tree: \\10.71.217.50\smbshare
22.241159  10.71.217.50  10.71.217.53  SMB2      143     Tree Connect Response, Error: STATUS_BAD_NETWORK_NAME
24.256590  10.71.217.53  10.71.217.50  SMB2      190     Tree Connect Request Tree: \\10.71.217.50\smbshare
24.257380  10.71.217.50  10.71.217.53  SMB2      143     Tree Connect Response, Error: STATUS_BAD_NETWORK_NAME
...
40.384609  10.71.217.53  10.71.217.50  SMB2      190     Tree Connect Request Tree: \\10.71.217.50\smbshare
40.385135  10.71.217.50  10.71.217.53  SMB2      143     Tree Connect Response, Error: STATUS_BAD_NETWORK_NAME
41.772006  10.71.217.53  10.71.217.50  SMB2      190     Tree Connect Request Tree: \\10.71.217.50\smbshare
41.772562  10.71.217.50  10.71.217.53  SMB2      143     Tree Connect Response, Error: STATUS_NETWORK_NAME_DELETED
41.772641  10.71.217.53  10.71.217.50  SMB2      190     Tree Connect Request Tree: \\10.71.217.50\smbshare
41.773037  10.71.217.50  10.71.217.53  SMB2      143     Tree Connect Response, Error: STATUS_NETWORK_NAME_DELETED
42.400589  10.71.217.53  10.71.217.50  SMB2      190     Tree Connect Request Tree: \\10.71.217.50\smbshare
...
After the secondary server takes over (presumably once it stops
returning STATUS_BAD_NETWORK_NAME), it then returns
STATUS_NETWORK_NAME_DELETED indefinitely.
This can be fixed by delaying the tree connect to IPC$ until after the
tree connect to the share succeeds. The server then no longer returns
STATUS_NETWORK_NAME_DELETED and instead responds successfully. I'm not
sure why the server behaves like this and I'm not sure if the client is
doing something wrong. I found this out because it used to work on older
kernels before b327a717e506 ("CIFS: make IPC a regular tcon").
Here is the patch that makes it work:
diff --git a/fs/cifs/smb2pdu.c b/fs/cifs/smb2pdu.c
index dba986524917..1f97ed6459bf 100644
--- a/fs/cifs/smb2pdu.c
+++ b/fs/cifs/smb2pdu.c
@@ -2864,7 +2864,14 @@ void smb2_reconnect_server(struct work_struct *work)

 	spin_unlock(&cifs_tcp_ses_lock);

+	rc = 0;
 	list_for_each_entry_safe(tcon, tcon2, &tmp_list, rlist) {
+		if (rc) {
+			list_del_init(&tcon->rlist);
+			cifs_put_tcon(tcon);
+			continue;
+		}
+
 		rc = smb2_reconnect(SMB2_INTERNAL_CMD, tcon);
 		if (!rc)
 			cifs_reopen_persistent_handles(tcon);
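To make the effect of the patch easier to see outside the kernel, here is a minimal user-space sketch of the loop's control flow. It is only a model under stated assumptions: `struct fake_tcon` and `reconnect_until_first_failure` are hypothetical names, the array stands in for the kernel's `tmp_list`, and the share is assumed to come before IPC$ in that list, as the trace suggests. The real code also drops each skipped entry with `list_del_init()` and `cifs_put_tcon()`, which the sketch does not model.

```c
#include <stddef.h>

/* Hypothetical stand-in for a tree connection; NOT the kernel's struct cifs_tcon. */
struct fake_tcon {
	const char *name;  /* share name, e.g. "smbshare" or "IPC$" */
	int connect_rc;    /* result a reconnect attempt would return (0 = success) */
	int attempted;     /* set to 1 if the loop tried to reconnect this entry */
};

/*
 * Walk the entries in order, as the patched loop does: once one reconnect
 * fails (rc != 0), every later entry is skipped rather than reconnected.
 * Returns the number of entries for which a reconnect was attempted.
 */
static int reconnect_until_first_failure(struct fake_tcon *tcons, size_t n)
{
	int rc = 0;
	int attempts = 0;

	for (size_t i = 0; i < n; i++) {
		if (rc) {
			/* Patched behaviour: skip this entry in this pass. */
			tcons[i].attempted = 0;
			continue;
		}
		rc = tcons[i].connect_rc;  /* models smb2_reconnect() */
		tcons[i].attempted = 1;
		attempts++;
	}
	return attempts;
}
```

With entries ordered { smbshare (fails), IPC$ (would succeed) }, only smbshare is attempted in that pass, so the IPC$ tree connect is deferred until a later pass in which the share reconnects first.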
Can anyone give any more info on this oddity and whether this is a
useful patch?
Thanks,
--
Ross Lagerwall
* Re: Failure to reconnect after cluster failover
From: Steve French @ 2019-02-21 17:06 UTC
To: Ross Lagerwall; +Cc: CIFS
A couple of quick thoughts.
Does this work on current kernels (5.0, for example)?
I was thinking about patches that might affect this, such as:
- "cifs: connect to servername instead of IP for IPC$ share"
- "smb3: on reconnect set PreviousSessionId field"
- Paulo's patches (which have a cifs-utils corequisite) to reconnect to
the new IP address if the hostname's IP address changed, and his
"add support for failover" patches
- Paulo's patch to remove trailing slashes from server UNC name
--
Thanks,
Steve
* RE: Failure to reconnect after cluster failover
From: Tom Talpey @ 2019-02-21 17:59 UTC
To: Steve French, Ross Lagerwall; +Cc: CIFS
The reconnect is apparently using a dotted-quad as the servername, and you can see the auth is forced to NTLM as a consequence. Is that the way you initially mounted the share (i.e. mount 10.71.217.50:/smbshare /mnt)?
* Re: Failure to reconnect after cluster failover
From: Ross Lagerwall @ 2019-02-22 17:16 UTC
To: Tom Talpey, Steve French; +Cc: CIFS
On 2/21/19 5:59 PM, Tom Talpey wrote:
> The reconnect is apparently using a dotted-quad as the servername, and you can see the auth is forced to NTLM as a consequence. Is that the way you initially mounted the share (i.e. mount 10.71.217.50:/smbshare /mnt)?
>
> -----Original Message-----
> From: linux-cifs-owner@vger.kernel.org <linux-cifs-owner@vger.kernel.org> On Behalf Of Steve French
> Sent: Thursday, February 21, 2019 9:07 AM
> To: Ross Lagerwall <ross.lagerwall@citrix.com>
> Cc: CIFS <linux-cifs@vger.kernel.org>
> Subject: Re: Failure to reconnect after cluster failvoer
>
> Couple quick thoughts.
>
> Does this work on current kernels (5.0 for example).
>
> Was thinking about patches that might affect this like:
> - "cifs: connect to servername instead of IP for IPC$ share"
> - "smb3: on reconnect set PreviousSessionId field"
> - Paulo's patches (has cifs-utils coreq) to reconnect to new IP
> address if hostname's IP address changed and his add support for
> failover
> - Paulo's patch to remove trailing slashes from server UNC name
>
I've reproduced this with 5.0-rc7 and the latest cifs-utils from git.
The share was mounted as follows (yes, by IP):
mount.cifs -o vers=3.0,cache=loose,actimeo=0,username=x,domain=y,password=z '//10.71.217.31/smbshare' /mnt
Here is the tcpdump when it fails to reconnect properly:
http://s000.tinyupload.com/index.php?file_id=55518118986864684971
The initial connection is at timestamp 0s, reconnection at 13s,
STATUS_NETWORK_NAME_DELETED at 60s.
For comparison, here is a tcpdump using the "fix" from my previous mail:
http://s000.tinyupload.com/index.php?file_id=04243963024741599425
The initial connection is at timestamp 0s, reconnection at 34s,
successful read request at 215s.
Note that the tree connect for IPC$ only happens _after_ the tree
connect for the share succeeds.
Thanks,
--
Ross Lagerwall
* RE: Failure to reconnect after cluster failover
From: Tom Talpey @ 2019-02-22 23:25 UTC
To: Ross Lagerwall, Steve French; +Cc: CIFS
> -----Original Message-----
> From: Ross Lagerwall <ross.lagerwall@citrix.com>
> Sent: Friday, February 22, 2019 9:17 AM
> To: Tom Talpey <ttalpey@microsoft.com>; Steve French
> <smfrench@gmail.com>
> Cc: CIFS <linux-cifs@vger.kernel.org>
> Subject: Re: Failure to reconnect after cluster failvoer
>
> On 2/21/19 5:59 PM, Tom Talpey wrote:
> > The reconnect is apparently using a dotted-quad as the servername, and you
> can see the auth is forced to NTLM as a consequence. Is that the way you
> initially mounted the share (i.e. mount 10.71.217.50:/smbshare /mnt)?
> >
> > -----Original Message-----
> > From: linux-cifs-owner@vger.kernel.org <linux-cifs-owner@vger.kernel.org>
> On Behalf Of Steve French
> > Sent: Thursday, February 21, 2019 9:07 AM
> > To: Ross Lagerwall <ross.lagerwall@citrix.com>
> > Cc: CIFS <linux-cifs@vger.kernel.org>
> > Subject: Re: Failure to reconnect after cluster failvoer
> >
> > Couple quick thoughts.
> >
> > Does this work on current kernels (5.0 for example).
> >
> > Was thinking about patches that might affect this like:
> > - "cifs: connect to servername instead of IP for IPC$ share"
> > - "smb3: on reconnect set PreviousSessionId field"
> > - Paulo's patches (has cifs-utils coreq) to reconnect to new IP
> > address if hostname's IP address changed and his add support for
> > failover
> > - Paulo's patch to remove trailing slashes from server UNC name
> >
> I've reproduced this with 5.0-rc7 and the latest cifs-utils from git.
> The share was mounted as follows (yes, by IP):
>
> mount.cifs -o
> vers=3.0,cache=loose,actimeo=0,username=x,domain=y,password=z
> '//10.71.217.31/smbshare' /mnt
>
> Here is the tcpdump when it fails to reconnect properly:
...
>
> The initial connection is at timestamp 0s, reconnection at 13s,
> STATUS_NETWORK_NAME_DELETED at 60s.
>
> For comparison, here is a tcpdump using the "fix" from my previous mail:
...
>
> The initial connection is at timestamp 0s, reconnection at 34s,
> successful read request at 215s.
>
> Note that the tree connect for IPC$ only happens _after_ the tree
> connect for the share succeeds.
Thanks for the full traces, they clarify the situation. But, I don’t see any
meaningful difference in the client behavior. The ordering of the two
treeconnects is the same between the two - initially, "IPC$" then
"smbshare", and on reconnect, the other way around. So, I'm unclear
whether your patch did anything.
The STATUS_NETWORK_NAME_DELETED is a consequence of the failed
re-establishment of the tree connect, and is not itself the problem. The
server is simply timing out the treeid, since the client did not successfully
reclaim it. The repeated STATUS_BAD_NETWORK_NAME is the issue.
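For reference when reading the traces, the status names discussed here are NTSTATUS codes carried in the SMB2 response header. A minimal lookup sketch follows; `ntstatus_name` is a hypothetical helper written for this illustration (it is not a kernel or Wireshark function), and the numeric values are the standard NTSTATUS constants as I understand them, so they are worth checking against the NTSTATUS reference before relying on them.

```c
#include <stdint.h>

/*
 * Hypothetical helper: map the NTSTATUS values seen in this thread's
 * traces to their symbolic names. Only these four codes are covered.
 */
static const char *ntstatus_name(uint32_t status)
{
	switch (status) {
	case 0x00000000u: return "STATUS_SUCCESS";
	case 0xC0000016u: return "STATUS_MORE_PROCESSING_REQUIRED";
	case 0xC00000C9u: return "STATUS_NETWORK_NAME_DELETED";
	case 0xC00000CCu: return "STATUS_BAD_NETWORK_NAME";
	default:          return "unknown";
	}
}
```

This can help when filtering a capture by raw status value instead of by name, for example to separate the STATUS_BAD_NETWORK_NAME phase from the later STATUS_NETWORK_NAME_DELETED phase.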
Are you sure the clustered server is recovering properly when you are
forcing the failover? For example, if it's a two-node cluster, maybe node A
can take over node B, but node B has issues taking over node A. Is there
anything relevant in the server logs?
Tom.
* Re: Failure to reconnect after cluster failover
From: Ross Lagerwall @ 2019-02-25 13:13 UTC
To: Tom Talpey, Steve French; +Cc: CIFS
On 2/22/19 11:25 PM, Tom Talpey wrote:
>> -----Original Message-----
>> From: Ross Lagerwall <ross.lagerwall@citrix.com>
>> Sent: Friday, February 22, 2019 9:17 AM
>> To: Tom Talpey <ttalpey@microsoft.com>; Steve French
>> <smfrench@gmail.com>
>> Cc: CIFS <linux-cifs@vger.kernel.org>
>> Subject: Re: Failure to reconnect after cluster failvoer
>>
>> On 2/21/19 5:59 PM, Tom Talpey wrote:
>>> The reconnect is apparently using a dotted-quad as the servername, and you
>> can see the auth is forced to NTLM as a consequence. Is that the way you
>> initially mounted the share (i.e. mount 10.71.217.50:/smbshare /mnt)?
>>>
>>> -----Original Message-----
>>> From: linux-cifs-owner@vger.kernel.org <linux-cifs-owner@vger.kernel.org>
>> On Behalf Of Steve French
>>> Sent: Thursday, February 21, 2019 9:07 AM
>>> To: Ross Lagerwall <ross.lagerwall@citrix.com>
>>> Cc: CIFS <linux-cifs@vger.kernel.org>
>>> Subject: Re: Failure to reconnect after cluster failvoer
>>>
>>> Couple quick thoughts.
>>>
>>> Does this work on current kernels (5.0 for example).
>>>
>>> Was thinking about patches that might affect this like:
>>> - "cifs: connect to servername instead of IP for IPC$ share"
>>> - "smb3: on reconnect set PreviousSessionId field"
>>> - Paulo's patches (has cifs-utils coreq) to reconnect to new IP
>>> address if hostname's IP address changed and his add support for
>>> failover
>>> - Paulo's patch to remove trailing slashes from server UNC name
>>>
>> I've reproduced this with 5.0-rc7 and the latest cifs-utils from git.
>> The share was mounted as follows (yes, by IP):
>>
>> mount.cifs -o
>> vers=3.0,cache=loose,actimeo=0,username=x,domain=y,password=z
>> '//10.71.217.31/smbshare' /mnt
>>
>> Here is the tcpdump when it fails to reconnect properly:
> ...
>>
>> The initial connection is at timestamp 0s, reconnection at 13s,
>> STATUS_NETWORK_NAME_DELETED at 60s.
>>
>> For comparison, here is a tcpdump using the "fix" from my previous mail:
> ...
>>
>> The initial connection is at timestamp 0s, reconnection at 34s,
>> successful read request at 215s.
>>
>> Note that the tree connect for IPC$ only happens _after_ the tree
>> connect for the share succeeds.
>
> Thanks for the full traces, they clarify the situation. But, I don’t see any
> meaningful difference in the client behavior. The ordering of the two
> treeconnects is the same between the two - initially, "IPC$" then
> "smbshare", and on reconnect, the other way around. So, I'm unclear
> whether your patch did anything.
There is definitely a difference. Before the patch, on reconnect the client:
* Connects to "smbshare" which fails
* Then connects to "IPC$" which succeeds
* Then tries again to connect to smbshare which fails repeatedly
After the patch, on reconnect the client:
* Connects to "smbshare" which fails
* Then tries again to connect to "smbshare" which succeeds after several
retries
* Then tries to connect to "IPC$" which succeeds
This subtle reordering somehow makes it work. It may indeed be a server
bug rather than a client bug. I was hoping someone could shed some light
on this.
>
> The STATUS_NETWORK_NAME_DELETED is a consequence of the failed
> re-establishment of the tree connect, and is not itself the problem. The
> server is simply timing out the treeid, since the client did not successfully
> reclaim it. The repeated STATUS_BAD_NETWORK_NAME is the issue.
>
> Are you sure the clustered server is recovering properly when you are
> forcing the failover? For example, if it's a two-node cluster, maybe node A
> can take over node B, but node B has issues taking over node A. Is there
> anything relevant in the server logs?
>
It's a two-node cluster. The behaviour happens reliably when failing
over in either direction. After failover, the server state is consistent:
e.g. after a failover from node A to node B, node B shows itself as the
primary server and node A is marked as down. I couldn't find
anything interesting in the server logs.
Thanks,
--
Ross Lagerwall
* RE: Failure to reconnect after cluster failover
From: Tom Talpey @ 2019-02-27 14:16 UTC
To: Ross Lagerwall, Steve French; +Cc: CIFS
> -----Original Message-----
> From: Ross Lagerwall <ross.lagerwall@citrix.com>
> Sent: Monday, February 25, 2019 8:14 AM
> To: Tom Talpey <ttalpey@microsoft.com>; Steve French
> <smfrench@gmail.com>
> Cc: CIFS <linux-cifs@vger.kernel.org>
> Subject: Re: Failure to reconnect after cluster failvoer
>
> On 2/22/19 11:25 PM, Tom Talpey wrote:
> >> -----Original Message-----
> >> From: Ross Lagerwall <ross.lagerwall@citrix.com>
> >> Sent: Friday, February 22, 2019 9:17 AM
> >> To: Tom Talpey <ttalpey@microsoft.com>; Steve French
> >> <smfrench@gmail.com>
> >> Cc: CIFS <linux-cifs@vger.kernel.org>
> >> Subject: Re: Failure to reconnect after cluster failvoer
> >>
> >> On 2/21/19 5:59 PM, Tom Talpey wrote:
> >>> The reconnect is apparently using a dotted-quad as the servername, and
> you
> >> can see the auth is forced to NTLM as a consequence. Is that the way you
> >> initially mounted the share (i.e. mount 10.71.217.50:/smbshare /mnt)?
> >>>
> >>> -----Original Message-----
> >>> From: linux-cifs-owner@vger.kernel.org <linux-cifs-
> owner@vger.kernel.org>
> >> On Behalf Of Steve French
> >>> Sent: Thursday, February 21, 2019 9:07 AM
> >>> To: Ross Lagerwall <ross.lagerwall@citrix.com>
> >>> Cc: CIFS <linux-cifs@vger.kernel.org>
> >>> Subject: Re: Failure to reconnect after cluster failvoer
> >>>
> >>> Couple quick thoughts.
> >>>
> >>> Does this work on current kernels (5.0 for example).
> >>>
> >>> Was thinking about patches that might affect this like:
> >>> - "cifs: connect to servername instead of IP for IPC$ share"
> >>> - "smb3: on reconnect set PreviousSessionId field"
> >>> - Paulo's patches (has cifs-utils coreq) to reconnect to new IP
> >>> address if hostname's IP address changed and his add support for
> >>> failover
> >>> - Paulo's patch to remove trailing slashes from server UNC name
> >>>
> >> I've reproduced this with 5.0-rc7 and the latest cifs-utils from git.
> >> The share was mounted as follows (yes, by IP):
> >>
> >> mount.cifs -o
> >> vers=3.0,cache=loose,actimeo=0,username=x,domain=y,password=z
> >> '//10.71.217.31/smbshare' /mnt
> >>
> >> Here is the tcpdump when it fails to reconnect properly:
> > ...
> >>
> >> The initial connection is at timestamp 0s, reconnection at 13s,
> >> STATUS_NETWORK_NAME_DELETED at 60s.
> >>
> >> For comparison, here is a tcpdump using the "fix" from my previous mail:
> > ...
> >>
> >> The initial connection is at timestamp 0s, reconnection at 34s,
> >> successful read request at 215s.
> >>
> >> Note that the tree connect for IPC$ only happens _after_ the tree
> >> connect for the share succeeds.
> >
> > Thanks for the full traces, they clarify the situation. But, I don’t see any
> > meaningful difference in the client behavior. The ordering of the two
> > treeconnects is the same between the two - initially, "IPC$" then
> > "smbshare", and on reconnect, the other way around. So, I'm unclear
> > whether your patch did anything.
>
> There is definitely a difference. Before the patch, on reconnect the client:
I'm still not so sure the difference is relevant. The timing is a bit different, but
in itself the IPC$ treeconnect isn't actually used, and in any case it succeeds
in both scenarios. So, I'm thinking it's either the timing, or coincidence.
> * Connects to "smbshare" which fails
> * Then connects to "IPC$" which succeeds
> * Then tries again to connect to smbshare which fails repeatedly
Here's what I see:
Event                     Timestamp  Result / notes
Connection lost           25.97      Server sends many RST to client
Connection reestablished  34.17
Treeconnect to smbshare   34.17      STATUS_BAD_NETWORK_NAME (retries with same result every 2 sec)
Treeconnect to IPC$       34.18      success
Treeconnect to smbshare   60.38      STATUS_NETWORK_NAME_DELETED (etc)
> After the patch, on reconnect the client:
>
> * Connects to "smbshare" which fails
> * Then tries again to connect to "smbshare" which succeeds after several
> retries
> * Then tries to connect to "IPC$" which succeeds
This time:
Event                     Timestamp  Result / notes
Connection lost           9.81       Server sends RST
Connection reestablished  9.82       status 0xc0000466 (some weird disk hardware status)
Connection lost           13.53      Server sends RST
Connection reestablished  13.53
Treeconnect to smbshare   13.63      STATUS_BAD_NETWORK_NAME (retries with same result every 2 sec)
Treeconnect to smbshare   43.90      success (about 30 secs, 17 retries elapsed)
Treeconnect to IPC$       43.90      success
So, the main effect of your patch is that the IPC$ attempt happens a lot *later*.
It certainly didn't affect the success of the smbshare treeconnect, since it happened
only after that succeeded! And I don't see how deferring an unrelated treeconnect
would help that. I bet the result would be the same if the IPC$ treeconnect didn't
happen at all.
I really think there's something wrong with your server, and not because of a bug.
Unfortunately both Steve and I are at FAST'19 and Vault here in Boston, so we're
not able to get much done. I'd love to understand this better, though...
Tom.
> This subtle reordering somehow makes it work. It may indeed be a server
> bug rather than a client bug. I was hoping someone could shed some light
> on this.
>
> >
> > The STATUS_NETWORK_NAME_DELETED is a consequence of the failed
> > re-establishment of the tree connect, and is not itself the problem. The
> > server is simply timing out the treeid, since the client did not successfully
> > reclaim it. The repeated STATUS_BAD_NETWORK_NAME is the issue.
> >
> > Are you sure the clustered server is recovering properly when you are
> > forcing the failover? For example, if it's a two-node cluster, maybe node A
> > can take over node B, but node B has issues taking over node A. Is there
> > anything relevant in the server logs?
> >
>
> It's a two node cluster. The behaviour happens reliably when failing
> over either way. After failover, the server state is consistent. E.g.
> after a failover from node A to node B, node B shows itself as the
> primary server and the node A is marked as down. I couldn't find
> anything interesting in the server logs.
>
> Thanks,
> --
> Ross Lagerwall