How to automatically drop unresponsive CIFS /SMB connections

linux-cifs.vger.kernel.org archive mirror

* How to automatically drop unresponsive CIFS /SMB connections
@ 2024-02-03 22:48 R. Diez
  2024-02-04 23:26 ` Lucy Kueny
  0 siblings, 1 reply; 5+ messages in thread
From: R. Diez @ 2024-02-03 22:48 UTC (permalink / raw)
  To: linux-cifs

Hi all:

I have been mounting Windows shares for years with this script, which just boils down to "sudo mount -t cifs":

https://github.com/rdiez/Tools/blob/master/MountWindowsShares/mount-windows-shares-sudo.sh

I noticed under Linux that some applications (like Emacs), the desktop's file manager (like Caja) or even the whole desktop sometimes hang for a number of seconds. It is very annoying. It turns out the reason is that the hanging software is trying to look at a file or a directory on an unresponsive CIFS / SMB mount.

The easiest way to reproduce this issue is from outside the office: I start the VPN, connect to the Windows shares, and then tear down the VPN.

I have tried mount option "echo_interval=4", but that does not really help. The Kernel does seem to notice more quickly that the connection has become unresponsive:

Feb 03 23:24:37 rdiez4 kernel: CIFS: VFS: \\192.168.1.3 has not responded in 12 seconds. Reconnecting...

The trouble is that it tries to reconnect automatically. That means that the next application which attempts to access something under the unresponsive mount will hang again. I think each pause lasts 10 seconds, which must be hard-coded in the CIFS Kernel code. If the application retries on its own, or tries to look at more than 1 file before failing the whole operation, then the time adds up accordingly. If the shell's current directory is on such a failing path, it bugs you for a while.

What I need is for the connection to drop automatically when it becomes unresponsive, and for the client not to try to reconnect.

Alternatively, applications should fail immediately if a connection has been deemed unresponsive in the meantime, and hasn't been successfully re-established yet.

Is there a way to achieve that behaviour?

Thanks in advance,
   rdiez
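
The "fail fast instead of hanging" behaviour asked for above can be approximated from userspace, although only for programs written to do so. The following is a minimal sketch, not something proposed in this thread: the helper name probe_path and its return codes are made up for illustration. It runs stat() in a child process and stops waiting after a deadline, so the caller itself never blocks on a dead mount.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/*
 * Hypothetical helper, not part of any CIFS tool.
 * Returns 0 if stat() on 'path' succeeded, 1 if it failed quickly,
 * 2 if it did not finish within 'timeout_ms', -1 on fork/wait errors.
 */
int probe_path(const char *path, int timeout_ms)
{
    assert(path != NULL && timeout_ms >= 0);

    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child: this is the call that may block on an unresponsive mount. */
        struct stat st;
        _exit(stat(path, &st) == 0 ? 0 : 1);
    }

    /* Parent: poll for the child in 10 ms steps instead of blocking. */
    const struct timespec step = { 0, 10 * 1000 * 1000 };
    for (int waited_ms = 0; ; waited_ms += 10) {
        int status;
        pid_t r = waitpid(pid, &status, WNOHANG);
        if (r == pid)
            return (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 0 : 1;
        if (r < 0)
            return -1;
        if (waited_ms >= timeout_ms)
            break;
        nanosleep(&step, NULL);
    }

    /*
     * Timed out. Request termination but do not wait for it: a child stuck
     * inside the kernel may only go away once the filesystem call returns,
     * and it may linger as a zombie until the parent reaps it or exits.
     */
    kill(pid, SIGKILL);
    waitpid(pid, NULL, WNOHANG);
    return 2;
}
```

A caller could use something like probe_path("/mnt/share", 500) before touching the mount and skip it when the result is 2 (the mount point here is only an example). This does nothing for existing applications such as Emacs or Caja, which is why a kernel-side option is being asked for.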


* Re: How to automatically drop unresponsive CIFS /SMB connections
  2024-02-03 22:48 How to automatically drop unresponsive CIFS /SMB connections R. Diez
@ 2024-02-04 23:26 ` Lucy Kueny
  2024-02-05  9:07   ` R. Diez
  0 siblings, 1 reply; 5+ messages in thread
From: Lucy Kueny @ 2024-02-04 23:26 UTC (permalink / raw)
  To: R. Diez, linux-cifs

On 03/02/2024 23:48, R. Diez wrote:
> Hi all:
> 
> I have been mounting Windows shares for years with this script, which
> just boils down to "sudo mount -t cifs":
> 
> https://github.com/rdiez/Tools/blob/master/MountWindowsShares/mount-windows-shares-sudo.sh
> 
> I noticed under Linux that some applications (like Emacs), the desktop's
> file manager (like Caja) or even the whole desktop sometimes hang for a
> number of seconds. It is very annoying. It turns out the reason is that
> the hanging software is trying to look at a file or a directory on an
> unresponsive CIFS / SMB mount.
> 
> The easiest way to reproduce this issue is from outside the office: I
> start the VPN, connect to the Windows shares, and then tear down the VPN.
> 
> I have tried mount option "echo_interval=4", but that does not really
> help. The Kernel does seem to notice more quickly that the connection
> has become unresponsive:
> 
> Feb 03 23:24:37 rdiez4 kernel: CIFS: VFS: \\192.168.1.3 has not
> responded in 12 seconds. Reconnecting...
> 
> The trouble is, it tries to reconnect automatically. That means that the
> next application which attempts to access something under the
> unresponsive mount will hang again. I think the pauses last 10 seconds,
> it must be hard-coded in the CIFS Kernel code. If the application
> retries itself, or tries to look at more than 1 file before failing the
> whole operation, then the time adds up accordingly. If the shell's
> current directory is on such a failing path, it bugs you for a while.
> 
> What I need is for the connection to drop automatically when it becomes
> unresponsive, and for the client not to try to reconnect.
> 
> Alternatively, applications should fail immediately if a connection has
> been deemed unresponsive in the meantime, and hasn't been successfully
> re-established yet.
> 
> Is there a way to achieve that behaviour?
> 
> Thanks in advance,
>   rdiez
> 

Hi everyone,

I have written a patch that does this. It adds a mount flag to return as unavailable immediately after N reconnect attempts.
It's written against Linux 6.7 but still applies on cifs-2.6. I asked the same question on this mailing list a while ago.

Add "max_blocking_recon=1" to your mount arguments. I run it on my machines.
It probably needs polishing from somebody more experienced than me.


Best regards,
Lucy Kueny



From 98e2e44d39f4f5172e3ce416a2e65a48b51e2de1 Mon Sep 17 00:00:00 2001
From: Lucy Kueny <lucy@kueny.fr>
Date: Fri, 22 Sep 2023 11:06:20 +0200
Subject: [PATCH] Stop reconnect timeouts from freezing userspace

---
 fs/smb/client/cifsfs.c     |  3 +++
 fs/smb/client/cifsglob.h   |  6 ++++++
 fs/smb/client/connect.c    |  2 ++
 fs/smb/client/fs_context.c |  6 ++++++
 fs/smb/client/fs_context.h |  2 ++
 fs/smb/client/misc.c       | 13 +++++++++++++
 6 files changed, 32 insertions(+)

diff --git a/fs/smb/client/cifsfs.c b/fs/smb/client/cifsfs.c
index 22869cda1356..ea338b335074 100644
--- a/fs/smb/client/cifsfs.c
+++ b/fs/smb/client/cifsfs.c
@@ -694,6 +694,9 @@ cifs_show_options(struct seq_file *s, struct dentry *root)
 		seq_puts(s, ",noblocksend");
 	if (tcon->ses->server->nosharesock)
 		seq_puts(s, ",nosharesock");
+	if (tcon->ses->server->max_blocking_reconnect != DEFAULT_MAX_BLOCKING_RECONNECT)
+		seq_printf(s, ",max_blocking_recon=%lu",
+			   tcon->ses->server->max_blocking_reconnect);
 
 	if (tcon->snapshot_time)
 		seq_printf(s, ",snapshot=%llu", tcon->snapshot_time);
diff --git a/fs/smb/client/cifsglob.h b/fs/smb/client/cifsglob.h
index 032d8716f671..5128123148e1 100644
--- a/fs/smb/client/cifsglob.h
+++ b/fs/smb/client/cifsglob.h
@@ -84,6 +84,10 @@
 /* maximum number of PDUs in one compound */
 #define MAX_COMPOUND 5
 
+/* maximum failed reconnects before file access fails without waiting */
+#define DEFAULT_MAX_BLOCKING_RECONNECT 0
+
+
 /*
  * Default number of credits to keep available for SMB3.
  * This value is chosen somewhat arbitrarily. The Windows client
@@ -731,6 +735,8 @@ struct TCP_Server_Info {
 	struct delayed_work reconnect; /* reconnect workqueue job */
 	struct mutex reconnect_mutex; /* prevent simultaneous reconnects */
 	unsigned long echo_interval;
+	unsigned long max_blocking_reconnect; /* maximum failed reconnects before file access fails without waiting */
+	unsigned long reconnect_fail_cnt; /* subsequent reconnect timeout on file access */
 
 	/*
 	 * Number of targets available for reconnect. The more targets
diff --git a/fs/smb/client/connect.c b/fs/smb/client/connect.c
index 687754791bf0..999f87633baa 100644
--- a/fs/smb/client/connect.c
+++ b/fs/smb/client/connect.c
@@ -1740,6 +1740,8 @@ cifs_get_tcp_session(struct smb3_fs_context *ctx,
 			goto out_err_crypto_release;
 		}
 	}
+	tcp_ses->max_blocking_reconnect = ctx->max_blocking_reconnect;
+	tcp_ses->reconnect_fail_cnt = 0;
 	rc = ip_connect(tcp_ses);
 	if (rc < 0) {
 		cifs_dbg(VFS, "Error connecting to socket. Aborting operation.\n");
diff --git a/fs/smb/client/fs_context.c b/fs/smb/client/fs_context.c
index e45ce31bbda7..0ae441c97bff 100644
--- a/fs/smb/client/fs_context.c
+++ b/fs/smb/client/fs_context.c
@@ -154,6 +154,7 @@ const struct fs_parameter_spec smb3_fs_parameters[] = {
 	fsparam_u32("handletimeout", Opt_handletimeout),
 	fsparam_u64("snapshot", Opt_snapshot),
 	fsparam_u32("max_channels", Opt_max_channels),
+	fsparam_u32("max_blocking_recon", Opt_max_blocking_reconnect),
 
 	/* Mount options which take string value */
 	fsparam_string("source", Opt_source),
@@ -1166,6 +1167,9 @@ static int smb3_fs_context_parse_param(struct fs_context *fc,
 		if (result.uint_32 > 1)
 			ctx->multichannel = true;
 		break;
+	case Opt_max_blocking_reconnect:
+		ctx->max_blocking_reconnect = result.uint_32;
+		break;
 	case Opt_max_cached_dirs:
 		if (result.uint_32 < 1) {
 			cifs_errorf(fc, "%s: Invalid max_cached_dirs, needs to be 1 or more\n",
@@ -1615,6 +1619,8 @@ int smb3_init_fs_context(struct fs_context *fc)
 	ctx->multichannel = false;
 	ctx->max_channels = 1;
 
+	ctx->max_blocking_reconnect = DEFAULT_MAX_BLOCKING_RECONNECT;
+
 	ctx->backupuid_specified = false; /* no backup intent for a user */
 	ctx->backupgid_specified = false; /* no backup intent for a group */
 
diff --git a/fs/smb/client/fs_context.h b/fs/smb/client/fs_context.h
index 9d8d34af0211..478b3a9d3af5 100644
--- a/fs/smb/client/fs_context.h
+++ b/fs/smb/client/fs_context.h
@@ -131,6 +131,7 @@ enum cifs_param {
 	Opt_max_cached_dirs,
 	Opt_snapshot,
 	Opt_max_channels,
+	Opt_max_blocking_reconnect,
 	Opt_handletimeout,
 
 	/* Mount options which take string value */
@@ -262,6 +263,7 @@ struct smb3_fs_context {
 	__u32 handle_timeout; /* persistent and durable handle timeout in ms */
 	unsigned int max_credits; /* smb3 max_credits 10 < credits < 60000 */
 	unsigned int max_channels;
+	unsigned int max_blocking_reconnect;
 	unsigned int max_cached_dirs;
 	__u16 compression; /* compression algorithm 0xFFFF default 0=disabled */
 	bool rootfs:1; /* if it's a SMB root file system */
diff --git a/fs/smb/client/misc.c b/fs/smb/client/misc.c
index 366b755ca913..51320ec6b08a 100644
--- a/fs/smb/client/misc.c
+++ b/fs/smb/client/misc.c
@@ -1318,6 +1318,13 @@ int cifs_wait_for_server_reconnect(struct TCP_Server_Info *server, bool retry)
 		return 0;
 	}
 	timeout *= server->nr_targets;
+	/* return immediately on repeated timeouts */
+	if (server->max_blocking_reconnect &&
+		server->reconnect_fail_cnt >= server->max_blocking_reconnect) {
+		spin_unlock(&server->srv_lock);
+		cifs_dbg(FYI, "%s: not waiting for reconnect as requested\n", __func__);
+		return -EHOSTDOWN;
+	}
 	spin_unlock(&server->srv_lock);
 
 	/*
@@ -1341,12 +1348,18 @@ int cifs_wait_for_server_reconnect(struct TCP_Server_Info *server, bool retry)
 		/* are we still trying to reconnect? */
 		spin_lock(&server->srv_lock);
 		if (server->tcpStatus != CifsNeedReconnect) {
+			server->reconnect_fail_cnt = 0;
 			spin_unlock(&server->srv_lock);
 			return 0;
 		}
 		spin_unlock(&server->srv_lock);
 	} while (retry);
 
+	/* increase failed attempt counter */
+	spin_lock(&server->srv_lock);
+	server->reconnect_fail_cnt += 1;
+	spin_unlock(&server->srv_lock);
+
 	cifs_dbg(FYI, "%s: gave up waiting on reconnect\n", __func__);
 	return -EHOSTDOWN;
 }
-- 
2.42.0






* Re: How to automatically drop unresponsive CIFS /SMB connections
  2024-02-04 23:26 ` Lucy Kueny
@ 2024-02-05  9:07   ` R. Diez
  2024-02-05 12:10     ` Lucy Kueny
  0 siblings, 1 reply; 5+ messages in thread
From: R. Diez @ 2024-02-05  9:07 UTC (permalink / raw)
  To: Lucy Kueny; +Cc: linux-cifs

Hello Lucy:
> I have written a patch that does this.

Many thanks for confirming that this problem exists and bugs other people too.

Unfortunately, I lack the time and the skills to apply Kernel patches to my Linux PCs.

> It adds a mount flag to return as unavailable immediately after N reconnect attempts.
> [...]

I wonder whether the approach you followed is ideal. I do not know much about this area, so I might be wrong.

The first issue is that there appears to be no documentation about the Linux Kernel CIFS client's behaviour on connection timeout. I find this frustrating. I have done some research in the past; search for 'echo_interval' here:

https://github.com/rdiez/Tools/blob/master/MountWindowsShares/mount-windows-shares-sudo.sh

Based on my empirical research, there is a fixed 10-second timeout when reconnecting. Therefore, I would rather have an error returned as soon as the connection has been marked as lost by means of 'echo_interval', instead of after attempting to reconnect.

This may be hard to achieve if the reconnection only happens single-threaded and on demand (when a process is trying to access a file under the mount point), instead of automatically in the background.

I would also welcome a configurable connection timeout. I normally set echo_interval to 4 seconds, as 8 seconds is long enough to declare a connection unresponsive. In fact, even 8 seconds is rather high with today's networks and servers. If the connection timeout is 10 seconds, that means an application may take up to 18 seconds the first time to "unfreeze", and 10 seconds every time afterwards, which I still find too long.
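
The arithmetic in the paragraph above can be written down as a small sketch. The function names and the echo_multiplier parameter are assumptions made for the illustration. A multiplier of 2 reproduces the 8 seconds used in this message, while the kernel log quoted at the start of the thread ("has not responded in 12 seconds" with echo_interval=4) would correspond to a multiplier of 3.

```c
#include <assert.h>

/*
 * Seconds until the first blocked access returns: the time needed to
 * declare the connection unresponsive plus one reconnect wait.
 */
unsigned first_freeze_seconds(unsigned echo_interval,
                              unsigned echo_multiplier,
                              unsigned reconnect_timeout)
{
    assert(echo_multiplier > 0);
    return echo_interval * echo_multiplier + reconnect_timeout;
}

/*
 * Total seconds spent frozen if an application touches the dead mount
 * 'later_accesses' more times, each costing one reconnect wait.
 */
unsigned total_freeze_seconds(unsigned echo_interval,
                              unsigned echo_multiplier,
                              unsigned reconnect_timeout,
                              unsigned later_accesses)
{
    return first_freeze_seconds(echo_interval, echo_multiplier,
                                reconnect_timeout)
           + later_accesses * reconnect_timeout;
}
```

For example, first_freeze_seconds(4, 2, 10) gives the 18 seconds mentioned above, and three further accesses bring the total to 48 seconds.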

Let's hope that the CIFS guys take this problem seriously. The Linux desktop freezing for many seconds in a row is probably putting off many users.

Best regards,
   rdiez


* Re: How to automatically drop unresponsive CIFS /SMB connections
  2024-02-05  9:07   ` R. Diez
@ 2024-02-05 12:10     ` Lucy Kueny
  2024-02-05 15:22       ` R. Diez
  0 siblings, 1 reply; 5+ messages in thread
From: Lucy Kueny @ 2024-02-05 12:10 UTC (permalink / raw)
  To: R. Diez, linux-cifs

Hello rdiez,

> Many thanks for confirming that this problem exists and bugs other people too.

> Let's hope that the CIFS guys take this problem seriously. The Linux desktop freezing for many seconds in a row is probably putting off many users.

On my desktop, the UI can freeze for hours and require a reboot from TTY. The 'recent files' submenu in software seems to trigger repeated connection attempts.
This is a major usability issue, and probably why FUSE is used by KDE.

> The first issue is that there appears to be no documentation about the Linux Kernel CIFS client's behaviour on connection timeout. I find this frustrating. I have done some research in the past, search for 'echo_interval' here:

> Based on my empirical research, there is a fixed 10-second timeout when reconnecting. Therefore, I would rather have an error returned as soon as the connection has been marked as lost by means of 'echo_interval', instead of after attempting to reconnect.

> This may be hard to achieve if the reconnection only happens single-threaded and on demand (when a process is trying to access a file under the mount point), instead of automatically in the background.

As I understand it, the connection is managed in a separate thread.
The 10-second timeout depends on multiple smaller timeouts in different parts of the kernel. It's almost impossible to change.

I believe that the echo_interval trigger causes a reconnect attempt, so with the patch the "freezes" will be gone after echo_interval+10 seconds.

It's not the best, but it allows the use of a permanent mount to a NAS without trashing userspace if the network goes out of range.

Best regards,
Lucy Kueny


* Re: How to automatically drop unresponsive CIFS /SMB connections
  2024-02-05 12:10     ` Lucy Kueny
@ 2024-02-05 15:22       ` R. Diez
  0 siblings, 0 replies; 5+ messages in thread
From: R. Diez @ 2024-02-05 15:22 UTC (permalink / raw)
  To: Lucy Kueny; +Cc: linux-cifs

> On my desktop, the UI can freeze for hours and require a reboot from TTY.
> The 'recent files' submenu in software seems to trigger repeated connection attempts.
> This is a major usability issue, and probably why FUSE is used by KDE.
> [...]

You are right, a lost SMB connection can take a long time to recover from. I just made the mistake again of switching off my Windows 10 PC without severing an existing SMB connection from my Linux laptop. It took me several minutes to tear the connection down from a terminal window. I wasn't even using the connection actively, but I had a Caja window (MATE's file manager) open on that mount, and "umount -t cifs" kept complaining that the connection was still in use, so it refused to close it.

I even clicked on NetworkManager's "Enable networking" option, in order to disable network support completely, in the hope that this way all internal state machines would time out or fail immediately, to no avail.

Now that you talk about FUSE, I tried using GVfs for a while, but it is full of long-standing issues too. I kept some notes about it inside the comments of this script:

https://github.com/rdiez/Tools/blob/master/MountWindowsShares/mount-windows-shares-gvfs.sh

I wonder whether the KDE way is better, and how I could use it myself. I have had many issues with KDE over the years, so a long time ago I decided to stop using it. At the moment, I sway between Xfce and MATE, as both still have their share of problems, but that is a different subject altogether.

I have heard that KDE has its own "KIO Slaves", which would be the equivalent to GNOME's GVfs. Do you know if it is really more reliable for CIFS / SMB connections? Can you install and use it without installing the whole KDE desktop?

I also wonder about installing an SSHFS server on the Windows boxes. I tried Cygwin's SSH server on Windows 7, and it worked rather well, but I haven't tried its SSHFS support yet. Modern versions of Windows bring their own SSH server; I wonder if that would be more reliable for Linux clients, and whether it integrates with Windows security (so that you do not have to distribute SSH keys around).

Best regards,
   rdiez
