* [PATCH 0/3] Add notifier blocks to close transport sockets when an ip address is deleted
@ 2015-12-11 21:45 Scott Mayhew
  2015-12-11 21:45 ` [PATCH 1/3] sunrpc: Add a function to close temporary transports immediately Scott Mayhew
                   ` (3 more replies)
  0 siblings, 4 replies; 9+ messages in thread
From: Scott Mayhew @ 2015-12-11 21:45 UTC (permalink / raw)
  To: bfields; +Cc: linux-nfs

A somewhat common configuration for highly available NFS v3 is to have nfsd and
lockd running at all times on the cluster nodes, and move the floating ip,
export configuration, and exported filesystem from one node to another when a
service failover or relocation occurs.

A problem arises in this sort of configuration, though, when an NFS service is
moved to another node and then moved back to the original node 'too quickly'
(i.e. before the original transport socket is closed on the first node).  When
this occurs, clients can experience delays that can last almost 15 minutes
(2 * svc_conn_age_period + time spent waiting in FIN_WAIT_1, where
svc_conn_age_period defaults to 6 minutes).  What happens is that once the
client reconnects to the original socket, the sequence numbers no longer match
up and bedlam ensues.
 
This isn't a new phenomenon -- slide 16 of this old presentation illustrates
the same scenario:
 
http://www.nfsv4bat.org/Documents/ConnectAThon/1996/nfstcp.pdf
 
One historical workaround was to set timeo=1 in the client's mount options.  The
workaround worked because once the client reconnected to the original transport
socket and the data stopped moving, we would start retransmitting at the RPC
layer.  With the timeout set to 1/10 of a second instead of the normal 60
seconds, the client's transport socket's send buffer would fill up *much* more
quickly, and once it filled up there would be a very good chance that an
incomplete send would occur (from the standpoint of the RPC layer -- at the
network layer both sides are just spraying ACKs at each other as fast as
possible).  Once that happened, we would wind up setting XPRT_CLOSE_WAIT in the
client's rpc_xprt->state field in xs_tcp_release_xprt(), and on the next
transmit the client would try to close the connection.  The FIN would actually
get ignored by the server, again because the sequence numbers were out of
whack, so the client would wait for the FIN timeout to expire, after which it
would delete the socket; upon receipt of the next packet from the server to
that port, the client would respond with an RST and things would finally go
back to normal.
 
That workaround worked until commit a9a6b52 (sunrpc: Don't start the
retransmission timer when out of socket space).  Now the client just waits for
its send buffer to empty out, which isn't going to happen in this scenario... so
we're back to waiting for the server's svc_serv->sv_temptimer, aka
svc_age_temp_xprts(), to do its thing.

These patches try to help that situation.  The first patch adds a function that
immediately closes temporary transports whose xpt_local matches the address
passed in server_addr, instead of waiting for them to be closed by the
svc_serv->sv_temptimer function.  The idea here is that if the IP address was
yanked out from under the service, then those transports are doomed and there's
no point in waiting up to 12 minutes to start cleaning them up.  The second
patch adds notifier_blocks (one for IPv4 and one for IPv6) to nfsd to call that
function.  The third patch does the same thing, but for lockd.

I've been testing these patches on a RHEL 6 rgmanager cluster as well as a
Fedora 23 pacemaker cluster.  Note that the resource agents in pacemaker do not
behave the way I initially described... the pacemaker resource agents actually
do a full tear-down and bring-up of the nfsds as part of a service relocation,
so I hacked them up to behave like the older rgmanager agents in order to test.
I tested with cthon and xfstests while moving the NFS service from one node to
the other every 60 seconds.  I also did more basic testing like taking and
holding a lock using the flock command from util-linux (see the example below)
and making sure that the client was able to reclaim the lock as I moved the
service back and forth among the cluster nodes.
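
For reference, a minimal version of that basic lock test might look something
like the following (the file path and options here are illustrative -- they
are not the exact commands used in testing):

# take and hold an exclusive lock on a file in the NFS v3 mount
flock -x /mnt/nfs/locktest -c 'sleep 600' &

# a non-blocking attempt (from another shell or another client) should fail
# while the first holder still owns the lock, before and after relocating
# the NFS service between the cluster nodes
flock -x -n /mnt/nfs/locktest -c true || echo "lock is still held"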

For this to be effective, the clients still need to mount with a lower timeout,
but it doesn't need to be as aggressive as 1/10 of a second.
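For example, something along these lines might be a reasonable starting point
(the server name, export path, and timeo value are only placeholders -- timeo
is expressed in tenths of a second, so timeo=30 is a 3 second timeout):

mount -t nfs -o vers=3,timeo=30 nfs-cluster.example.com:/export /mnt/nfs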

Also, for all this to work when the cluster nodes are running a firewall, it's
necessary to add a rule to trigger an RST.  The rule would need to be after the
rule that allows new NFS connections and before the catch-all rule that rejects
everything else with ICMP-HOST-PROHIBITED.  For a Fedora server running
firewalld, the following commands accomplish that:

firewall-cmd --direct --add-passthrough ipv4 -A IN_FedoraServer_allow \
	-m tcp -p tcp --dport 2049 -j REJECT --reject-with tcp-reset
firewall-cmd --runtime-to-permanent

A similar rule would need to be added for whatever port lockd is running on as
well.
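
For example, if lockd has been pinned to TCP port 32803 (e.g. via the
nlm_tcpport module parameter or /etc/sysconfig/nfs), a matching rule might
look like the following -- the port number is just an example and has to
match whatever port lockd is actually using:

firewall-cmd --direct --add-passthrough ipv4 -A IN_FedoraServer_allow \
	-m tcp -p tcp --dport 32803 -j REJECT --reject-with tcp-reset
firewall-cmd --runtime-to-permanent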

Scott Mayhew (3):
  sunrpc: Add a function to close temporary transports immediately
  nfsd: Register callbacks on the inetaddr_chain and inet6addr_chain
  lockd: Register callbacks on the inetaddr_chain and inet6addr_chain

 fs/lockd/svc.c                  | 74 +++++++++++++++++++++++++++++++++++++++--
 fs/nfsd/nfssvc.c                | 68 +++++++++++++++++++++++++++++++++++++
 include/linux/sunrpc/svc_xprt.h |  1 +
 net/sunrpc/svc_xprt.c           | 45 +++++++++++++++++++++++++
 4 files changed, 186 insertions(+), 2 deletions(-)

-- 
2.4.3



* [PATCH 1/3] sunrpc: Add a function to close temporary transports immediately
  2015-12-11 21:45 [PATCH 0/3] Add notifier blocks to close transport sockets when an ip address is deleted Scott Mayhew
@ 2015-12-11 21:45 ` Scott Mayhew
  2015-12-11 21:45 ` [PATCH 2/3] nfsd: Register callbacks on the inetaddr_chain and inet6addr_chain Scott Mayhew
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 9+ messages in thread
From: Scott Mayhew @ 2015-12-11 21:45 UTC (permalink / raw)
  To: bfields; +Cc: linux-nfs

Add a function svc_age_temp_xprts_now() to close temporary transports
whose xpt_local matches the address passed in server_addr immediately
instead of waiting for them to be closed by the timer function.

The function is intended to be used by notifier_blocks that will be
added to nfsd and lockd that will run when an ip address is deleted.

This will eliminate the ACK storms and client hangs that occur in
HA-NFS configurations where nfsd and lockd are left running on the cluster
nodes all the time and the NFS 'service' is migrated back and forth
within a short timeframe.

Signed-off-by: Scott Mayhew <smayhew@redhat.com>
---
 include/linux/sunrpc/svc_xprt.h |  1 +
 net/sunrpc/svc_xprt.c           | 45 +++++++++++++++++++++++++++++++++++++++++
 2 files changed, 46 insertions(+)

diff --git a/include/linux/sunrpc/svc_xprt.h b/include/linux/sunrpc/svc_xprt.h
index 78512cf..b7dabc4 100644
--- a/include/linux/sunrpc/svc_xprt.h
+++ b/include/linux/sunrpc/svc_xprt.h
@@ -128,6 +128,7 @@ struct	svc_xprt *svc_find_xprt(struct svc_serv *serv, const char *xcl_name,
 			const unsigned short port);
 int	svc_xprt_names(struct svc_serv *serv, char *buf, const int buflen);
 void	svc_add_new_perm_xprt(struct svc_serv *serv, struct svc_xprt *xprt);
+void	svc_age_temp_xprts_now(struct svc_serv *, struct sockaddr *);
 
 static inline void svc_xprt_get(struct svc_xprt *xprt)
 {
diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
index a6cbb21..7422f28 100644
--- a/net/sunrpc/svc_xprt.c
+++ b/net/sunrpc/svc_xprt.c
@@ -10,11 +10,13 @@
 #include <linux/kthread.h>
 #include <linux/slab.h>
 #include <net/sock.h>
+#include <linux/sunrpc/addr.h>
 #include <linux/sunrpc/stats.h>
 #include <linux/sunrpc/svc_xprt.h>
 #include <linux/sunrpc/svcsock.h>
 #include <linux/sunrpc/xprt.h>
 #include <linux/module.h>
+#include <linux/netdevice.h>
 #include <trace/events/sunrpc.h>
 
 #define RPCDBG_FACILITY	RPCDBG_SVCXPRT
@@ -938,6 +940,49 @@ static void svc_age_temp_xprts(unsigned long closure)
 	mod_timer(&serv->sv_temptimer, jiffies + svc_conn_age_period * HZ);
 }
 
+/* Close temporary transports whose xpt_local matches server_addr immediately
+ * instead of waiting for them to be picked up by the timer.
+ *
+ * This is meant to be called from a notifier_block that runs when an ip
+ * address is deleted.
+ */
+void svc_age_temp_xprts_now(struct svc_serv *serv, struct sockaddr *server_addr)
+{
+	struct svc_xprt *xprt;
+	struct svc_sock *svsk;
+	struct socket *sock;
+	struct list_head *le, *next;
+	LIST_HEAD(to_be_closed);
+	struct linger no_linger = {
+		.l_onoff = 1,
+		.l_linger = 0,
+	};
+
+	spin_lock_bh(&serv->sv_lock);
+	list_for_each_safe(le, next, &serv->sv_tempsocks) {
+		xprt = list_entry(le, struct svc_xprt, xpt_list);
+		if (rpc_cmp_addr(server_addr, (struct sockaddr *)
+				&xprt->xpt_local)) {
+			dprintk("svc_age_temp_xprts_now: found %p\n", xprt);
+			list_move(le, &to_be_closed);
+		}
+	}
+	spin_unlock_bh(&serv->sv_lock);
+
+	while (!list_empty(&to_be_closed)) {
+		le = to_be_closed.next;
+		list_del_init(le);
+		xprt = list_entry(le, struct svc_xprt, xpt_list);
+		dprintk("svc_age_temp_xprts_now: closing %p\n", xprt);
+		svsk = container_of(xprt, struct svc_sock, sk_xprt);
+		sock = svsk->sk_sock;
+		kernel_setsockopt(sock, SOL_SOCKET, SO_LINGER,
+				  (char *)&no_linger, sizeof(no_linger));
+		svc_close_xprt(xprt);
+	}
+}
+EXPORT_SYMBOL_GPL(svc_age_temp_xprts_now);
+
 static void call_xpt_users(struct svc_xprt *xprt)
 {
 	struct svc_xpt_user *u;
-- 
2.4.3



* [PATCH 2/3] nfsd: Register callbacks on the inetaddr_chain and inet6addr_chain
  2015-12-11 21:45 [PATCH 0/3] Add notifier blocks to close transport sockets when an ip address is deleted Scott Mayhew
  2015-12-11 21:45 ` [PATCH 1/3] sunrpc: Add a function to close temporary transports immediately Scott Mayhew
@ 2015-12-11 21:45 ` Scott Mayhew
  2015-12-11 21:46 ` [PATCH 3/3] lockd: " Scott Mayhew
  2015-12-17 19:57 ` [PATCH 0/3] Add notifier blocks to close transport sockets when an ip address is deleted J. Bruce Fields
  3 siblings, 0 replies; 9+ messages in thread
From: Scott Mayhew @ 2015-12-11 21:45 UTC (permalink / raw)
  To: bfields; +Cc: linux-nfs

Register callbacks on inetaddr_chain and inet6addr_chain to trigger
cleanup of nfsd transport sockets when an ip address is deleted.

Signed-off-by: Scott Mayhew <smayhew@redhat.com>
---
 fs/nfsd/nfssvc.c | 68 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 68 insertions(+)

diff --git a/fs/nfsd/nfssvc.c b/fs/nfsd/nfssvc.c
index ad4e237..3779a5f 100644
--- a/fs/nfsd/nfssvc.c
+++ b/fs/nfsd/nfssvc.c
@@ -14,9 +14,13 @@
 
 #include <linux/sunrpc/stats.h>
 #include <linux/sunrpc/svcsock.h>
+#include <linux/sunrpc/svc_xprt.h>
 #include <linux/lockd/bind.h>
 #include <linux/nfsacl.h>
 #include <linux/seq_file.h>
+#include <linux/inetdevice.h>
+#include <net/addrconf.h>
+#include <net/ipv6.h>
 #include <net/net_namespace.h>
 #include "nfsd.h"
 #include "cache.h"
@@ -306,10 +310,70 @@ static void nfsd_shutdown_net(struct net *net)
 	nfsd_shutdown_generic();
 }
 
+static int nfsd_inetaddr_event(struct notifier_block *this, unsigned long event,
+	void *ptr)
+{
+	struct in_ifaddr *ifa = (struct in_ifaddr *)ptr;
+	struct net_device *dev = ifa->ifa_dev->dev;
+	struct net *net = dev_net(dev);
+	struct nfsd_net *nn = net_generic(net, nfsd_net_id);
+	struct sockaddr_in sin;
+
+	if (event != NETDEV_DOWN)
+		goto out;
+
+	if (nn->nfsd_serv) {
+		dprintk("nfsd_inetaddr_event: removed %pI4\n", &ifa->ifa_local);
+		sin.sin_family = AF_INET;
+		sin.sin_addr.s_addr = ifa->ifa_local;
+		svc_age_temp_xprts_now(nn->nfsd_serv, (struct sockaddr *)&sin);
+	}
+
+out:
+	return NOTIFY_DONE;
+}
+
+static struct notifier_block nfsd_inetaddr_notifier = {
+	.notifier_call = nfsd_inetaddr_event,
+};
+
+#if IS_ENABLED(CONFIG_IPV6)
+static int nfsd_inet6addr_event(struct notifier_block *this,
+	unsigned long event, void *ptr)
+{
+	struct inet6_ifaddr *ifa = (struct inet6_ifaddr *)ptr;
+	struct net_device *dev = ifa->idev->dev;
+	struct net *net = dev_net(dev);
+	struct nfsd_net *nn = net_generic(net, nfsd_net_id);
+	struct sockaddr_in6 sin6;
+
+	if (event != NETDEV_DOWN)
+		goto out;
+
+	if (nn->nfsd_serv) {
+		dprintk("nfsd_inet6addr_event: removed %pI6\n", &ifa->addr);
+		sin6.sin6_family = AF_INET6;
+		sin6.sin6_addr = ifa->addr;
+		svc_age_temp_xprts_now(nn->nfsd_serv, (struct sockaddr *)&sin6);
+	}
+
+out:
+	return NOTIFY_DONE;
+}
+
+static struct notifier_block nfsd_inet6addr_notifier = {
+	.notifier_call = nfsd_inet6addr_event,
+};
+#endif
+
 static void nfsd_last_thread(struct svc_serv *serv, struct net *net)
 {
 	struct nfsd_net *nn = net_generic(net, nfsd_net_id);
 
+	unregister_inetaddr_notifier(&nfsd_inetaddr_notifier);
+#if IS_ENABLED(CONFIG_IPV6)
+	unregister_inet6addr_notifier(&nfsd_inet6addr_notifier);
+#endif
 	/*
 	 * write_ports can create the server without actually starting
 	 * any threads--if we get shut down before any threads are
@@ -425,6 +489,10 @@ int nfsd_create_serv(struct net *net)
 	}
 
 	set_max_drc();
+	register_inetaddr_notifier(&nfsd_inetaddr_notifier);
+#if IS_ENABLED(CONFIG_IPV6)
+	register_inet6addr_notifier(&nfsd_inet6addr_notifier);
+#endif
 	do_gettimeofday(&nn->nfssvc_boot);		/* record boot time */
 	return 0;
 }
-- 
2.4.3



* [PATCH 3/3] lockd: Register callbacks on the inetaddr_chain and inet6addr_chain
  2015-12-11 21:45 [PATCH 0/3] Add notifier blocks to close transport sockets when an ip address is deleted Scott Mayhew
  2015-12-11 21:45 ` [PATCH 1/3] sunrpc: Add a function to close temporary transports immediately Scott Mayhew
  2015-12-11 21:45 ` [PATCH 2/3] nfsd: Register callbacks on the inetaddr_chain and inet6addr_chain Scott Mayhew
@ 2015-12-11 21:46 ` Scott Mayhew
  2015-12-17 19:57 ` [PATCH 0/3] Add notifier blocks to close transport sockets when an ip address is deleted J. Bruce Fields
  3 siblings, 0 replies; 9+ messages in thread
From: Scott Mayhew @ 2015-12-11 21:46 UTC (permalink / raw)
  To: bfields; +Cc: linux-nfs

Register callbacks on inetaddr_chain and inet6addr_chain to trigger
cleanup of lockd transport sockets when an ip address is deleted.

Signed-off-by: Scott Mayhew <smayhew@redhat.com>
---
 fs/lockd/svc.c | 74 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 72 insertions(+), 2 deletions(-)

diff --git a/fs/lockd/svc.c b/fs/lockd/svc.c
index 5f31ebd..44d18ad 100644
--- a/fs/lockd/svc.c
+++ b/fs/lockd/svc.c
@@ -25,13 +25,17 @@
 #include <linux/mutex.h>
 #include <linux/kthread.h>
 #include <linux/freezer.h>
+#include <linux/inetdevice.h>
 
 #include <linux/sunrpc/types.h>
 #include <linux/sunrpc/stats.h>
 #include <linux/sunrpc/clnt.h>
 #include <linux/sunrpc/svc.h>
 #include <linux/sunrpc/svcsock.h>
+#include <linux/sunrpc/svc_xprt.h>
 #include <net/ip.h>
+#include <net/addrconf.h>
+#include <net/ipv6.h>
 #include <linux/lockd/lockd.h>
 #include <linux/nfs.h>
 
@@ -279,6 +283,68 @@ static void lockd_down_net(struct svc_serv *serv, struct net *net)
 	}
 }
 
+static int lockd_inetaddr_event(struct notifier_block *this,
+	unsigned long event, void *ptr)
+{
+	struct in_ifaddr *ifa = (struct in_ifaddr *)ptr;
+	struct sockaddr_in sin;
+
+	if (event != NETDEV_DOWN)
+		goto out;
+
+	if (nlmsvc_rqst) {
+		dprintk("lockd_inetaddr_event: removed %pI4\n",
+			&ifa->ifa_local);
+		sin.sin_family = AF_INET;
+		sin.sin_addr.s_addr = ifa->ifa_local;
+		svc_age_temp_xprts_now(nlmsvc_rqst->rq_server,
+			(struct sockaddr *)&sin);
+	}
+
+out:
+	return NOTIFY_DONE;
+}
+
+static struct notifier_block lockd_inetaddr_notifier = {
+	.notifier_call = lockd_inetaddr_event,
+};
+
+#if IS_ENABLED(CONFIG_IPV6)
+static int lockd_inet6addr_event(struct notifier_block *this,
+	unsigned long event, void *ptr)
+{
+	struct inet6_ifaddr *ifa = (struct inet6_ifaddr *)ptr;
+	struct sockaddr_in6 sin6;
+
+	if (event != NETDEV_DOWN)
+		goto out;
+
+	if (nlmsvc_rqst) {
+		dprintk("lockd_inet6addr_event: removed %pI6\n", &ifa->addr);
+		sin6.sin6_family = AF_INET6;
+		sin6.sin6_addr = ifa->addr;
+		svc_age_temp_xprts_now(nlmsvc_rqst->rq_server,
+			(struct sockaddr *)&sin6);
+	}
+
+out:
+	return NOTIFY_DONE;
+}
+
+static struct notifier_block lockd_inet6addr_notifier = {
+	.notifier_call = lockd_inet6addr_event,
+};
+#endif
+
+static void lockd_svc_exit_thread(void)
+{
+	unregister_inetaddr_notifier(&lockd_inetaddr_notifier);
+#if IS_ENABLED(CONFIG_IPV6)
+	unregister_inet6addr_notifier(&lockd_inet6addr_notifier);
+#endif
+	svc_exit_thread(nlmsvc_rqst);
+}
+
 static int lockd_start_svc(struct svc_serv *serv)
 {
 	int error;
@@ -315,7 +381,7 @@ static int lockd_start_svc(struct svc_serv *serv)
 	return 0;
 
 out_task:
-	svc_exit_thread(nlmsvc_rqst);
+	lockd_svc_exit_thread();
 	nlmsvc_task = NULL;
 out_rqst:
 	nlmsvc_rqst = NULL;
@@ -360,6 +426,10 @@ static struct svc_serv *lockd_create_svc(void)
 		printk(KERN_WARNING "lockd_up: create service failed\n");
 		return ERR_PTR(-ENOMEM);
 	}
+	register_inetaddr_notifier(&lockd_inetaddr_notifier);
+#if IS_ENABLED(CONFIG_IPV6)
+	register_inet6addr_notifier(&lockd_inet6addr_notifier);
+#endif
 	dprintk("lockd_up: service created\n");
 	return serv;
 }
@@ -428,7 +498,7 @@ lockd_down(struct net *net)
 	}
 	kthread_stop(nlmsvc_task);
 	dprintk("lockd_down: service stopped\n");
-	svc_exit_thread(nlmsvc_rqst);
+	lockd_svc_exit_thread();
 	dprintk("lockd_down: service destroyed\n");
 	nlmsvc_task = NULL;
 	nlmsvc_rqst = NULL;
-- 
2.4.3



* Re: [PATCH 0/3] Add notifier blocks to close transport sockets when an ip address is deleted
  2015-12-11 21:45 [PATCH 0/3] Add notifier blocks to close transport sockets when an ip address is deleted Scott Mayhew
                   ` (2 preceding siblings ...)
  2015-12-11 21:46 ` [PATCH 3/3] lockd: " Scott Mayhew
@ 2015-12-17 19:57 ` J. Bruce Fields
  2015-12-17 22:17   ` J. Bruce Fields
  2015-12-18 13:55   ` Scott Mayhew
  3 siblings, 2 replies; 9+ messages in thread
From: J. Bruce Fields @ 2015-12-17 19:57 UTC (permalink / raw)
  To: Scott Mayhew; +Cc: linux-nfs

On Fri, Dec 11, 2015 at 04:45:57PM -0500, Scott Mayhew wrote:
> A somewhat common configuration for highly available NFS v3 is to have nfsd and
> lockd running at all times on the cluster nodes, and move the floating ip,
> export configuration, and exported filesystem from one node to another when a
> service failover or relocation occurs.
> 
> A problem arises in this sort of configuration though when an NFS service is
> moved to another node and then moved back to the original node 'too quickly'
> (i.e. before the original transport socket is closed on the first node).  When
> this occurs, clients can experience delays that can last almost 15 minutes (2 *
> svc_conn_age_period + time spent waiting in FIN_WAIT_1).  What happens is that
> once the client reconnects to the original socket, the sequence numbers no
> longer match up and bedlam ensues.
>  
> This isn't a new phenomenon -- slide 16 of this old presentation illustrates
> the same scenario:
>  
> http://www.nfsv4bat.org/Documents/ConnectAThon/1996/nfstcp.pdf
>  
> One historical workaround was to set timeo=1 in the client's mount options.  The
> reason the workaround worked is because once the client reconnects to the
> original transport socket and the data stops moving,
> we would start retransmitting at the RPC layer.  With the timeout set to 1/10 of
> a second instead of the normal 60 seconds, the client's transport socket's send
> buffer would fill up *much* more quickly, and once it filled up
> there would be a very good chance that an incomplete send would occur (from the
> standpoint of the RPC layer -- at the network layer both sides are just spraying
> ACKs at each other as fast as possible).  Once that happens, we would wind up
> setting XPRT_CLOSE_WAIT in the client's rpc_xprt->state field in
> xs_tcp_release_xprt() and on the next transmit the client would try to close the
> connection.  Actually the FIN would get ignored by the server, again because the
> sequence numbers were out of whack, so the client would wait for the FIN timeout
> to expire, after which it would delete the socket, and upon receipt of the next
> packet from the server to that port the client would respond with a
> RST and things finally go back to normal.
>  
> That workaround used to work up until commit a9a6b52 (sunrpc: Dont start the
> retransmission timer when out of socket space).  Now the client just waits for
> its send buffer to empty out, which isn't going to happen in this scenario... so
> we're back to waiting for the server's svc_serv->sv_temptimer aka
> svc_age_temp_xprts() to do its thing.
> 
> These patches try to help that situation.  The first patch adds a function to
> close temporary transports whose xpt_local matches the address passed in
> server_addr immediately instead of waiting for them to be closed by the
> svc_serv->sv_temptimer function.  The idea here is that if the ip address was
> yanked out from under the service, then those transports are doomed and there's
> no point in waiting up to 12 minutes to start cleaning them up.  The second
> patch adds notifier_blocks (one for IPv4 and one for IPv6) to call that 
> function to nfsd.  The third patch does the same thing, but for lockd.
> 
> I've been testing these patches on a RHEL 6 rgmanager cluster as well as a
> Fedora 23 pacemaker cluster.  Note that the resource agents in pacemaker do not
> behave the way I initially described... the pacemaker resource agents actually
> do a full tear-down & bring up of the nfsd's as part of a service relocation, so
> I hacked them up to behave like the older rgmanager agents in order to test.  I
> tested with cthon and xfstests while moving the NFS service from one node to the
> other every 60 seconds.  I also did more basic testing like taking & holding a
> lock using the flock command from util-linux and making sure that the client was
> able to reclaim the lock as I moved the service back and forth among the cluster
> nodes.
> 
> For this to be effective, the clients still need to mount with a lower timeout,
> but it doesn't need to be as aggressive as 1/10 of a second.

That's just to prevent a file operation hanging too long in the case
that nfsd or ip shutdown prevents the client getting a reply?

> Also, for all this to work when the cluster nodes are running a firewall, it's
> necessary to add a rule to trigger a RST.  The rule would need to be after the
> rule that allows new NFS connections and before the catch-all rule that rejects
> everything else with ICMP-HOST-PROHIBITED.  For a Fedora server running
> firewalld, the following commands accomplish that:
> 
> firewall-cmd --direct --add-passthrough ipv4 -A IN_FedoraServer_allow \
> 	-m tcp -p tcp --dport 2049 -j REJECT --reject-with tcp-reset
> firewall-cmd --runtime-to-permanent

To make sure I understand: so in the absence of the firewall, the
client's packets arrive at a server that doesn't see them as belonging
to any connection, so it replies with a RST.  In the presence of the
firewall, the packets are rejected before they get to that point, so
there's no RST, so we need this rule to trigger the RST instead.  Is
that right?

--b.

> 
> A similar rule would need to be added for whatever port lockd is running on as
> well.
> 
> Scott Mayhew (3):
>   sunrpc: Add a function to close temporary transports immediately
>   nfsd: Register callbacks on the inetaddr_chain and inet6addr_chain
>   lockd: Register callbacks on the inetaddr_chain and inet6addr_chain
> 
>  fs/lockd/svc.c                  | 74 +++++++++++++++++++++++++++++++++++++++--
>  fs/nfsd/nfssvc.c                | 68 +++++++++++++++++++++++++++++++++++++
>  include/linux/sunrpc/svc_xprt.h |  1 +
>  net/sunrpc/svc_xprt.c           | 45 +++++++++++++++++++++++++
>  4 files changed, 186 insertions(+), 2 deletions(-)
> 
> -- 
> 2.4.3


* Re: [PATCH 0/3] Add notifier blocks to close transport sockets when an ip address is deleted
  2015-12-17 19:57 ` [PATCH 0/3] Add notifier blocks to close transport sockets when an ip address is deleted J. Bruce Fields
@ 2015-12-17 22:17   ` J. Bruce Fields
  2015-12-18 13:57     ` Scott Mayhew
  2015-12-18 13:55   ` Scott Mayhew
  1 sibling, 1 reply; 9+ messages in thread
From: J. Bruce Fields @ 2015-12-17 22:17 UTC (permalink / raw)
  To: Scott Mayhew; +Cc: linux-nfs

On Thu, Dec 17, 2015 at 02:57:08PM -0500, J. Bruce Fields wrote:
> On Fri, Dec 11, 2015 at 04:45:57PM -0500, Scott Mayhew wrote:
> > A somewhat common configuration for highly available NFS v3 is to have nfsd and
> > lockd running at all times on the cluster nodes, and move the floating ip,
> > export configuration, and exported filesystem from one node to another when a
> > service failover or relocation occurs.
> > 
> > A problem arises in this sort of configuration though when an NFS service is
> > moved to another node and then moved back to the original node 'too quickly'
> > (i.e. before the original transport socket is closed on the first node).  When
> > this occurs, clients can experience delays that can last almost 15 minutes (2 *
> > svc_conn_age_period + time spent waiting in FIN_WAIT_1).  What happens is that
> > once the client reconnects to the original socket, the sequence numbers no
> > longer match up and bedlam ensues.
> >  
> > This isn't a new phenomenon -- slide 16 of this old presentation illustrates
> > the same scenario:
> >  
> > http://www.nfsv4bat.org/Documents/ConnectAThon/1996/nfstcp.pdf
> >  
> > One historical workaround was to set timeo=1 in the client's mount options.  The
> > reason the workaround worked is because once the client reconnects to the
> > original transport socket and the data stops moving,
> > we would start retransmitting at the RPC layer.  With the timeout set to 1/10 of
> > a second instead of the normal 60 seconds, the client's transport socket's send
> > buffer would fill up *much* more quickly, and once it filled up
> > there would be a very good chance that an incomplete send would occur (from the
> > standpoint of the RPC layer -- at the network layer both sides are just spraying
> > ACKs at each other as fast as possible).  Once that happens, we would wind up
> > setting XPRT_CLOSE_WAIT in the client's rpc_xprt->state field in
> > xs_tcp_release_xprt() and on the next transmit the client would try to close the
> > connection.  Actually the FIN would get ignored by the server, again because the
> > sequence numbers were out of whack, so the client would wait for the FIN timeout
> > to expire, after which it would delete the socket, and upon receipt of the next
> > packet from the server to that port the client would respond with a
> > RST and things finally go back to normal.
> >  
> > That workaround used to work up until commit a9a6b52 (sunrpc: Dont start the
> > retransmission timer when out of socket space).  Now the client just waits for
> > its send buffer to empty out, which isn't going to happen in this scenario... so
> > we're back to waiting for the server's svc_serv->sv_temptimer aka
> > svc_age_temp_xprts() to do its thing.
> > 
> > These patches try to help that situation.  The first patch adds a function to
> > close temporary transports whose xpt_local matches the address passed in
> > server_addr immediately instead of waiting for them to be closed by the
> > svc_serv->sv_temptimer function.  The idea here is that if the ip address was
> > yanked out from under the service, then those transports are doomed and there's
> > no point in waiting up to 12 minutes to start cleaning them up.  The second
> > patch adds notifier_blocks (one for IPv4 and one for IPv6) to call that 
> > function to nfsd.  The third patch does the same thing, but for lockd.
> > 
> > I've been testing these patches on a RHEL 6 rgmanager cluster as well as a
> > Fedora 23 pacemaker cluster.  Note that the resource agents in pacemaker do not
> > behave the way I initially described... the pacemaker resource agents actually
> > do a full tear-down & bring up of the nfsd's as part of a service relocation, so
> > I hacked them up to behave like the older rgmanager agents in order to test.  I
> > tested with cthon and xfstests while moving the NFS service from one node to the
> > other every 60 seconds.  I also did more basic testing like taking & holding a
> > lock using the flock command from util-linux and making sure that the client was
> > able to reclaim the lock as I moved the service back and forth among the cluster
> > nodes.
> > 
> > For this to be effective, the clients still need to mount with a lower timeout,
> > but it doesn't need to be as aggressive as 1/10 of a second.
> 
> That's just to prevent a file operation hanging too long in the case
> that nfsd or ip shutdown prevents the client getting a reply?
> 
> > Also, for all this to work when the cluster nodes are running a firewall, it's
> > necessary to add a rule to trigger a RST.  The rule would need to be after the
> > rule that allows new NFS connections and before the catch-all rule that rejects
> > everything else with ICMP-HOST-PROHIBITED.  For a Fedora server running
> > firewalld, the following commands accomplish that:
> > 
> > firewall-cmd --direct --add-passthrough ipv4 -A IN_FedoraServer_allow \
> > 	-m tcp -p tcp --dport 2049 -j REJECT --reject-with tcp-reset
> > firewall-cmd --runtime-to-permanent
> 
> To make sure I understand: so in the absence of the firewall, the
> client's packets arrive at a server that doesn't see them as belonging
> to any connection, so it replies with a RST.  In the presence of the
> firewall, the packets are rejected before they get to that point, so
> there's no RST, so we need this rule to trigger the RST instead.  Is
> that right?

By the way it might be nice to capture this in the kernel source
someplace.   Maybe just drop some version of the above text in a new
file named Documentation/filesystems/nfs/nfs-server-ha.txt or something
similar?

Anyway, the patches look OK to me.  I'll queue them up for 4.5 if
there are no objections.

--b.


* Re: [PATCH 0/3] Add notifier blocks to close transport sockets when an ip address is deleted
  2015-12-17 19:57 ` [PATCH 0/3] Add notifier blocks to close transport sockets when an ip address is deleted J. Bruce Fields
  2015-12-17 22:17   ` J. Bruce Fields
@ 2015-12-18 13:55   ` Scott Mayhew
  2015-12-18 14:54     ` J. Bruce Fields
  1 sibling, 1 reply; 9+ messages in thread
From: Scott Mayhew @ 2015-12-18 13:55 UTC (permalink / raw)
  To: J. Bruce Fields; +Cc: linux-nfs

On Thu, 17 Dec 2015, J. Bruce Fields wrote:

> On Fri, Dec 11, 2015 at 04:45:57PM -0500, Scott Mayhew wrote:
> > A somewhat common configuration for highly available NFS v3 is to have nfsd and
> > lockd running at all times on the cluster nodes, and move the floating ip,
> > export configuration, and exported filesystem from one node to another when a
> > service failover or relocation occurs.
> > 
> > A problem arises in this sort of configuration though when an NFS service is
> > moved to another node and then moved back to the original node 'too quickly'
> > (i.e. before the original transport socket is closed on the first node).  When
> > this occurs, clients can experience delays that can last almost 15 minutes (2 *
> > svc_conn_age_period + time spent waiting in FIN_WAIT_1).  What happens is that
> > once the client reconnects to the original socket, the sequence numbers no
> > longer match up and bedlam ensues.
> >  
> > This isn't a new phenomenon -- slide 16 of this old presentation illustrates
> > the same scenario:
> >  
> > http://www.nfsv4bat.org/Documents/ConnectAThon/1996/nfstcp.pdf
> >  
> > One historical workaround was to set timeo=1 in the client's mount options.  The
> > reason the workaround worked is because once the client reconnects to the
> > original transport socket and the data stops moving,
> > we would start retransmitting at the RPC layer.  With the timeout set to 1/10 of
> > a second instead of the normal 60 seconds, the client's transport socket's send
> > buffer would fill up *much* more quickly, and once it filled up
> > there would be a very good chance that an incomplete send would occur (from the
> > standpoint of the RPC layer -- at the network layer both sides are just spraying
> > ACKs at each other as fast as possible).  Once that happens, we would wind up
> > setting XPRT_CLOSE_WAIT in the client's rpc_xprt->state field in
> > xs_tcp_release_xprt() and on the next transmit the client would try to close the
> > connection.  Actually the FIN would get ignored by the server, again because the
> > sequence numbers were out of whack, so the client would wait for the FIN timeout
> > to expire, after which it would delete the socket, and upon receipt of the next
> > packet from the server to that port the client would respond with a
> > RST and things finally go back to normal.
> >  
> > That workaround used to work up until commit a9a6b52 (sunrpc: Dont start the
> > retransmission timer when out of socket space).  Now the client just waits for
> > its send buffer to empty out, which isn't going to happen in this scenario... so
> > we're back to waiting for the server's svc_serv->sv_temptimer aka
> > svc_age_temp_xprts() to do its thing.
> > 
> > These patches try to help that situation.  The first patch adds a function to
> > close temporary transports whose xpt_local matches the address passed in
> > server_addr immediately instead of waiting for them to be closed by the
> > svc_serv->sv_temptimer function.  The idea here is that if the ip address was
> > yanked out from under the service, then those transports are doomed and there's
> > no point in waiting up to 12 minutes to start cleaning them up.  The second
> > patch adds notifier_blocks (one for IPv4 and one for IPv6) to call that 
> > function to nfsd.  The third patch does the same thing, but for lockd.
> > 
> > I've been testing these patches on a RHEL 6 rgmanager cluster as well as a
> > Fedora 23 pacemaker cluster.  Note that the resource agents in pacemaker do not
> > behave the way I initially described... the pacemaker resource agents actually
> > do a full tear-down & bring up of the nfsd's as part of a service relocation, so
> > I hacked them up to behave like the older rgmanager agents in order to test.  I
> > tested with cthon and xfstests while moving the NFS service from one node to the
> > other every 60 seconds.  I also did more basic testing like taking & holding a
> > lock using the flock command from util-linux and making sure that the client was
> > able to reclaim the lock as I moved the service back and forth among the cluster
> > nodes.
> > 
> > For this to be effective, the clients still need to mount with a lower timeout,
> > but it doesn't need to be as aggressive as 1/10 of a second.
> 
> That's just to prevent a file operation hanging too long in the case
> that nfsd or ip shutdown prevents the client getting a reply?

That statement was based on early testing actually.  I went on to test 
with timeouts of 3, 10, 30, and 60 seconds, and it no longer appeared to
make a difference.  I just forgot to remove that from my final cover
letter.

> 
> > Also, for all this to work when the cluster nodes are running a firewall, it's
> > necessary to add a rule to trigger a RST.  The rule would need to be after the
> > rule that allows new NFS connections and before the catch-all rule that rejects
> > everything else with ICMP-HOST-PROHIBITED.  For a Fedora server running
> > firewalld, the following commands accomplish that:
> > 
> > firewall-cmd --direct --add-passthrough ipv4 -A IN_FedoraServer_allow \
> > 	-m tcp -p tcp --dport 2049 -j REJECT --reject-with tcp-reset
> > firewall-cmd --runtime-to-permanent
> 
> To make sure I understand: so in the absence of the firewall, the
> client's packets arrive at a server that doesn't see them as belonging
> to any connection, so it replies with a RST.  In the presence of the
> firewall, the packets are rejected before they get to that point, so
> there's no RST, so we need this rule to trigger the RST instead.  Is
> that right?

That's correct.

-Scott

> 
> --b.
> 
> > 
> > A similar rule would need to be added for whatever port lockd is running on as
> > well.
> > 
> > Scott Mayhew (3):
> >   sunrpc: Add a function to close temporary transports immediately
> >   nfsd: Register callbacks on the inetaddr_chain and inet6addr_chain
> >   lockd: Register callbacks on the inetaddr_chain and inet6addr_chain
> > 
> >  fs/lockd/svc.c                  | 74 +++++++++++++++++++++++++++++++++++++++--
> >  fs/nfsd/nfssvc.c                | 68 +++++++++++++++++++++++++++++++++++++
> >  include/linux/sunrpc/svc_xprt.h |  1 +
> >  net/sunrpc/svc_xprt.c           | 45 +++++++++++++++++++++++++
> >  4 files changed, 186 insertions(+), 2 deletions(-)
> > 
> > -- 
> > 2.4.3



* Re: [PATCH 0/3] Add notifier blocks to close transport sockets when an ip address is deleted
  2015-12-17 22:17   ` J. Bruce Fields
@ 2015-12-18 13:57     ` Scott Mayhew
  0 siblings, 0 replies; 9+ messages in thread
From: Scott Mayhew @ 2015-12-18 13:57 UTC (permalink / raw)
  To: J. Bruce Fields; +Cc: linux-nfs

On Thu, 17 Dec 2015, J. Bruce Fields wrote:

> On Thu, Dec 17, 2015 at 02:57:08PM -0500, J. Bruce Fields wrote:
> > On Fri, Dec 11, 2015 at 04:45:57PM -0500, Scott Mayhew wrote:
> > > A somewhat common configuration for highly available NFS v3 is to have nfsd and
> > > lockd running at all times on the cluster nodes, and move the floating ip,
> > > export configuration, and exported filesystem from one node to another when a
> > > service failover or relocation occurs.
> > > 
> > > A problem arises in this sort of configuration though when an NFS service is
> > > moved to another node and then moved back to the original node 'too quickly'
> > > (i.e. before the original transport socket is closed on the first node).  When
> > > this occurs, clients can experience delays that can last almost 15 minutes (2 *
> > > svc_conn_age_period + time spent waiting in FIN_WAIT_1).  What happens is that
> > > once the client reconnects to the original socket, the sequence numbers no
> > > longer match up and bedlam ensues.
> > >  
> > > This isn't a new phenomenon -- slide 16 of this old presentation illustrates
> > > the same scenario:
> > >  
> > > http://www.nfsv4bat.org/Documents/ConnectAThon/1996/nfstcp.pdf
> > >  
> > > One historical workaround was to set timeo=1 in the client's mount options.  The
> > > reason the workaround worked is because once the client reconnects to the
> > > original transport socket and the data stops moving,
> > > we would start retransmitting at the RPC layer.  With the timeout set to 1/10 of
> > > a second instead of the normal 60 seconds, the client's transport socket's send
> > > buffer would fill up *much* more quickly, and once it filled up
> > > there would be a very good chance that an incomplete send would occur (from the
> > > standpoint of the RPC layer -- at the network layer both sides are just spraying
> > > ACKs at each other as fast as possible).  Once that happens, we would wind up
> > > setting XPRT_CLOSE_WAIT in the client's rpc_xprt->state field in
> > > xs_tcp_release_xprt() and on the next transmit the client would try to close the
> > > connection.  Actually the FIN would get ignored by the server, again because the
> > > sequence numbers were out of whack, so the client would wait for the FIN timeout
> > > to expire, after which it would delete the socket, and upon receipt of the next
> > > packet from the server to that port the client would respond with a
> > > RST and things finally go back to normal.
> > >  
> > > That workaround used to work up until commit a9a6b52 (sunrpc: Dont start the
> > > retransmission timer when out of socket space).  Now the client just waits for
> > > its send buffer to empty out, which isn't going to happen in this scenario... so
> > > we're back to waiting for the server's svc_serv->sv_temptimer aka
> > > svc_age_temp_xprts() to do its thing.
> > > 
> > > These patches try to help that situation.  The first patch adds a function to
> > > close temporary transports whose xpt_local matches the address passed in
> > > server_addr immediately instead of waiting for them to be closed by the
> > > svc_serv->sv_temptimer function.  The idea here is that if the ip address was
> > > yanked out from under the service, then those transports are doomed and there's
> > > no point in waiting up to 12 minutes to start cleaning them up.  The second
> > > patch adds notifier_blocks (one for IPv4 and one for IPv6) to call that 
> > > function to nfsd.  The third patch does the same thing, but for lockd.
> > > 
> > > I've been testing these patches on a RHEL 6 rgmanager cluster as well as a
> > > Fedora 23 pacemaker cluster.  Note that the resource agents in pacemaker do not
> > > behave the way I initially described... the pacemaker resource agents actually
> > > do a full tear-down & bring up of the nfsd's as part of a service relocation, so
> > > I hacked them up to behave like the older rgmanager agents in order to test.  I
> > > tested with cthon and xfstests while moving the NFS service from one node to the
> > > other every 60 seconds.  I also did more basic testing like taking & holding a
> > > lock using the flock command from util-linux and making sure that the client was
> > > able to reclaim the lock as I moved the service back and forth among the cluster
> > > nodes.
> > > 
> > > For this to be effective, the clients still need to mount with a lower timeout,
> > > but it doesn't need to be as aggressive as 1/10 of a second.
> > 
> > That's just to prevent a file operation hanging too long in the case
> > that nfsd or ip shutdown prevents the client getting a reply?
> > 
> > > Also, for all this to work when the cluster nodes are running a firewall, it's
> > > necessary to add a rule to trigger a RST.  The rule would need to be after the
> > > rule that allows new NFS connections and before the catch-all rule that rejects
> > > everything else with ICMP-HOST-PROHIBITED.  For a Fedora server running
> > > firewalld, the following commands accomplish that:
> > > 
> > > firewall-cmd --direct --add-passthrough ipv4 -A IN_FedoraServer_allow \
> > > 	-m tcp -p tcp --dport 2049 -j REJECT --reject-with tcp-reset
> > > firewall-cmd --runtime-to-permanent
> > 
> > To make sure I understand: so in the absence of the firewall, the
> > client's packets arrive at a server that doesn't see them as belonging
> > to any connection, so it replies with a RST.  In the presence of the
> > firewall, the packets are rejected before they get to that point, so
> > there's no RST, so we need this rule to trigger the RST instead.  Is
> > that right?
> 
> By the way it might be nice to capture this in the kernel source
> someplace.   Maybe just drop some version of the above text in a new
> file named Documentation/filesystems/nfs/nfs-server-ha.txt or something
> similar?

Sure, I can work on something after the holidays.

> 
> Anyway, the patches look OK to me.  I'll queue them up for 4.5 if
> there's no objections.

Thanks!

-Scott



* Re: [PATCH 0/3] Add notifier blocks to close transport sockets when an ip address is deleted
  2015-12-18 13:55   ` Scott Mayhew
@ 2015-12-18 14:54     ` J. Bruce Fields
  0 siblings, 0 replies; 9+ messages in thread
From: J. Bruce Fields @ 2015-12-18 14:54 UTC (permalink / raw)
  To: Scott Mayhew; +Cc: linux-nfs

On Fri, Dec 18, 2015 at 08:55:41AM -0500, Scott Mayhew wrote:
> On Thu, 17 Dec 2015, J. Bruce Fields wrote:
> 
> > On Fri, Dec 11, 2015 at 04:45:57PM -0500, Scott Mayhew wrote:
> > > For this to be effective, the clients still need to mount with a lower timeout,
> > > but it doesn't need to be as aggressive as 1/10 of a second.
> > 
> > That's just to prevent a file operation hanging too long in the case
> > that nfsd or ip shutdown prevents the client getting a reply?
> 
> That statement was based on early testing actually.  I went on to test 
> with timeouts of 3, 10, 30, and 60 seconds, and it no longer appeared to
> make a difference.  I just forgot to remove that from my final cover
> letter.

OK.  Though the window to hit the lost-reply case might be very small,
I'm not sure.

--b.


