* [PATCH v1 0/5] NFSv3 client RDMA multipath enhancements
@ 2021-01-21 19:10 Dan Aloni
  2021-01-21 19:10 ` [PATCH v1 1/5] sunrpc: Allow specifying a vector of IP addresses for nconnect Dan Aloni
                   ` (5 more replies)
  0 siblings, 6 replies; 8+ messages in thread
From: Dan Aloni @ 2021-01-21 19:10 UTC (permalink / raw)
  To: linux-nfs; +Cc: Trond Myklebust, Anna Schumaker

Hi,

The purpose of the following changes is to allow specifying multiple
target IP addresses in a single mount. Combining this with nconnect and
servers that support exposing multiple ports, we can achieve load
balancing and much greater throughput, especially on RDMA setups,
even with the older NFSv3 protocol.

The changes add a new `remoteports=<IP-address-ranges>` mount option that
provides a group of IP addresses, from which `nconnect` at the sunrpc
level picks target transport addresses in round-robin fashion. There is
also an accompanying `localports` parameter that binds transports to
local addresses, so that the source address is controlled and the
transports do not all go through a single local interface.

This patchset targets the linux-next tree.

Dan Aloni (5):
  sunrpc: Allow specifying a vector of IP addresses for nconnect
  xprtrdma: Bind to a local address if requested
  nfs: Extend nconnect with remoteports and localports mount params
  sunrpc: Add srcaddr to xprt sysfs debug
  nfs: Increase NFS_MAX_CONNECTIONS

 fs/nfs/client.c                            |  24 +++
 fs/nfs/fs_context.c                        | 173 ++++++++++++++++++++-
 fs/nfs/internal.h                          |   4 +
 include/linux/nfs_fs_sb.h                  |   2 +
 include/linux/sunrpc/clnt.h                |   9 ++
 include/linux/sunrpc/xprt.h                |   1 +
 net/sunrpc/clnt.c                          |  47 ++++++
 net/sunrpc/debugfs.c                       |   8 +-
 net/sunrpc/xprtrdma/svc_rdma_backchannel.c |   2 +-
 net/sunrpc/xprtrdma/transport.c            |  17 +-
 net/sunrpc/xprtrdma/verbs.c                |  15 +-
 net/sunrpc/xprtrdma/xprt_rdma.h            |   5 +-
 net/sunrpc/xprtsock.c                      |  49 +++---
 13 files changed, 329 insertions(+), 27 deletions(-)

-- 
2.26.2


^ permalink raw reply	[flat|nested] 8+ messages in thread

* [PATCH v1 1/5] sunrpc: Allow specifying a vector of IP addresses for nconnect
From: Dan Aloni @ 2021-01-21 19:10 UTC (permalink / raw)
  To: linux-nfs; +Cc: Trond Myklebust, Anna Schumaker

This adds an `rpc_portgroup` structure that describes a group of IP
addresses comprising one logical server with multiple ports. The remote
endpoint can be a single server exposing multiple network interfaces, or
multiple machines implementing a distributed server architecture.

Combined with nconnect, the additional transports make use of the given
addresses so that the connections are spread across the ports.

Signed-off-by: Dan Aloni <dan@kernelim.com>
---
 include/linux/sunrpc/clnt.h |  9 +++++++
 net/sunrpc/clnt.c           | 47 +++++++++++++++++++++++++++++++++++++
 2 files changed, 56 insertions(+)

diff --git a/include/linux/sunrpc/clnt.h b/include/linux/sunrpc/clnt.h
index 02e7a5863d28..f8c0c33281a8 100644
--- a/include/linux/sunrpc/clnt.h
+++ b/include/linux/sunrpc/clnt.h
@@ -115,12 +115,21 @@ struct rpc_procinfo {
 	const char *		p_name;		/* name of procedure */
 };
 
+#define RPC_MAX_PORTS 64
+
+struct rpc_portgroup {
+	int nr;
+	struct sockaddr_storage addrs[RPC_MAX_PORTS];
+};
+
 struct rpc_create_args {
 	struct net		*net;
 	int			protocol;
 	struct sockaddr		*address;
 	size_t			addrsize;
 	struct sockaddr		*saddress;
+	struct rpc_portgroup    *localports;    /* additional local addresses */
+	struct rpc_portgroup    *remoteports;   /* additional remote addresses */
 	const struct rpc_timeout *timeout;
 	const char		*servername;
 	const char		*nodename;
diff --git a/net/sunrpc/clnt.c b/net/sunrpc/clnt.c
index 612f0a641f4c..5f335a873f03 100644
--- a/net/sunrpc/clnt.c
+++ b/net/sunrpc/clnt.c
@@ -500,6 +500,44 @@ static struct rpc_clnt *rpc_create_xprt(struct rpc_create_args *args,
 	return clnt;
 }
 
+struct rpc_clnt_portgroup_iter {
+	struct rpc_portgroup *pg;
+	int idx;
+};
+
+static void take_iter_portgroup_addr(struct rpc_clnt_portgroup_iter *iter,
+				struct sockaddr	**address)
+{
+	struct rpc_portgroup *pg = iter->pg;
+	struct sockaddr	*existing = *address;
+	struct sockaddr	*new;
+
+	if (!pg || pg->nr == 0)
+		return;
+	if (iter->idx >= pg->nr)
+		iter->idx = 0;
+
+	/* Take port from existing address, or use autobind (0) */
+	new = (struct sockaddr *)&pg->addrs[iter->idx++];
+
+	switch (new->sa_family) {
+	case AF_INET:
+		((struct sockaddr_in *)new)->sin_port =
+			existing ? ((struct sockaddr_in *)existing)->sin_port
+			         : 0;
+		break;
+	case AF_INET6:
+		((struct sockaddr_in6 *)new)->sin6_port =
+			existing ? ((struct sockaddr_in6 *)existing)->sin6_port
+			         : 0;
+		break;
+	default:
+		return;
+	}
+
+	*address = new;
+}
+
 /**
  * rpc_create - create an RPC client and transport with one call
  * @args: rpc_clnt create argument structure
@@ -522,6 +560,8 @@ struct rpc_clnt *rpc_create(struct rpc_create_args *args)
 		.servername = args->servername,
 		.bc_xprt = args->bc_xprt,
 	};
+	struct rpc_clnt_portgroup_iter iter_localports = { .pg = args->localports };
+	struct rpc_clnt_portgroup_iter iter_remoteports = { .pg = args->remoteports };
 	char servername[48];
 	struct rpc_clnt *clnt;
 	int i;
@@ -573,6 +613,10 @@ struct rpc_clnt *rpc_create(struct rpc_create_args *args)
 		xprtargs.servername = servername;
 	}
 
+	/* If localports or remoteports are specified, first entry overrides */
+	take_iter_portgroup_addr(&iter_localports, &xprtargs.srcaddr);
+	take_iter_portgroup_addr(&iter_remoteports, &xprtargs.dstaddr);
+
 	xprt = xprt_create_transport(&xprtargs);
 	if (IS_ERR(xprt))
 		return (struct rpc_clnt *)xprt;
@@ -595,6 +639,9 @@ struct rpc_clnt *rpc_create(struct rpc_create_args *args)
 		return clnt;
 
 	for (i = 0; i < args->nconnect - 1; i++) {
+		take_iter_portgroup_addr(&iter_localports, &xprtargs.srcaddr);
+		take_iter_portgroup_addr(&iter_remoteports, &xprtargs.dstaddr);
+
 		if (rpc_clnt_add_xprt(clnt, &xprtargs, NULL, NULL) < 0)
 			break;
 	}
-- 
2.26.2



* [PATCH v1 2/5] xprtrdma: Bind to a local address if requested
From: Dan Aloni @ 2021-01-21 19:10 UTC (permalink / raw)
  To: linux-nfs; +Cc: Trond Myklebust, Anna Schumaker

Until now, rpcrdma did not make use of the local address when binding a
QP. On setups where the local machine has multiple IP addresses
connected to the same subnet, it may be desirable to control which
interface a QP goes out on.

Signed-off-by: Dan Aloni <dan@kernelim.com>
---
 net/sunrpc/xprtrdma/transport.c |  7 +++++++
 net/sunrpc/xprtrdma/verbs.c     | 15 ++++++++++++++-
 net/sunrpc/xprtrdma/xprt_rdma.h |  2 ++
 3 files changed, 23 insertions(+), 1 deletion(-)

diff --git a/net/sunrpc/xprtrdma/transport.c b/net/sunrpc/xprtrdma/transport.c
index 78d29d1bcc20..45726ab5f13a 100644
--- a/net/sunrpc/xprtrdma/transport.c
+++ b/net/sunrpc/xprtrdma/transport.c
@@ -355,6 +355,13 @@ xprt_setup_rdma(struct xprt_create *args)
 	xprt_rdma_format_addresses(xprt, sap);
 
 	new_xprt = rpcx_to_rdmax(xprt);
+
+	/* Copy source address if specified */
+	if (args->srcaddr) {
+		new_xprt->rx_has_srcaddr = true;
+		memcpy(&new_xprt->rx_saddr, args->srcaddr, args->addrlen);
+	}
+
 	rc = rpcrdma_buffer_create(new_xprt);
 	if (rc) {
 		xprt_rdma_free_addresses(xprt);
diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index ec912cf9c618..64476161cf92 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -314,6 +314,7 @@ static struct rdma_cm_id *rpcrdma_create_id(struct rpcrdma_xprt *r_xprt,
 	unsigned long wtimeout = msecs_to_jiffies(RDMA_RESOLVE_TIMEOUT) + 1;
 	struct rpc_xprt *xprt = &r_xprt->rx_xprt;
 	struct rdma_cm_id *id;
+	struct sockaddr *saddr = NULL;
 	int rc;
 
 	init_completion(&ep->re_done);
@@ -324,7 +325,19 @@ static struct rdma_cm_id *rpcrdma_create_id(struct rpcrdma_xprt *r_xprt,
 		return id;
 
 	ep->re_async_rc = -ETIMEDOUT;
-	rc = rdma_resolve_addr(id, NULL, (struct sockaddr *)&xprt->addr,
+	if (r_xprt->rx_has_srcaddr) {
+		char buf[0x50] = {0, };
+		saddr = (struct sockaddr *)&r_xprt->rx_saddr;
+
+		rpc_ntop(saddr, buf, sizeof(buf));
+
+		dprintk("xprt=%p, source address port %s\n", xprt, buf);
+	}
+
+	dprintk("xprt=%p, %s:%s resolving addr\n", xprt,
+		rpcrdma_addrstr(r_xprt), rpcrdma_portstr(r_xprt));
+
+	rc = rdma_resolve_addr(id, saddr, (struct sockaddr *)&xprt->addr,
 			       RDMA_RESOLVE_TIMEOUT);
 	if (rc)
 		goto out;
diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
index 94b28657aeeb..cb4539d4740a 100644
--- a/net/sunrpc/xprtrdma/xprt_rdma.h
+++ b/net/sunrpc/xprtrdma/xprt_rdma.h
@@ -421,6 +421,8 @@ struct rpcrdma_stats {
  */
 struct rpcrdma_xprt {
 	struct rpc_xprt		rx_xprt;
+	struct sockaddr_storage rx_saddr;
+	bool			rx_has_srcaddr;
 	struct rpcrdma_ep	*rx_ep;
 	struct rpcrdma_buffer	rx_buf;
 	struct delayed_work	rx_connect_worker;
-- 
2.26.2



* [PATCH v1 3/5] nfs: Extend nconnect with remoteports and localports mount params
From: Dan Aloni @ 2021-01-21 19:10 UTC (permalink / raw)
  To: linux-nfs; +Cc: Trond Myklebust, Anna Schumaker

The newly added mount parameters allow passing a vector of IP addresses
to be used by the extra transports that nconnect creates. The
remoteports parameter provides the destination addresses, and localports
specifies local addresses to bind to.

Signed-off-by: Dan Aloni <dan@kernelim.com>
---
 fs/nfs/client.c           |  24 ++++++
 fs/nfs/fs_context.c       | 171 ++++++++++++++++++++++++++++++++++++++
 fs/nfs/internal.h         |   4 +
 include/linux/nfs_fs_sb.h |   2 +
 4 files changed, 201 insertions(+)

diff --git a/fs/nfs/client.c b/fs/nfs/client.c
index ff5c4d0d6d13..3560817ab5c4 100644
--- a/fs/nfs/client.c
+++ b/fs/nfs/client.c
@@ -166,6 +166,18 @@ struct nfs_client *nfs_alloc_client(const struct nfs_client_initdata *cl_init)
 
 	memcpy(&clp->cl_addr, cl_init->addr, cl_init->addrlen);
 	clp->cl_addrlen = cl_init->addrlen;
+	if (cl_init->localports) {
+		clp->cl_localports = vmalloc(sizeof(*cl_init->localports));
+		if (!clp->cl_localports)
+			goto error_cleanup;
+		*clp->cl_localports = *cl_init->localports;
+	}
+	if (cl_init->remoteports) {
+		clp->cl_remoteports = vmalloc(sizeof(*cl_init->remoteports));
+		if (!clp->cl_remoteports)
+			goto error_cleanup;
+		*clp->cl_remoteports = *cl_init->remoteports;
+	}
 
 	if (cl_init->hostname) {
 		err = -ENOMEM;
@@ -187,6 +199,10 @@ struct nfs_client *nfs_alloc_client(const struct nfs_client_initdata *cl_init)
 	return clp;
 
 error_cleanup:
+	if (clp->cl_remoteports)
+		vfree(clp->cl_remoteports);
+	if (clp->cl_localports)
+		vfree(clp->cl_localports);
 	put_nfs_version(clp->cl_nfs_mod);
 error_dealloc:
 	kfree(clp);
@@ -245,6 +261,10 @@ void nfs_free_client(struct nfs_client *clp)
 
 	put_net(clp->cl_net);
 	put_nfs_version(clp->cl_nfs_mod);
+	if (clp->cl_localports)
+		vfree(clp->cl_localports);
+	if (clp->cl_remoteports)
+		vfree(clp->cl_remoteports);
 	kfree(clp->cl_hostname);
 	kfree(clp->cl_acceptor);
 	kfree(clp);
@@ -508,6 +528,8 @@ int nfs_create_rpc_client(struct nfs_client *clp,
 		.nconnect	= clp->cl_nconnect,
 		.address	= (struct sockaddr *)&clp->cl_addr,
 		.addrsize	= clp->cl_addrlen,
+		.localports	= clp->cl_localports,
+		.remoteports	= clp->cl_remoteports,
 		.timeout	= cl_init->timeparms,
 		.servername	= clp->cl_hostname,
 		.nodename	= cl_init->nodename,
@@ -678,6 +700,8 @@ static int nfs_init_server(struct nfs_server *server,
 		.timeparms = &timeparms,
 		.cred = server->cred,
 		.nconnect = ctx->nfs_server.nconnect,
+		.localports = ctx->localports,
+		.remoteports = ctx->remoteports,
 		.init_flags = (1UL << NFS_CS_REUSEPORT),
 	};
 	struct nfs_client *clp;
diff --git a/fs/nfs/fs_context.c b/fs/nfs/fs_context.c
index 06894bcdea2d..3d41ba61b26d 100644
--- a/fs/nfs/fs_context.c
+++ b/fs/nfs/fs_context.c
@@ -49,6 +49,7 @@ enum nfs_param {
 	Opt_hard,
 	Opt_intr,
 	Opt_local_lock,
+	Opt_localports,
 	Opt_lock,
 	Opt_lookupcache,
 	Opt_migration,
@@ -65,6 +66,7 @@ enum nfs_param {
 	Opt_proto,
 	Opt_rdirplus,
 	Opt_rdma,
+	Opt_remoteports,
 	Opt_resvport,
 	Opt_retrans,
 	Opt_retry,
@@ -134,6 +136,7 @@ static const struct fs_parameter_spec nfs_fs_parameters[] = {
 		  fs_param_neg_with_no|fs_param_deprecated, NULL),
 	fsparam_enum  ("local_lock",	Opt_local_lock, nfs_param_enums_local_lock),
 	fsparam_flag_no("lock",		Opt_lock),
+	fsparam_string("localports",	Opt_localports),
 	fsparam_enum  ("lookupcache",	Opt_lookupcache, nfs_param_enums_lookupcache),
 	fsparam_flag_no("migration",	Opt_migration),
 	fsparam_u32   ("minorversion",	Opt_minorversion),
@@ -150,6 +153,7 @@ static const struct fs_parameter_spec nfs_fs_parameters[] = {
 	fsparam_string("proto",		Opt_proto),
 	fsparam_flag_no("rdirplus",	Opt_rdirplus),
 	fsparam_flag  ("rdma",		Opt_rdma),
+	fsparam_string("remoteports",	Opt_remoteports),
 	fsparam_flag_no("resvport",	Opt_resvport),
 	fsparam_u32   ("retrans",	Opt_retrans),
 	fsparam_string("retry",		Opt_retry),
@@ -430,6 +434,146 @@ static int nfs_parse_version_string(struct fs_context *fc,
 	return 0;
 }
 
+static int nfs_portgroup_add_parsed(struct fs_context *fc, struct rpc_portgroup *pg,
+				     const char *type, struct sockaddr_storage *addr,
+				     int len)
+{
+	if (pg->nr >= RPC_MAX_PORTS) {
+		nfs_invalf(fc, "NFS: portgroup for %s is too large, reached %d items",
+			   type, pg->nr);
+		return -ENOSPC;
+	}
+
+	if (pg->nr > 0 && pg->addrs[0].ss_family != addr->ss_family) {
+		nfs_invalf(fc, "NFS: all portgroup addresses must be of the same family");
+		return -EINVAL;
+	}
+
+	pg->addrs[pg->nr] = *addr;
+	pg->nr++;
+
+	return 0;
+}
+
+/*
+ * Parse a single address and add to portgroup.
+ */
+static int nfs_portgroup_add_single(struct fs_context *fc, struct rpc_portgroup *pg,
+				     const char *type, const char *single)
+{
+	struct sockaddr_storage addr;
+	size_t len = rpc_pton(fc->net_ns, single, strlen(single),
+			      (struct sockaddr *)&addr, sizeof(addr));
+
+	if (len == 0) {
+		nfs_invalf(fc, "NFS: portgroup for %s, unable to parse address %s",
+			   type, single);
+		return -EINVAL;
+	}
+
+	return nfs_portgroup_add_parsed(fc, pg, type, &addr, len);
+}
+
+/*
+ * Parse and add a portgroup address range. This is an inclusive address range
+ * that is delimited by '-', e.g. '192.168.0.1-192.168.0.16'.
+ */
+static int nfs_portgroup_add_range(struct fs_context *fc, struct rpc_portgroup *pg,
+				    const char *type, const char *begin, const char *end)
+{
+	struct sockaddr_storage addr;
+	struct sockaddr_storage end_addr;
+	int ret;
+	size_t len = rpc_pton(fc->net_ns, begin, strlen(begin),
+			      (struct sockaddr *)&addr, sizeof(addr)), end_len;
+
+	if (len == 0) {
+		nfs_invalf(fc, "NFS: portgroup for %s, unable to parse address %s",
+			   type, begin);
+		return -EINVAL;
+	}
+
+	end_len = rpc_pton(fc->net_ns, end, strlen(end),
+			   (struct sockaddr *)&end_addr, sizeof(end_addr));
+
+	if (end_len == 0) {
+		nfs_invalf(fc, "NFS: portgroup for %s, unable to parse address %s",
+			   type, end);
+		return -EINVAL;
+	}
+
+	while (0 == (ret = nfs_portgroup_add_parsed(fc, pg, type, &addr, len))) {
+		/* Check if end of range reached */
+		if (rpc_cmp_addr((const struct sockaddr *)&addr,
+				 (const struct sockaddr *)&end_addr))
+			break;
+
+		/* Bump address by one */
+		switch (addr.ss_family) {
+		case AF_INET: {
+			struct sockaddr_in *sin1 = (struct sockaddr_in *)&addr;
+			sin1->sin_addr.s_addr = htonl(ntohl(sin1->sin_addr.s_addr) + 1);
+			break;
+		}
+		case AF_INET6: {
+			nfs_invalf(fc, "NFS: IPv6 in address ranges not supported");
+			return -ENOTSUPP;
+		}
+		default:
+			nfs_invalf(fc, "NFS: address family %d not supported in ranges",
+				   addr.ss_family);
+			return -ENOTSUPP;
+		}
+	}
+
+	return ret;
+}
+
+/*
+ * Parse and add a portgroup string. These are `~`-delimited single addresses
+ * or groups of addresses. An inclusive address range can be specified with '-'
+ * instead of specifying a single address.
+ */
+static int nfs_parse_portgroup(char *string, struct fs_context *fc,
+				struct rpc_portgroup **pg_out, const char *type)
+{
+	struct rpc_portgroup *pg = NULL;
+	char *string_scan = string, *item;
+	int ret;
+
+	if (!*pg_out) {
+		pg = vmalloc(sizeof(*pg));
+		if (!pg)
+			return -ENOMEM;
+
+		memset(pg, 0, sizeof(*pg));
+		*pg_out = pg;
+	} else {
+		pg = *pg_out;
+	}
+
+	while ((item = strsep(&string_scan, "~")) != NULL) {
+		const char *range_sep = strchr(item, '-');
+
+		if (range_sep != NULL) {
+			const char *range_start = strsep(&item, "-");
+			BUG_ON(range_start == NULL || item == NULL);
+			ret = nfs_portgroup_add_range(fc, pg, type,
+						       range_start, item);
+		} else {
+			ret = nfs_portgroup_add_single(fc, pg, type, item);
+		}
+
+		if (ret)
+			return ret;
+	}
+
+	if (pg->nr == 0)
+		return nfs_invalf(fc, "NFS: passed empty portgroup is invalid");
+
+	return 0;
+}
+
 /*
  * Parse a single mount parameter.
  */
@@ -770,6 +914,24 @@ static int nfs_fs_context_parse_param(struct fs_context *fc,
 			goto out_invalid_value;
 		}
 		break;
+	case Opt_localports:
+		ret = nfs_parse_portgroup(param->string, fc, &ctx->localports, "local");
+
+		switch (ret) {
+		case -ENOMEM: goto out_nomem;
+		case -ENOSPC: goto out_portgroup_too_large;
+		case -EINVAL: goto out_invalid_address;
+		}
+		break;
+	case Opt_remoteports:
+		ret = nfs_parse_portgroup(param->string, fc, &ctx->remoteports, "remote");
+
+		switch (ret) {
+		case -ENOMEM: goto out_nomem;
+		case -ENOSPC: goto out_portgroup_too_large;
+		case -EINVAL: goto out_invalid_address;
+		}
+		break;
 
 		/*
 		 * Special options
@@ -782,6 +944,9 @@ static int nfs_fs_context_parse_param(struct fs_context *fc,
 
 	return 0;
 
+out_nomem:
+	nfs_errorf(fc, "NFS: not enough memory to parse device name");
+	return -ENOMEM;
 out_invalid_value:
 	return nfs_invalf(fc, "NFS: Bad mount option value specified");
 out_invalid_address:
@@ -790,6 +955,8 @@ static int nfs_fs_context_parse_param(struct fs_context *fc,
 	return nfs_invalf(fc, "NFS: Value for '%s' out of range", param->key);
 out_bad_transport:
 	return nfs_invalf(fc, "NFS: Unrecognized transport protocol");
+out_portgroup_too_large:
+	return -EINVAL;
 }
 
 /*
@@ -1394,6 +1561,10 @@ static void nfs_fs_context_free(struct fs_context *fc)
 		if (ctx->nfs_mod)
 			put_nfs_version(ctx->nfs_mod);
 		kfree(ctx->client_address);
+		if (ctx->localports)
+			vfree(ctx->localports);
+		if (ctx->remoteports)
+			vfree(ctx->remoteports);
 		kfree(ctx->mount_server.hostname);
 		kfree(ctx->nfs_server.export_path);
 		kfree(ctx->nfs_server.hostname);
diff --git a/fs/nfs/internal.h b/fs/nfs/internal.h
index 62d3189745cd..8efdbd896b77 100644
--- a/fs/nfs/internal.h
+++ b/fs/nfs/internal.h
@@ -63,6 +63,8 @@ struct nfs_client_initdata {
 	const char *nodename;			/* Hostname of the client */
 	const char *ip_addr;			/* IP address of the client */
 	size_t addrlen;
+	struct rpc_portgroup *localports;      /* Local addresses to bind */
+	struct rpc_portgroup *remoteports;     /* Remote server addresses */
 	struct nfs_subversion *nfs_mod;
 	int proto;
 	u32 minorversion;
@@ -96,6 +98,8 @@ struct nfs_fs_context {
 	char			*fscache_uniq;
 	unsigned short		protofamily;
 	unsigned short		mountfamily;
+	struct rpc_portgroup	*localports;
+	struct rpc_portgroup	*remoteports;
 
 	struct {
 		union {
diff --git a/include/linux/nfs_fs_sb.h b/include/linux/nfs_fs_sb.h
index 38e60ec742df..33fd23068546 100644
--- a/include/linux/nfs_fs_sb.h
+++ b/include/linux/nfs_fs_sb.h
@@ -50,6 +50,8 @@ struct nfs_client {
 #define NFS_CS_REUSEPORT	8		/* - reuse src port on reconnect */
 	struct sockaddr_storage	cl_addr;	/* server identifier */
 	size_t			cl_addrlen;
+	struct rpc_portgroup *  cl_localports;  /* Local addresses to bind */
+	struct rpc_portgroup *  cl_remoteports; /* Remote server addresses */
 	char *			cl_hostname;	/* hostname of server */
 	char *			cl_acceptor;	/* GSSAPI acceptor name */
 	struct list_head	cl_share_link;	/* link in global client list */
-- 
2.26.2



* [PATCH v1 4/5] sunrpc: Add srcaddr to xprt sysfs debug
From: Dan Aloni @ 2021-01-21 19:10 UTC (permalink / raw)
  To: linux-nfs; +Cc: Trond Myklebust, Anna Schumaker

To ease inspection of transport state when nconnect is combined with the
NFS remoteports and localports options, expose the source address used
by each established transport in the per-xprt debugfs info file.
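Going by the updated xprt_info_show() format strings, the info file
would then read roughly as follows (the path and the address values are
illustrative):

```text
# cat /sys/kernel/debug/sunrpc/rpc_xprt/0/info
netid:   rdma
addr:    10.0.0.1
port:    20049
srcaddr: 10.0.1.1
state: 0x11
```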

Signed-off-by: Dan Aloni <dan@kernelim.com>
---
 include/linux/sunrpc/xprt.h                |  1 +
 net/sunrpc/debugfs.c                       |  8 ++--
 net/sunrpc/xprtrdma/svc_rdma_backchannel.c |  2 +-
 net/sunrpc/xprtrdma/transport.c            | 10 ++++-
 net/sunrpc/xprtrdma/xprt_rdma.h            |  3 +-
 net/sunrpc/xprtsock.c                      | 49 ++++++++++++++--------
 6 files changed, 48 insertions(+), 25 deletions(-)

diff --git a/include/linux/sunrpc/xprt.h b/include/linux/sunrpc/xprt.h
index d2e97ee802af..2bdf9e806964 100644
--- a/include/linux/sunrpc/xprt.h
+++ b/include/linux/sunrpc/xprt.h
@@ -43,6 +43,7 @@ struct rpc_timeout {
 
 enum rpc_display_format_t {
 	RPC_DISPLAY_ADDR = 0,
+	RPC_DISPLAY_SRC_ADDR,
 	RPC_DISPLAY_PORT,
 	RPC_DISPLAY_PROTO,
 	RPC_DISPLAY_HEX_ADDR,
diff --git a/net/sunrpc/debugfs.c b/net/sunrpc/debugfs.c
index 56029e3af6ff..4c65b64077f7 100644
--- a/net/sunrpc/debugfs.c
+++ b/net/sunrpc/debugfs.c
@@ -175,9 +175,11 @@ xprt_info_show(struct seq_file *f, void *v)
 {
 	struct rpc_xprt *xprt = f->private;
 
-	seq_printf(f, "netid: %s\n", xprt->address_strings[RPC_DISPLAY_NETID]);
-	seq_printf(f, "addr:  %s\n", xprt->address_strings[RPC_DISPLAY_ADDR]);
-	seq_printf(f, "port:  %s\n", xprt->address_strings[RPC_DISPLAY_PORT]);
+	seq_printf(f, "netid:   %s\n", xprt->address_strings[RPC_DISPLAY_NETID]);
+	seq_printf(f, "addr:    %s\n", xprt->address_strings[RPC_DISPLAY_ADDR]);
+	seq_printf(f, "port:    %s\n", xprt->address_strings[RPC_DISPLAY_PORT]);
+	seq_printf(f, "srcaddr: %s\n", xprt->address_strings[RPC_DISPLAY_SRC_ADDR]
+		   ? xprt->address_strings[RPC_DISPLAY_SRC_ADDR] : "");
 	seq_printf(f, "state: 0x%lx\n", xprt->state);
 	return 0;
 }
diff --git a/net/sunrpc/xprtrdma/svc_rdma_backchannel.c b/net/sunrpc/xprtrdma/svc_rdma_backchannel.c
index 63f8be974df2..93e4dcb1a807 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_backchannel.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_backchannel.c
@@ -261,7 +261,7 @@ xprt_setup_rdma_bc(struct xprt_create *args)
 
 	memcpy(&xprt->addr, args->dstaddr, args->addrlen);
 	xprt->addrlen = args->addrlen;
-	xprt_rdma_format_addresses(xprt, (struct sockaddr *)&xprt->addr);
+	xprt_rdma_format_addresses(xprt, (struct sockaddr *)&xprt->addr, NULL);
 	xprt->resvport = 0;
 
 	xprt->max_payload = xprt_rdma_max_inline_read;
diff --git a/net/sunrpc/xprtrdma/transport.c b/net/sunrpc/xprtrdma/transport.c
index 45726ab5f13a..0d1cce310f13 100644
--- a/net/sunrpc/xprtrdma/transport.c
+++ b/net/sunrpc/xprtrdma/transport.c
@@ -181,7 +181,8 @@ xprt_rdma_format_addresses6(struct rpc_xprt *xprt, struct sockaddr *sap)
 }
 
 void
-xprt_rdma_format_addresses(struct rpc_xprt *xprt, struct sockaddr *sap)
+xprt_rdma_format_addresses(struct rpc_xprt *xprt, struct sockaddr *sap,
+			   struct sockaddr *src_sap)
 {
 	char buf[128];
 
@@ -200,6 +201,11 @@ xprt_rdma_format_addresses(struct rpc_xprt *xprt, struct sockaddr *sap)
 	(void)rpc_ntop(sap, buf, sizeof(buf));
 	xprt->address_strings[RPC_DISPLAY_ADDR] = kstrdup(buf, GFP_KERNEL);
 
+	if (src_sap) {
+		(void)rpc_ntop(src_sap, buf, sizeof(buf));
+		xprt->address_strings[RPC_DISPLAY_SRC_ADDR] = kstrdup(buf, GFP_KERNEL);
+	}
+
 	snprintf(buf, sizeof(buf), "%u", rpc_get_port(sap));
 	xprt->address_strings[RPC_DISPLAY_PORT] = kstrdup(buf, GFP_KERNEL);
 
@@ -352,7 +358,7 @@ xprt_setup_rdma(struct xprt_create *args)
 
 	if (rpc_get_port(sap))
 		xprt_set_bound(xprt);
-	xprt_rdma_format_addresses(xprt, sap);
+	xprt_rdma_format_addresses(xprt, sap, args->srcaddr);
 
 	new_xprt = rpcx_to_rdmax(xprt);
 
diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
index cb4539d4740a..5fc36fdd2056 100644
--- a/net/sunrpc/xprtrdma/xprt_rdma.h
+++ b/net/sunrpc/xprtrdma/xprt_rdma.h
@@ -573,7 +573,8 @@ static inline void rpcrdma_set_xdrlen(struct xdr_buf *xdr, size_t len)
  */
 extern unsigned int xprt_rdma_max_inline_read;
 extern unsigned int xprt_rdma_max_inline_write;
-void xprt_rdma_format_addresses(struct rpc_xprt *xprt, struct sockaddr *sap);
+void xprt_rdma_format_addresses(struct rpc_xprt *xprt, struct sockaddr *sap,
+				struct sockaddr *src_sap);
 void xprt_rdma_free_addresses(struct rpc_xprt *xprt);
 void xprt_rdma_close(struct rpc_xprt *xprt);
 void xprt_rdma_print_stats(struct rpc_xprt *xprt, struct seq_file *seq);
diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
index c56a66cdf4ac..e8ea13f29dfe 100644
--- a/net/sunrpc/xprtsock.c
+++ b/net/sunrpc/xprtsock.c
@@ -229,6 +229,12 @@ static inline struct sockaddr *xs_addr(struct rpc_xprt *xprt)
 	return (struct sockaddr *) &xprt->addr;
 }
 
+static inline struct sockaddr *xs_srcaddr(struct rpc_xprt *xprt)
+{
+	struct sock_xprt *transport = container_of(xprt, struct sock_xprt, xprt);
+	return (struct sockaddr *) &transport->srcaddr;
+}
+
 static inline struct sockaddr_un *xs_addr_un(struct rpc_xprt *xprt)
 {
 	return (struct sockaddr_un *) &xprt->addr;
@@ -244,9 +250,11 @@ static inline struct sockaddr_in6 *xs_addr_in6(struct rpc_xprt *xprt)
 	return (struct sockaddr_in6 *) &xprt->addr;
 }
 
-static void xs_format_common_peer_addresses(struct rpc_xprt *xprt)
+static void xs_format_common_addresses(struct rpc_xprt *xprt,
+			               struct sockaddr *sap,
+			               int display_addr,
+			               int display_hex_addr)
 {
-	struct sockaddr *sap = xs_addr(xprt);
 	struct sockaddr_in6 *sin6;
 	struct sockaddr_in *sin;
 	struct sockaddr_un *sun;
@@ -256,19 +264,19 @@ static void xs_format_common_peer_addresses(struct rpc_xprt *xprt)
 	case AF_LOCAL:
 		sun = xs_addr_un(xprt);
 		strlcpy(buf, sun->sun_path, sizeof(buf));
-		xprt->address_strings[RPC_DISPLAY_ADDR] =
+		xprt->address_strings[display_addr] =
 						kstrdup(buf, GFP_KERNEL);
 		break;
 	case AF_INET:
 		(void)rpc_ntop(sap, buf, sizeof(buf));
-		xprt->address_strings[RPC_DISPLAY_ADDR] =
+		xprt->address_strings[display_addr] =
 						kstrdup(buf, GFP_KERNEL);
 		sin = xs_addr_in(xprt);
 		snprintf(buf, sizeof(buf), "%08x", ntohl(sin->sin_addr.s_addr));
 		break;
 	case AF_INET6:
 		(void)rpc_ntop(sap, buf, sizeof(buf));
-		xprt->address_strings[RPC_DISPLAY_ADDR] =
+		xprt->address_strings[display_addr] =
 						kstrdup(buf, GFP_KERNEL);
 		sin6 = xs_addr_in6(xprt);
 		snprintf(buf, sizeof(buf), "%pi6", &sin6->sin6_addr);
@@ -277,7 +285,8 @@ static void xs_format_common_peer_addresses(struct rpc_xprt *xprt)
 		BUG();
 	}
 
-	xprt->address_strings[RPC_DISPLAY_HEX_ADDR] = kstrdup(buf, GFP_KERNEL);
+	if (display_hex_addr != -1)
+		xprt->address_strings[display_hex_addr] = kstrdup(buf, GFP_KERNEL);
 }
 
 static void xs_format_common_peer_ports(struct rpc_xprt *xprt)
@@ -292,13 +301,17 @@ static void xs_format_common_peer_ports(struct rpc_xprt *xprt)
 	xprt->address_strings[RPC_DISPLAY_HEX_PORT] = kstrdup(buf, GFP_KERNEL);
 }
 
-static void xs_format_peer_addresses(struct rpc_xprt *xprt,
-				     const char *protocol,
-				     const char *netid)
+static void xs_format_addresses(struct rpc_xprt *xprt,
+			        const char *protocol,
+			        const char *netid)
 {
 	xprt->address_strings[RPC_DISPLAY_PROTO] = protocol;
 	xprt->address_strings[RPC_DISPLAY_NETID] = netid;
-	xs_format_common_peer_addresses(xprt);
+	xs_format_common_addresses(xprt, xs_addr(xprt),
+				   RPC_DISPLAY_ADDR, RPC_DISPLAY_HEX_ADDR);
+	if (xs_srcaddr(xprt)->sa_family != 0)
+		xs_format_common_addresses(xprt, xs_srcaddr(xprt),
+					   RPC_DISPLAY_SRC_ADDR, -1);
 	xs_format_common_peer_ports(xprt);
 }
 
@@ -2793,7 +2806,7 @@ static struct rpc_xprt *xs_setup_local(struct xprt_create *args)
 			goto out_err;
 		}
 		xprt_set_bound(xprt);
-		xs_format_peer_addresses(xprt, "local", RPCBIND_NETID_LOCAL);
+		xs_format_addresses(xprt, "local", RPCBIND_NETID_LOCAL);
 		ret = ERR_PTR(xs_local_setup_socket(transport));
 		if (ret)
 			goto out_err;
@@ -2860,13 +2873,13 @@ static struct rpc_xprt *xs_setup_udp(struct xprt_create *args)
 		if (((struct sockaddr_in *)addr)->sin_port != htons(0))
 			xprt_set_bound(xprt);
 
-		xs_format_peer_addresses(xprt, "udp", RPCBIND_NETID_UDP);
+		xs_format_addresses(xprt, "udp", RPCBIND_NETID_UDP);
 		break;
 	case AF_INET6:
 		if (((struct sockaddr_in6 *)addr)->sin6_port != htons(0))
 			xprt_set_bound(xprt);
 
-		xs_format_peer_addresses(xprt, "udp", RPCBIND_NETID_UDP6);
+		xs_format_addresses(xprt, "udp", RPCBIND_NETID_UDP6);
 		break;
 	default:
 		ret = ERR_PTR(-EAFNOSUPPORT);
@@ -2942,13 +2955,13 @@ static struct rpc_xprt *xs_setup_tcp(struct xprt_create *args)
 		if (((struct sockaddr_in *)addr)->sin_port != htons(0))
 			xprt_set_bound(xprt);
 
-		xs_format_peer_addresses(xprt, "tcp", RPCBIND_NETID_TCP);
+		xs_format_addresses(xprt, "tcp", RPCBIND_NETID_TCP);
 		break;
 	case AF_INET6:
 		if (((struct sockaddr_in6 *)addr)->sin6_port != htons(0))
 			xprt_set_bound(xprt);
 
-		xs_format_peer_addresses(xprt, "tcp", RPCBIND_NETID_TCP6);
+		xs_format_addresses(xprt, "tcp", RPCBIND_NETID_TCP6);
 		break;
 	default:
 		ret = ERR_PTR(-EAFNOSUPPORT);
@@ -3006,11 +3019,11 @@ static struct rpc_xprt *xs_setup_bc_tcp(struct xprt_create *args)
 
 	switch (addr->sa_family) {
 	case AF_INET:
-		xs_format_peer_addresses(xprt, "tcp",
-					 RPCBIND_NETID_TCP);
+		xs_format_addresses(xprt, "tcp",
+				   RPCBIND_NETID_TCP);
 		break;
 	case AF_INET6:
-		xs_format_peer_addresses(xprt, "tcp",
+		xs_format_addresses(xprt, "tcp",
 				   RPCBIND_NETID_TCP6);
 		break;
 	default:
-- 
2.26.2



* [PATCH v1 5/5] nfs: Increase NFS_MAX_CONNECTIONS
  2021-01-21 19:10 [PATCH v1 0/5] NFSv3 client RDMA multipath enhancements Dan Aloni
                   ` (3 preceding siblings ...)
  2021-01-21 19:10 ` [PATCH v1 4/5] sunrpc: Add srcaddr to xprt sysfs debug Dan Aloni
@ 2021-01-21 19:10 ` Dan Aloni
  2021-01-21 19:50 ` [PATCH v1 0/5] NFSv3 client RDMA multipath enhancements Chuck Lever
  5 siblings, 0 replies; 8+ messages in thread
From: Dan Aloni @ 2021-01-21 19:10 UTC (permalink / raw)
  To: linux-nfs; +Cc: Trond Myklebust, Anna Schumaker

This raises NFS_MAX_CONNECTIONS to match the new RPC_MAX_PORTS value.

Signed-off-by: Dan Aloni <dan@kernelim.com>
---
 fs/nfs/fs_context.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/nfs/fs_context.c b/fs/nfs/fs_context.c
index 3d41ba61b26d..38682c99e684 100644
--- a/fs/nfs/fs_context.c
+++ b/fs/nfs/fs_context.c
@@ -28,7 +28,7 @@
 #define NFS_DEFAULT_VERSION 2
 #endif
 
-#define NFS_MAX_CONNECTIONS 16
+#define NFS_MAX_CONNECTIONS RPC_MAX_PORTS
 
 enum nfs_param {
 	Opt_ac,
-- 
2.26.2



* Re: [PATCH v1 0/5] NFSv3 client RDMA multipath enhancements
  2021-01-21 19:10 [PATCH v1 0/5] NFSv3 client RDMA multipath enhancements Dan Aloni
                   ` (4 preceding siblings ...)
  2021-01-21 19:10 ` [PATCH v1 5/5] nfs: Increase NFS_MAX_CONNECTIONS Dan Aloni
@ 2021-01-21 19:50 ` Chuck Lever
  2021-01-24 17:37   ` Dan Aloni
  5 siblings, 1 reply; 8+ messages in thread
From: Chuck Lever @ 2021-01-21 19:50 UTC (permalink / raw)
  To: Dan Aloni; +Cc: Linux NFS Mailing List, Trond Myklebust, Anna Schumaker

Hey Dan-


First, thanks for posting patches!


> On Jan 21, 2021, at 2:10 PM, Dan Aloni <dan@kernelim.com> wrote:
> 
> Hi,
> 
> The purpose of the following changes is to allow specifying multiple
> target IP addresses in a single mount. Combining this with nconnect and
> servers that support exposing multiple ports,

"port" is probably a bad term to use here, as that term already
has a particular meaning when it comes to IP addresses. In
standards documents, we've stuck with the term "endpoint".

I worked with the IETF's nfsv4 WG a couple years ago to produce
a document that describes how we want NFS servers to advertise
their network configuration to clients.

https://datatracker.ietf.org/doc/rfc8587/

That gives a flavor for what we've done for NFSv4. IMO anything
done for NFSv3 ought to leverage similar principles and tactics.


> we can achieve load
> balancing and much greater throughput, especially on RDMA setups,
> even with the older NFSv3 protocol.

I support the basic goal of increasing transport parallelism.

As you probably became aware as you worked on these patches, the
Linux client shares one or a small set of connections across all
mount points of the same server. So a mount option that adds this
kind of control is going to be awkward.

Anna has proposed a /sys API that would enable this information to
be programmed into the kernel for all mount points sharing the
same set of connections. That would be a little nicer for building
separate administrator tools against, or even for providing an
automation mechanism (like an orchestrator) that would enable
clients to automatically fail over to a different server interface.

IMO I'd prefer to see a user space policy / tool that manages
endpoint lists and passes them to the kernel client dynamically
via Anna's API instead of adding one or more mount options, which
would be fixed for the life of the mount and shared with other
mount points that use the same transports to communicate with
the NFS server.
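
To make that concrete, here is a rough sketch of the flow I have in
mind. To be clear: the sysfs path and write format below are invented
purely for illustration, and Anna's actual interface may look quite
different. The point is only the shape of the mechanism: a userspace
policy tool, not a mount option, decides which server endpoints the
client's shared transports use, and can change them at runtime.

```shell
# Sketch only: the sysfs layout below is hypothetical; the real
# interface may differ.  A policy daemon discovers healthy server
# interfaces and programs them into the shared transport switch.
XPRT_DIR=/sys/kernel/sunrpc/xprt-switches/switch-0   # invented path

list_endpoints() {
    # A monitoring tool would discover healthy server interfaces here.
    echo "10.1.1.1 10.1.1.2"
}

apply_endpoints() {
    dir=$1; shift
    for ep in "$@"; do
        # Dry-run: print what a real tool would write into sysfs.
        echo "would write: echo add $ep > $dir/endpoints"
    done
}

apply_endpoints "$XPRT_DIR" $(list_endpoints)
```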


As far as the NUMA affinity issues go, in the past I've attempted
to provide some degree of CPU affinity between RPC Call and Reply
handling only to find that it reduced performance unacceptably.
Perhaps something that is node-aware or LLC-aware would be better
than CPU affinity, and I'm happy to discuss that and any other
ways we think can improve NFS behavior on NUMA systems. It's quite
true that RDMA transports are more sensitive to NUMA than
traditional socket-based ones.


> The changes allow specifying a new `remoteports=<IP-addresses-ranges>`
> mount option providing a group of IP addresses, from which `nconnect` at
> sunrpc scope picks target transport addresses in round-robin fashion.
> There's also an accompanying `localports` parameter that allows a local
> address bind, so that the source port is better controlled and
> transports do not end up hogging a single local interface.
> 
> This patchset targets the linux-next tree.
> 
> Dan Aloni (5):
>  sunrpc: Allow specifying a vector of IP addresses for nconnect
>  xprtrdma: Bind to a local address if requested
>  nfs: Extend nconnect with remoteports and localports mount params
>  sunrpc: Add srcaddr to xprt sysfs debug
>  nfs: Increase NFS_MAX_CONNECTIONS
> 
> fs/nfs/client.c                            |  24 +++
> fs/nfs/fs_context.c                        | 173 ++++++++++++++++++++-
> fs/nfs/internal.h                          |   4 +
> include/linux/nfs_fs_sb.h                  |   2 +
> include/linux/sunrpc/clnt.h                |   9 ++
> include/linux/sunrpc/xprt.h                |   1 +
> net/sunrpc/clnt.c                          |  47 ++++++
> net/sunrpc/debugfs.c                       |   8 +-
> net/sunrpc/xprtrdma/svc_rdma_backchannel.c |   2 +-
> net/sunrpc/xprtrdma/transport.c            |  17 +-
> net/sunrpc/xprtrdma/verbs.c                |  15 +-
> net/sunrpc/xprtrdma/xprt_rdma.h            |   5 +-
> net/sunrpc/xprtsock.c                      |  49 +++---
> 13 files changed, 329 insertions(+), 27 deletions(-)
> 
> -- 
> 2.26.2
> 

--
Chuck Lever





* Re: [PATCH v1 0/5] NFSv3 client RDMA multipath enhancements
  2021-01-21 19:50 ` [PATCH v1 0/5] NFSv3 client RDMA multipath enhancements Chuck Lever
@ 2021-01-24 17:37   ` Dan Aloni
  0 siblings, 0 replies; 8+ messages in thread
From: Dan Aloni @ 2021-01-24 17:37 UTC (permalink / raw)
  To: Chuck Lever; +Cc: Linux NFS Mailing List, Trond Myklebust, Anna Schumaker

On Thu, Jan 21, 2021 at 07:50:41PM +0000, Chuck Lever wrote:
> I worked with the IETF's nfsv4 WG a couple years ago to produce
> a document that describes how we want NFS servers to advertise
> their network configuration to clients.
> 
> https://datatracker.ietf.org/doc/rfc8587/
> 
> That gives a flavor for what we've done for NFSv4. IMO anything
> done for NFSv3 ought to leverage similar principles and tactics.
 
Thanks for the pointer - I'll read and take it into consideration.

> > we can achieve load
> > balancing and much greater throughput, especially on RDMA setups,
> > even with the older NFSv3 protocol.
> 
> I support the basic goal of increasing transport parallelism.
> 
> As you probably became aware as you worked on these patches, the
> Linux client shares one or a small set of connections across all
> mount points of the same server. So a mount option that adds this
> kind of control is going to be awkward.

I tend to agree, from a developer perspective. But to give an idea from
an admin POV: it is often not immediately apparent that this sharing is
what happens behind the scenes (i.e. the `nfs_match_client` function),
so in our case users have not reported back that our addition to the
mount parameters looked weird. They saw it as a natural extension of
nconnect, which I think falls under similar considerations, in that it
also pins down how transports should behave at mount time rather than
afterwards, regardless of what NFS sessions actually get established.
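
For the sake of discussion, a mount with the proposed options would look
roughly like the below. Treat it as illustrative: the exact
address-range grammar is defined in patch 3/5, and the loop is only my
simplified model of how the sunrpc layer spreads `nconnect` transports
round-robin over the `remoteports` list.

```shell
# Illustrative invocation (range syntax approximated from the cover letter):
#
#   mount -t nfs -o vers=3,proto=rdma,nconnect=6,remoteports=10.1.1.1-10.1.1.4 \
#         srv:/export /mnt
#
# Simplified model: with nconnect=6 and four remote addresses, the six
# transports are assigned target addresses round-robin.
addrs="10.1.1.1 10.1.1.2 10.1.1.3 10.1.1.4"
nconnect=6
set -- $addrs
n=$#
i=0
while [ "$i" -lt "$nconnect" ]; do
    idx=$(( i % n + 1 ))
    eval "addr=\${$idx}"
    echo "xprt$i -> $addr"
    i=$(( i + 1 ))
done
```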

Surely there may be better ways to do this, following from what's
discussed next.

> Anna has proposed a /sys API that would enable this information to
> be programmed into the kernel for all mount points sharing the
> same set of connections. That would be a little nicer for building
> separate administrator tools against, or even for providing an
> automation mechanism (like an orchestrator) that would enable
> clients to automatically fail over to a different server interface.
>
> IMO I'd prefer to see a user space policy / tool that manages
> endpoint lists and passes them to the kernel client dynamically
> via Anna's API instead of adding one or more mount options, which
> would be fixed for the life of the mount and shared with other
> mount points that use the same transports to communicate with
> the NFS server.

I see now that these are fairly recent patches that I unfortunately
missed while working on other things. If this is the intended API for
managing active NFS sessions, I would very much like to help test and
extend this code.

So a good way forward would be to look into supporting an 'add
transport' op by extending the new interface, and optionally into
specifying a local address bind, similar to the work I've done for the
mount options.

I'll also be glad to contribute to nfs-utils so that we'd have the
anticipated userspace tool, maybe 'nfs' (like `/sbin/ip` from iproute),
that can be executed for this purpose, e.g. 'nfs transport add <IP> mnt
<PATH>'.
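
To sketch the UX a bit further (every verb and name below is invented
here; none of this exists in nfs-utils yet), along with a trivial
argument-handling skeleton such a tool could start from:

```shell
# Hypothetical command set, verb style modeled on iproute2:
#
#   nfs transport add  10.1.1.5 mnt /mnt/data   # attach an extra xprt
#   nfs transport del  10.1.1.5 mnt /mnt/data   # drop it again
#   nfs transport list          mnt /mnt/data   # show active xprts
#
# Skeleton dispatcher: parse the verb, then the address/mount arguments.
cmd_transport() {
    verb=$1; shift
    case $verb in
        add|del) addr=$1; mnt=$3; echo "$verb $addr on $mnt" ;;
        list)    mnt=$2;          echo "list xprts on $mnt" ;;
        *)       echo "usage: nfs transport {add|del|list} ..." >&2; return 1 ;;
    esac
}

cmd_transport add 10.1.1.5 mnt /mnt/data
```

The verb-oriented style has the advantage that admins already know it
from iproute2.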

Also, from a lower level API perspective, we would need a way to figure
out the client ID from a mount point, so that the ID can be used at the
relevant sysfs directory. Perhaps this could be done via a new ioctl on
the mount point itself?

> As far as the NUMA affinity issues go, in the past I've attempted
> to provide some degree of CPU affinity between RPC Call and Reply
> handling only to find that it reduced performance unacceptably.
> Perhaps something that is node-aware or LLC-aware would be better
> than CPU affinity, and I'm happy to discuss that and any other
> ways we think can improve NFS behavior on NUMA systems. It's quite
> true that RDMA transports are more sensitive to NUMA than
> traditional socket-based ones.

Also worth considering: RDMA is special here in that CPU caches, and
even main memory, can be bypassed entirely. For example, in the special
case where the NFS read/write payload memory is not main system memory
but is mapped from PCI, the kernel's own PCI_P2PDMA distance matrix
could be used for better xprt selection.

-- 
Dan Aloni


end of thread, other threads:[~2021-01-24 17:38 UTC | newest]

Thread overview: 8+ messages
-- links below jump to the message on this page --
2021-01-21 19:10 [PATCH v1 0/5] NFSv3 client RDMA multipath enhancements Dan Aloni
2021-01-21 19:10 ` [PATCH v1 1/5] sunrpc: Allow specifying a vector of IP addresses for nconnect Dan Aloni
2021-01-21 19:10 ` [PATCH v1 2/5] xprtrdma: Bind to a local address if requested Dan Aloni
2021-01-21 19:10 ` [PATCH v1 3/5] nfs: Extend nconnect with remoteports and localports mount params Dan Aloni
2021-01-21 19:10 ` [PATCH v1 4/5] sunrpc: Add srcaddr to xprt sysfs debug Dan Aloni
2021-01-21 19:10 ` [PATCH v1 5/5] nfs: Increase NFS_MAX_CONNECTIONS Dan Aloni
2021-01-21 19:50 ` [PATCH v1 0/5] NFSv3 client RDMA multipath enhancements Chuck Lever
2021-01-24 17:37   ` Dan Aloni
